Highway to Scale - Part 6

Data Science model inference is a crucial component of BeeHero's technology - and scaling inference along with the data posed quite a few challenges. Queues, batches, concurrency controls and code optimizations helped us deliver the expected performance.
Inbar Shani
Chief Software Architect
November 25, 2025

Scaling Data Science Model Inference

One of BeeHero’s core customer values is the ability to indicate the strength of a hive without sending a beekeeper to inspect it directly. For large-scale pollination and beekeeping this is crucial - you simply can’t inspect all the hives, as a pollination season requires hundreds of thousands of them. But even sampling these hives takes a toll on the bees (and the beekeepers), as opening a hive for inspection disrupts the bees’ activity.

We deliver the hive strength indicator through our groundbreaking prediction models, which process the IoT samples and predict the number of bee frames a hive holds, the number of brood frames, the likelihood of the hive collapsing and other key indicators.

The model inferences are executed on a schedule, and require preparing the payload for the inference and then executing the models in sequential order (as some inference results are used as input for other models).

This is where our ‘lambda chain’ comes in - our initial approach to running the inferences. It was composed of a lambda that triggered the process, then a second lambda that prepared the payload, a third lambda to map which inference models should run for each payload, then a succession of inference executions, and finally a lambda to save the results.

Each lambda in the chain ‘knew’ about the next lambda to run, and accumulated the results into the payload sent to the next lambda in the chain.
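
As a rough sketch (all function and field names here are hypothetical, not our actual code), a single step in the original chain looked roughly like this: do its part of the work, fold the result into the accumulated payload, and invoke the next lambda directly via the AWS SDK.

```python
import json

import boto3

lambda_client = boto3.client("lambda")

# Hypothetical name - in practice each step knew its next function.
NEXT_LAMBDA = "inference-chain-next-step"


def handler(event, context):
    # Fold this step's results into the payload accumulated along the chain.
    payload = dict(event)
    payload["sensor_payload"] = prepare_sensor_payload(payload["sensor_id"])

    # Each lambda 'knew' the next lambda and invoked it directly.
    lambda_client.invoke(
        FunctionName=NEXT_LAMBDA,
        Payload=json.dumps(payload).encode("utf-8"),
    )


def prepare_sensor_payload(sensor_id):
    # Placeholder for the real payload preparation logic.
    return {"sensor_id": sensor_id}
```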

We also supported ‘silent’ models - models whose inference runs and whose results are saved as logs, but not as customer-facing values. This allowed us to experiment with model development and to contrast and compare results between various models given the same input.

Original lambda chain executing model inference

Then there was scale…

This approach worked well while BeeHero established its technology, but when we scaled our operations and started accumulating millions of samples a day, our lambda chain started to struggle:

  • API overload - multiple lambdas in the chain were making API calls: most obviously the lambda collecting the payload for each sensor and the lambda saving the chain results, but each inference-execution lambda could also issue API calls to save ‘silent’ model results (which do not continue down the chain).
    While none of these API calls takes long for a single sensor, when thousands of lambda invocations issued API calls at the same time we started getting API timeouts - which led to retries and even more load on the API server.
  • Inefficient scaling - while lambdas can easily scale to tens of thousands of invocations, that isn’t necessarily efficient. Each lambda has a startup time, which AWS optimizes by reusing execution environments for subsequent invocations behind the scenes (i.e. a ‘warm start’). When we launch all of these invocations more or less at the same time, this optimization cannot be leveraged and every invocation is ‘cold’, resulting in longer processing times and additional cost.
  • Parameter Store throttling - an additional issue stemming from this was that our lambdas loaded configuration parameters from AWS Parameter Store, which limits API calls per second. When tens of thousands of lambdas launched at the same time, Parameter Store calls frequently failed, leaving model executions with missing or wrong configuration and producing unexpected behavior and results.

Cue in: queues

To gain better control and coordination, we first introduced AWS SQS queues into our ‘lambda chain’. Instead of triggering the lambdas in the chain directly, we added a queue in front of each lambda and triggered it with messages sent to that queue. Each lambda worked the same as before, but instead of invoking the next lambda with a synchronous API call, it sent the accumulated sensor payload to the next lambda’s queue.

Adding queues to the lambda chain steps
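
The same step after the change, again as a hypothetical sketch: the lambda is now triggered by its own queue, and instead of invoking the next lambda it sends the accumulated payload to the next step's queue (the queue URL is read from configuration here).

```python
import json
import os

import boto3

sqs = boto3.client("sqs")

# Hypothetical configuration - the URL of the next step's queue.
NEXT_QUEUE_URL = os.environ["NEXT_QUEUE_URL"]


def handler(event, context):
    # Records now arrive from this step's own SQS queue.
    for record in event["Records"]:
        payload = json.loads(record["body"])
        payload["sensor_payload"] = prepare_sensor_payload(payload["sensor_id"])

        # Forward the accumulated payload to the next step's queue
        # instead of invoking the next lambda directly.
        sqs.send_message(QueueUrl=NEXT_QUEUE_URL, MessageBody=json.dumps(payload))


def prepare_sensor_payload(sensor_id):
    return {"sensor_id": sensor_id}
```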

Using queues allowed us to streamline the API calls. First and foremost, SQS triggers support batching and concurrency controls, which let us decide how many lambdas are invoked simultaneously. Configuring the trigger’s concurrency setting limits the number of concurrent lambdas, which had two results: a cap on the number of concurrent API calls, and better usage of ‘warm start’. Configuring the batch setting allowed each lambda to handle sensors more efficiently, especially in calls to the API server (which also supported batches of sensors) and in sending messages to the next queue via the SQS API (which is much more efficient when messages are sent in batches).
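
These settings live on the SQS trigger (the event source mapping). A sketch of configuring them with boto3, with illustrative values rather than our production ones:

```python
import boto3

lambda_client = boto3.client("lambda")

# Illustrative ARN, function name and values - each step is tuned separately.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:inference-chain-step-queue",
    FunctionName="inference-chain-step",
    BatchSize=10,                              # messages delivered to each invocation
    MaximumBatchingWindowInSeconds=30,         # wait up to 30s to fill a batch
    ScalingConfig={"MaximumConcurrency": 20},  # cap on concurrent invocations for this queue
)
```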

The ‘warm start’ further reduced the number of concurrent AWS Parameter Store API calls: invocations on the same execution environment share global state, so we cached the parameters in that global state.
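
A minimal sketch of that caching, assuming the parameters are fetched from SSM Parameter Store: the module-level dictionary lives in the execution environment's global state, so warm invocations reuse it instead of calling Parameter Store again (parameter names are hypothetical).

```python
import boto3

ssm = boto3.client("ssm")

# Module-level state survives across invocations on a warm execution
# environment, so Parameter Store is called once per environment.
_PARAM_CACHE = {}


def get_parameter(name):
    if name not in _PARAM_CACHE:
        response = ssm.get_parameter(Name=name, WithDecryption=True)
        _PARAM_CACHE[name] = response["Parameter"]["Value"]
    return _PARAM_CACHE[name]


def handler(event, context):
    # Hypothetical parameter name, for illustration only.
    model_version = get_parameter("/inference/model-version")
    ...
```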

Next, saving the ‘silent’ model results was consolidated into a single lambda that engaged with the API server - the model execution lambdas sent messages to this lambda’s queue instead of invoking the API server directly. The ‘saving’ lambda’s concurrency and batch settings allowed for efficient API calls with minimal timeouts. Removing the dependency on the API server from the model execution lambdas also simplified their execution and made it more reliable at scale.

Consolidating API calls into dedicated lambdas with their own queues
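
A sketch of what the consolidated ‘saving’ lambda could look like, assuming the API server exposes a batch endpoint (the endpoint and payload shape are hypothetical): it collects the whole SQS batch and saves it with a single API call instead of one call per result.

```python
import json
import os
import urllib.request

# Hypothetical batch endpoint on the API server.
RESULTS_API_URL = os.environ["RESULTS_API_URL"]


def handler(event, context):
    # Collect all the model results delivered in this SQS batch...
    results = [json.loads(record["body"]) for record in event["Records"]]

    # ...and save them with one batched API call instead of a call per result.
    request = urllib.request.Request(
        RESULTS_API_URL,
        data=json.dumps({"results": results}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        response.read()
```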

Finally, retries and error handling were much improved. SQS message failures can be managed with partial batch responses, so you don’t need to re-run the whole batch - only the failed messages are returned to the queue for retry. You can set the retry limit as usual, and when a message hits the maximum number of retries it goes to a dead-letter queue (DLQ), which we monitor to alert us when a message didn’t make it through the chain - we can then investigate the issue, fix it if needed, and manually redrive the DLQ messages to try again.
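
For partial batch responses, the handler reports the IDs of the messages that failed and the SQS trigger is configured with the ReportBatchItemFailures response type; a minimal sketch:

```python
import json


def handler(event, context):
    failed = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only the failed messages are returned to the queue for retry;
            # the rest of the batch counts as successfully processed.
            failed.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failed}


def process(payload):
    # Placeholder for the real per-message processing.
    ...
```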

When the line is busy, you need smarter serving

‘Model serving’ is the process of deploying and invoking a trained data science model for customer-facing inferences. Our ‘lambda chain’ started off with a very simple serving strategy - we ran all the models for all the sensors. But as we scaled and introduced more variance in our data and customers, we needed to be more agile - for example, some models were trained for a specific geography or a specific seasonality, and should only be invoked for sensors matching those conditions.

To support these emerging requirements, we introduced a mapping mechanism where our data science team could map inference models to sensor qualifiers. We debated different options for a standard way to define these qualifiers: JSON-based custom definitions, Python code and SQL. We decided to go with SQL ‘WHERE’ clauses, as this tied in neatly with how we kicked off the ‘lambda chain’ - with a query on a DB view that consolidates the list of sensors qualifying for inference. The mapped conditions could simply be applied as part of that query to produce a list of sensors for each inference model.

Adding model mapping configuration via an S3 file
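
A sketch of how such a mapping might be loaded and applied, assuming a JSON file in S3 that maps each model to a SQL ‘WHERE’ clause (bucket, key, model names and the view name are all illustrative): each clause is appended to the query over the sensors view to produce the list of qualifying sensors per model.

```python
import json

import boto3

s3 = boto3.client("s3")


def load_model_mapping(bucket, key):
    # The mapping file associates each model with a SQL 'WHERE' clause, e.g.:
    # {"hive_strength_v2": "region = 'US-West' AND season = 'almonds'"}
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)


def build_sensor_queries(mapping, view="sensors_ready_for_inference"):
    # Apply each model's clause to the view that lists sensors qualifying
    # for inference, producing one sensor list per model.
    return {
        model: f"SELECT sensor_id FROM {view} WHERE {where_clause}"
        for model, where_clause in mapping.items()
    }
```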

Once we had this mapping in place, we could apply it early in our lambda chain - in fact, right at the start. By loading all the inference mappings when the chain starts, we could calculate in advance which inference models should run for each sensor and include that list in the sensor payload, so that subsequent model execution lambdas could simply pull this information and invoke the right inferences, without wasting additional time on loading the mapping and figuring out which models the sensor maps to. Since the model execution lambdas are the ones that scale out in proportion to the growing number of sensors, every optimization of their run time was beneficial.

Optimizing model mapping in the initial step of the chain
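
Continuing the same hypothetical names, the early-mapping step can invert the per-model sensor lists into a per-sensor model list and attach it to each sensor's payload, so the model execution lambdas only read it:

```python
from collections import defaultdict


def attach_models_to_payloads(sensors_per_model, sensor_payloads):
    # Invert {model: [sensor_ids]} into {sensor_id: [models]} once, at chain start.
    models_per_sensor = defaultdict(list)
    for model, sensor_ids in sensors_per_model.items():
        for sensor_id in sensor_ids:
            models_per_sensor[sensor_id].append(model)

    # Downstream model execution lambdas just read payload["models"] and run
    # the listed inferences, without reloading the mapping.
    for payload in sensor_payloads:
        payload["models"] = models_per_sensor.get(payload["sensor_id"], [])
    return sensor_payloads
```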

So far and onwards

Queues, batches, concurrency and smart serving supported our pollination seasons nicely, and streamlined the execution of our lambda chain. But there was still much to be desired in terms of our end-to-end MLOps - for example, our data science team started working on a new model that required lengthy feature calculations. We started to pay more attention to how we scale our data science processes - but we will get back to that in a future post…