Highway to Scale - Part 2

BeeHero's journey to scale our solution and meet customer expectations. Part two: at the beginning there were many sensors but only one server
Inbar Shani
Chief Software Architect
May 12, 2025

Our journey starts in early 2022. BeeHero had just completed another pollination season of California almonds, with nearly 50K monitored hives and ~2 million samples a day. Everything worked, but concerning signs were showing up: at times, our server was overloaded with HTTP traffic, leading to failed responses for both the IoT devices and the customer-facing applications.

At the time, our cloud platform looked like this:

Our platform was built on the AWS cloud. Sensor data was sent over HTTP to an AWS Elastic Beanstalk environment, which hosted our Python Flask application. Our code saved the raw data to an S3 bucket, and the file path was then saved to an RDS Postgres table tracking which files had been handled. It then triggered a Python thread to process the raw data and produce a sample, which was saved to a different RDS table.
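To make that flow concrete, here is a minimal sketch of what such an ingest endpoint could look like, assuming Flask, boto3 for S3 and psycopg2 for RDS Postgres. The bucket, table and header names, the DSN and the function names are illustrative, not our actual identifiers.

```python
# A minimal sketch of the ingest flow described above - illustrative names only.
import threading

import boto3
import psycopg2
from flask import Flask, request

app = Flask(__name__)
s3 = boto3.client("s3")


def process_raw_file(s3_key):
    # Hypothetical placeholder: parse the raw payload, produce a sample
    # and save it to the samples table.
    ...


@app.route("/samples", methods=["POST"])
def ingest():
    raw = request.get_data()
    s3_key = f"raw/{request.headers['X-Device-Id']}/{request.headers['X-Timestamp']}"
    s3.put_object(Bucket="raw-sensor-data", Key=s3_key, Body=raw)

    # Track the uploaded file in Postgres, so we know what still needs handling.
    with psycopg2.connect("postgresql://user:password@db-host/dbname") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO raw_files (s3_key, status) VALUES (%s, 'pending')",
                (s3_key,),
            )

    # Processing runs on a plain thread, outside the request context --
    # exactly the part that later turned out to be untrackable.
    threading.Thread(target=process_raw_file, args=(s3_key,)).start()
    return "", 204
```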

What already worked

We already had several techniques and mechanisms in place to handle load in our system, for example:

  • We had an auto scaling group on our Beanstalk, which allowed it to scale up to handle the influx of IoT data
  • We ‘spread’ the IoT devices’ ‘wake-up’ times across the wake-up window by adding (or subtracting) a random timespan from each device’s next wake-up time (see the sketch after this list)
  • We optimized our Postgres table indices so samples could be searched and handled efficiently by their timestamp and upload time
  • For longer processing, we ran ETLs as AWS Lambdas, aggregating and summarizing data insights from the enriched samples
  • We used a read-replica setup for our production Postgres. Read replicas are supported natively by both RDS and Postgres (Postgres leverages the more robust physical replication for this, as opposed to logical replication), letting us route heavy read-only queries to the replica and offload some of the load from the main instance.
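As an illustration of the wake-up spreading mentioned in the list above, here is a minimal sketch of the jitter calculation. The interval and window lengths are made up, and the real devices compute this in firmware rather than in Python.

```python
import random

# Illustrative numbers - the real interval and window depend on the device firmware.
BASE_INTERVAL_SECONDS = 60 * 60    # nominal time between device wake-ups
WAKE_WINDOW_SECONDS = 15 * 60      # length of the wake-up window


def next_wakeup_delay():
    # Add or subtract a random timespan so devices don't all wake up at the
    # same moment, spreading uploads across the whole window.
    jitter = random.uniform(-WAKE_WINDOW_SECONDS / 2, WAKE_WINDOW_SECONDS / 2)
    return BASE_INTERVAL_SECONDS + jitter
```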

What didn’t

Still, there were a few immediate issues with this flow once sample volume started scaling up during the pollination season of February 2022:

  • The Beanstalk scaling group was not set to add enough instances during peak hours. Beanstalk scaling is also a bit slow: it takes a few minutes to launch a new EC2 instance and register it with the Beanstalk’s load balancer. This led to some requests simply timing out. On the other hand, keeping the Beanstalk at a high instance count at all times was not cost-effective.
  • The timeout issues were further exacerbated by the code flow that handled raw data. It performed several minor data transformations, saved the data to an S3 bucket, and then recorded the file path in an RDS Postgres table. The resulting flow could take a few seconds or even minutes, depending mostly on the RDS query, and could cause the request to time out.
  • The sample-processing threads were a different kind of issue: they were untrackable. They ran on the Beanstalk’s EC2 instances, but outside of the request context, so the Beanstalk could remove instances as part of a scale-down operation without realizing they were still processing samples. The result was that samples were occasionally lost.

Some other issues were less obvious to predict or detect:

  • The Beanstalk received API requests from both the IoT devices and the customer-facing applications, so when it was overloaded it could fail customer-facing requests just as easily as sensor requests, leading to unexpected errors in our web applications. At the same time, internal applications could add load on the API by invoking code with non-optimized queries.
  • When an IoT device failed to upload samples, it would go into a short sleep cycle and then retry. This led to cascading peak loads: the server would become overloaded, devices would get failed responses, sleep briefly, and wake up again just as the server was finally recovering, overloading it once more.
  • The code flow handling raw data and the code flow producing samples both used the Postgres table that tracked raw data files, essentially checking and updating what had been handled and what still needed handling. This occasionally led to race conditions and blocking, with one process trying to insert a new raw file while the other was updating previous files to set their status to ‘handled’ (see the sketch below).
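To make the tracking-table contention concrete, here is a minimal sketch of the two flows touching the same table. The table, column and function names are illustrative, and process_raw_file stands in for the actual sample-processing step.

```python
# A minimal sketch of the tracking-table contention - illustrative names only.
def process_raw_file(s3_key):
    ...  # hypothetical placeholder for the sample-processing step


def register_raw_file(conn, s3_key):
    # Flow A: the ingest path inserts a row for a newly uploaded raw file.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO raw_files (s3_key, status) VALUES (%s, 'pending')",
            (s3_key,),
        )
    conn.commit()


def handle_pending_files(conn):
    # Flow B: the processing path scans for pending files and marks them handled.
    with conn.cursor() as cur:
        cur.execute("SELECT s3_key FROM raw_files WHERE status = 'pending'")
        for (s3_key,) in cur.fetchall():
            process_raw_file(s3_key)
            cur.execute(
                "UPDATE raw_files SET status = 'handled' WHERE s3_key = %s",
                (s3_key,),
            )
    conn.commit()

# With both flows hitting the same table concurrently, the INSERTs and UPDATEs
# can block each other and occasionally race on the same rows and indexes.
```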

These issues were inherent to the architecture we had at the time, and refactoring it would take time. But it was also urgent to handle the current scale immediately. So we did both: we put immediate measures in place to relieve the pressure, and we started working on a longer-term plan. I’ll discuss the long term in the next post, and expand here on the immediate actions.

Low-hanging fruit: immediate steps to increase scale

More beanstalks for the win

First, we duplicated the Beanstalk deployment and assigned different responsibilities to each copy. This allowed us to break the dependency between the customer-facing applications’ API and the IoT processing API without having to refactor and split the code base itself (which would take much longer). It also allowed us to configure different scaling groups for each Beanstalk, since the load profile of the customer-facing apps was very different from that of the IoT device APIs. Finally, we added schedule-based scaling to the IoT API Beanstalk, preparing it for the IoT devices’ wake-up window in advance by switching to a higher minimum number of instances, and switching back to a lower number after the window to save costs.
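As an illustration of the schedule-based scaling, here is a sketch using boto3 and Elastic Beanstalk’s aws:autoscaling:scheduledaction options. The environment name, action names, instance counts and cron expressions are placeholders, not our actual configuration.

```python
# A sketch of schedule-based scaling for the IoT Beanstalk - placeholder values.
import boto3

eb = boto3.client("elasticbeanstalk")


def scheduled_action(name, option, value):
    # Each scheduled action is expressed as option settings that share a ResourceName.
    return {
        "Namespace": "aws:autoscaling:scheduledaction",
        "ResourceName": name,
        "OptionName": option,
        "Value": value,
    }


eb.update_environment(
    EnvironmentName="iot-api-prod",
    OptionSettings=[
        # Raise the instance floor shortly before the devices' wake-up window (times in UTC).
        scheduled_action("PreWakeWindow", "MinSize", "8"),
        scheduled_action("PreWakeWindow", "MaxSize", "16"),
        scheduled_action("PreWakeWindow", "Recurrence", "45 3 * * *"),
        # Scale back down after the window to save costs.
        scheduled_action("PostWakeWindow", "MinSize", "2"),
        scheduled_action("PostWakeWindow", "MaxSize", "4"),
        scheduled_action("PostWakeWindow", "Recurrence", "30 5 * * *"),
    ],
)
```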

Offload queries onto Redshift

Second, we introduced an AWS Redshift cluster for heavy-load queries. This cluster would serve internal tools and manual R&D and Data Science queries, replacing their execution on our production Postgres DB and thereby offloading heavy queries from it. To populate the Redshift cluster, we used AWS DMS replication tasks to replicate data from Postgres to Redshift. I will expand on these DMS tasks in a later post, as they introduced both benefits and challenges to our system.
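For illustration, a DMS full-load-and-CDC task replicating a Postgres schema into Redshift could be defined roughly like this with boto3. The ARNs, task name and schema selection are placeholders, and the real task settings depend on the endpoints and tables involved.

```python
# A rough sketch of creating a DMS replication task - placeholder ARNs and names.
import json

import boto3

dms = boto3.client("dms")

# Select every table in the 'public' schema for replication.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-public-schema",
            "object-locator": {"schema-name": "public", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="postgres-to-redshift",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:source-postgres",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:target-redshift",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:replication-instance",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
```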

As a side note, at this time we expanded our use of DNS records (defined in AWS Route53), using custom subdomains in place of AWS-generated ones, for example to access the DB. This gave us an emergency measure if needed: launching a new production DB from a snapshot and switching all traffic to it when the main DB requires downtime for maintenance or an upgrade, then switching back to the upgraded DB afterwards. Obviously, when using this practice one should minimize (or even completely stop) insert and update queries, so the data gap between the stand-by DB instance and the main instance stays minimal.
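The DNS indirection boils down to a record upsert. A rough sketch with boto3 and Route53 might look like this, with the hosted zone ID, subdomain and RDS endpoint as placeholders.

```python
# A sketch of switching a custom DB subdomain between RDS instances - placeholder values.
import boto3

route53 = boto3.client("route53")


def point_db_subdomain_at(rds_endpoint):
    # Upsert a CNAME so applications keep using the same hostname while the
    # underlying RDS instance changes.
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",
        ChangeBatch={
            "Comment": "Switch db subdomain to a different RDS instance",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "db.internal.example.com",
                        "Type": "CNAME",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": rds_endpoint}],
                    },
                }
            ],
        },
    )


# e.g. switch to a freshly restored snapshot instance, then back after maintenance:
# point_db_subdomain_at("prod-restore.abc123.us-east-1.rds.amazonaws.com")
```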

Increase R&D velocity for our major refactors

A third immediate step was to create and maintain a test environment. While not directly related to the scale issues, the test environment would allow us to refactor our architecture faster and with better quality assurance, and would improve our R&D processes.

To summarize, this was our architecture after these steps, heading into our next major step - refactoring the IoT sample flow: