Our journey begins in early 2022. BeeHero had just completed another pollination season of California almonds, with nearly 50K monitored hives and ~2 million samples a day. Everything worked, but concerning signs were showing up: at times, our server was overloaded with HTTP traffic, leading to failed responses for both IoT devices and customer-facing applications.
At the time, our cloud platform looked like this:
Our platform was built on AWS. Sensor data was sent over HTTP to an AWS Elastic Beanstalk server, which hosted our Python Flask application. Our code saved the raw data to an S3 bucket, and the file path was then saved to an RDS Postgres table that tracked which files had been handled. It then triggered a Python thread to process the raw data and produce a sample, which was saved to a different RDS table.
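To make this flow concrete, here is a minimal sketch of what such a Flask endpoint might look like. The route, bucket, table names and connection string are illustrative placeholders, not our actual code:

```python
# A minimal sketch of the ingestion flow described above (not our actual code);
# the route, bucket, table names and DSN are illustrative placeholders.
import threading
import uuid

import boto3
import psycopg2
from flask import Flask, request

app = Flask(__name__)
s3 = boto3.client("s3")
DB_DSN = "postgresql://user:password@prod-db-host:5432/sensors"  # placeholder

def process_raw_data(file_key: str) -> None:
    """Parse the raw payload and save the processed sample to a second RDS table."""
    with psycopg2.connect(DB_DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO processed_samples (raw_file_key) VALUES (%s)", (file_key,)
        )

@app.route("/samples", methods=["POST"])
def ingest_sample():
    file_key = f"raw/{uuid.uuid4()}.json"
    # 1. Save the raw payload to S3.
    s3.put_object(Bucket="sensor-raw-data", Key=file_key, Body=request.data)
    # 2. Track the file path in the RDS Postgres 'handled files' table.
    with psycopg2.connect(DB_DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO handled_files (file_key, status) VALUES (%s, 'received')",
            (file_key,),
        )
    # 3. Process the raw data in a background thread inside the same web process.
    threading.Thread(target=process_raw_data, args=(file_key,)).start()
    return {"status": "ok"}, 200
```

Note that both persistence and processing happen inside the web process that serves the HTTP request - a detail that matters for the issues described next.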
We already had several techniques and mechanisms in place to handle load in our system, for example:
Still, a few immediate issues with this flow surfaced once sample volume started scaling up during the pollination season of February 2022:
Some other issues were less obvious to predict or detect:
These issues were inherent to the architecture we had at the time, and refactoring it would take time. But handling the current scale was urgent. So we did both - we put immediate measures in place to alleviate the pressure, and we started working on a longer-term plan. I’ll discuss the long-term plan in the next post, and expand here on the immediate actions.
First, we duplicated the Beanstalk deployment and assigned different responsibilities to each copy. This allowed us to break the dependency between the customer-facing applications’ API and the IoT processing API without refactoring and splitting the code base itself (which would take much longer). It also allowed us to configure different scaling groups for each Beanstalk, as the load requirements of the customer-facing apps were very different from those of the IoT device APIs. Finally, we added schedule-based scaling to the IoT APIs Beanstalk, preparing it for the IoT devices’ ‘wake up’ window in advance by switching to a scaling group with a higher minimum number of instances - and switching back to a lower number after the window to save costs.
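For illustration, here is a sketch of how such schedule-based scaling can be expressed against the Auto Scaling group behind the IoT Beanstalk (Elastic Beanstalk also exposes this through its own scheduled-action configuration). The group name, times and instance counts are assumptions, not our actual values:

```python
# Sketch of schedule-based scaling for the IoT Beanstalk's Auto Scaling group.
# Group name, schedules and instance counts are illustrative assumptions.
import boto3

autoscaling = boto3.client("autoscaling")

# Scale up shortly before the IoT devices' daily 'wake up' window.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="iot-api-beanstalk-asg",
    ScheduledActionName="scale-up-before-wakeup",
    Recurrence="45 4 * * *",  # cron (UTC): every day at 04:45
    MinSize=8,
    MaxSize=20,
)

# Scale back down after the window to save costs.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="iot-api-beanstalk-asg",
    ScheduledActionName="scale-down-after-wakeup",
    Recurrence="0 7 * * *",  # cron (UTC): every day at 07:00
    MinSize=2,
    MaxSize=20,
)
```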
Second, we introduced an AWS Redshift cluster for heavy-load queries. This cluster would be used by internal tools and by manual R&D and Data Science queries, instead of running those queries on our production Postgres DB - offloading the heavy queries from production. To populate the Redshift cluster, we used AWS DMS replication tasks to migrate data from Postgres to Redshift. I will expand more on these DMS tasks in a later post, as they introduced both benefits and challenges to our system.
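As a rough sketch, this is what creating such a DMS replication task looks like with boto3; the ARNs and the table-selection rule are placeholders, not our actual configuration:

```python
# Sketch of a DMS task replicating Postgres tables into Redshift.
# All ARNs and the table-selection rule below are placeholders.
import json

import boto3

dms = boto3.client("dms")

table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "replicate-samples",
            "object-locator": {"schema-name": "public", "table-name": "samples"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="postgres-to-redshift-samples",
    SourceEndpointArn="arn:aws:dms:...:endpoint:postgres-source",        # placeholder
    TargetEndpointArn="arn:aws:dms:...:endpoint:redshift-target",        # placeholder
    ReplicationInstanceArn="arn:aws:dms:...:rep:replication-instance",   # placeholder
    MigrationType="full-load-and-cdc",  # initial load + ongoing change data capture
    TableMappings=json.dumps(table_mappings),
)
```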
As a side note, at this time we also expanded our use of DNS records (defined in AWS Route53), using custom subdomains to replace AWS-generated hostnames, for example for access to the DB. This would allow us to use an emergency measure if needed: launching a new production DB from a snapshot and switching all traffic to it whenever we needed DB downtime for maintenance or an upgrade, later switching back to the upgraded DB. Obviously, when using this practice one should minimize (or even completely stop) insert and update queries, so that there is only a minimal data gap between the stand-by DB instance and the main instance.
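A sketch of that switch, assuming a short-TTL CNAME managed in Route53 (the hosted zone ID, domain and RDS hostnames are placeholders):

```python
# Sketch: point a custom DB subdomain at a stand-by instance restored from a
# snapshot, then back. Hosted zone ID, domain and hostnames are placeholders.
import boto3

route53 = boto3.client("route53")

def point_db_subdomain_at(rds_endpoint_hostname: str) -> None:
    # UPSERT the CNAME so application config (db.example.com) never changes;
    # only the record's target does.
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",  # placeholder hosted zone
        ChangeBatch={
            "Comment": "Switch DB traffic during maintenance",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "db.example.com",
                        "Type": "CNAME",
                        "TTL": 60,  # short TTL so the switch propagates quickly
                        "ResourceRecords": [{"Value": rds_endpoint_hostname}],
                    },
                }
            ],
        },
    )

# Switch to the stand-by instance restored from a snapshot...
point_db_subdomain_at("standby-db.xxxxxxxx.us-east-1.rds.amazonaws.com")
# ...and back to the main instance after maintenance.
point_db_subdomain_at("prod-db.xxxxxxxx.us-east-1.rds.amazonaws.com")
```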
A third immediate step was to create and maintain a test environment. While not directly related to the scale issues, the test environment would allow us to refactor our architecture faster and with greater quality assurance, and would improve our R&D processes.
To summarize, this was our architecture after these steps, and heading into our next major step - refactoring the IoT samples flow: