Highway to Scale - Part 3

We focused on our IoT devices and server communication, improving our R&D processes and researching common scale issues
Inbar Shani
Chief Software Architect
July 2, 2025

Scaling the IoT traffic

As we discussed in our previous posts, the main challenge in terms of scale at BeeHero was not user traffic, but IoT traffic. During the almond pollination season of 2022 we deployed over 75K sensors, and we expected that number to at least triple over the next couple of years. Each sensor collected a sample every 10 minutes, and these samples were sent in a batch to the cloud server every 4 hours or so.

With the expected growth, we would soon surpass 10 million samples a day. We needed our devices and servers to be resilient enough to handle this load, and issues were already starting to pop up. So we looked at both the device and the server for improvements.

Scaling the IoT device

Our in-hive sensors and communication gateways are both proprietary combinations of hardware and software - which allowed us to examine different ways to make the devices more scalable.
A basic trait of scalable systems is the ability to continue functioning when faults occur - in other words, fault tolerance. With IoT devices, faults can stem from the device hardware and software, from communication with the system’s servers, or from the servers’ software.

In early 2022 we already had a retry mechanism on the gateway devices. The gateway would cache sensor samples over a period of time - usually 4 hours - and then send them to the server in a batch. If the server communication failed, the gateway would go into a short sleep period and then try again. If it failed again, it would ‘give up’ on this ‘wake up’ cycle and retry the batch (along with newer samples) at the next ‘wake up’ time. If the local cache of samples filled up during these failures, the gateway would overwrite the oldest samples, resulting in some data loss.
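
The actual firmware is embedded code, but the cycle above boils down to a fairly simple loop. Here is a minimal Python sketch of the idea; the names (MAX_CACHED_SAMPLES, send_batch) and values are illustrative, not our real firmware:

```python
# Illustrative sketch of the gateway's batch-and-retry 'wake up' cycle.
import time
from collections import deque

MAX_CACHED_SAMPLES = 5000           # local cache capacity (example value)
RETRY_SLEEP_SECONDS = 60            # short sleep before the single in-cycle retry
cache = deque(maxlen=MAX_CACHED_SAMPLES)  # when full, the oldest samples are dropped

def wake_up_cycle(new_samples, send_batch):
    """Run one 'wake up' cycle: batch cached + new samples and try to upload them."""
    cache.extend(new_samples)
    batch = list(cache)

    for attempt in range(2):          # one try, then one retry after a short sleep
        if send_batch(batch):         # send_batch returns True on a successful upload
            cache.clear()             # uploaded - drop the local copy
            return True
        if attempt == 0:
            time.sleep(RETRY_SLEEP_SECONDS)

    # Give up for this cycle; samples stay cached and are retried at the next wake-up.
    # If the cache fills up in the meantime, deque(maxlen=...) silently discards the
    # oldest samples - which is exactly the data-loss mode described above.
    return False
```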

To reduce the occurrences of communication failures, we started to research their causes. With IoT devices this is quite a challenge - simulating the communication conditions of devices deployed in wildly varying environments is difficult. In addition, we had various combinations of hardware and software versions deployed, and hardware degradation also had to be taken into account.

While the majority of our devices were deployed in the US, we used local Israeli deployments for detailed research. Our main lab was in our office, and we deployed additional ‘labs’ in field conditions with local beekeepers. An IoT lab allowed us to connect to the device hardware and software directly, measure and debug its performance, and simulate extreme conditions as needed.

Device challenges

We identified several issues when reviewing the device logs. For example, we found that the initial stage of opening an SSL connection from the modem to the cellular network was frequently failing, especially during peak network hours (more failures during the day than at night). Once the SSL connection failed, the device would immediately go to sleep until the next ‘wake up’ cycle. This would sometimes cascade into additional issues, as the device would batch more data, resulting in ‘heavier’ traffic payloads, more failures and even loss of data.

Gateway firmware log, demonstrating an SSL connection failure and subsequent shutdown


To handle this issue, we introduced a retry mechanism for the SSL connection initialization (limited to 10 attempts) - and saw that once a connection was established, it remained stable. This mechanism dropped connectivity failures from 17% of ‘wake up’ cycles to under 1% of cycles.

Gateway firmware log, demonstrating SSL retries
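
In pseudocode terms, the fix is a bounded retry loop around the handshake. The sketch below is Python-flavored and the modem_open_ssl call and backoff value are hypothetical stand-ins for the real modem API:

```python
# Illustrative sketch of the bounded SSL-open retry added to the firmware.
import time

MAX_SSL_RETRIES = 10

def open_ssl_connection(modem_open_ssl, backoff_seconds=5):
    """Retry the SSL handshake up to MAX_SSL_RETRIES times before giving up."""
    for attempt in range(1, MAX_SSL_RETRIES + 1):
        if modem_open_ssl():          # returns True once the handshake succeeds
            return True               # once established, the connection stays stable
        time.sleep(backoff_seconds)   # brief pause before retrying, e.g. during peak hours
    return False                      # give up; the normal wake-up retry takes over
```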


Another example of a device issue involved GPS commands - some of our older device modem versions didn’t support a specific set of GPS commands, which led to an infinite loop on the device, as the GPS state was never updated. Identifying this issue allowed us to resolve it by always setting the GPS state to ‘done’, with a separate error flag to indicate and handle failures.
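
The essence of the fix is that the GPS step always resolves its state, and failures travel in a separate error field. A hypothetical sketch (send_gps_command and the field names are illustrative):

```python
# Hypothetical sketch: the GPS step always ends in 'done', and unsupported
# commands are reported via an error flag instead of leaving the loop waiting forever.
def read_gps(send_gps_command):
    result = {"state": "pending", "error": None, "position": None}
    try:
        result["position"] = send_gps_command()   # may fail on older modem firmware
    except Exception as exc:                      # e.g. unsupported command set
        result["error"] = str(exc)
    finally:
        result["state"] = "done"                  # never leave the GPS state unresolved
    return result
```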

Releasing these fixes as part of a new firmware version pushed us to improve our version management - and by extension, our IoT team’s velocity and ability to quickly improve our devices. Up until this point, new versions were mostly applied to new hardware, as that was easier to test and verify. This led to multiple combinations of hardware and software versions being deployed in our beekeepers’ hives, and contributed to the complexity of analyzing issues and deploying fixes.

We worked to improve the stability of the remote firmware update process, both on the device and on the server side, and augmented it with a verification process in our office lab that tested new firmware versions against multiple hardware combinations. This allowed us to deploy new firmware versions across the board in the field and get most of the devices onto the latest version.


The IoT server

While one team was working on scaling our devices, another team was hard at work scaling our IoT server. Diagnosing the issues was easier on this end, and we quickly determined that two main issues led to HTTP request timeouts:


HTTP server scaling - the AWS Elastic Beanstalk service deploys a load balancer and manages a group of EC2 instances that the load balancer routes requests to. Each instance’s capacity is determined by a combination of the number of worker threads of the HTTP server on the instance (nginx in our case) and the number of threads the application allows (gunicorn and Flask in our case). The Beanstalk scaling group can be configured to scale (i.e. add more EC2 instances) based on various types of metrics - response time, CPU, memory and many more.
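
As a rough illustration, the application side of that per-instance capacity is bounded by the gunicorn worker/thread settings - something like the following gunicorn.conf.py (the values here are examples, not our production configuration):

```python
# gunicorn.conf.py - illustrative settings that bound how many concurrent
# device requests a single EC2 instance can serve behind nginx.
import multiprocessing

bind = "0.0.0.0:8000"                          # nginx proxies device requests to this port
worker_class = "gthread"                       # threaded workers suit I/O-bound handlers
workers = multiprocessing.cpu_count() * 2 + 1  # processes per instance
threads = 4                                    # concurrent requests per worker
timeout = 60                                   # seconds before gunicorn kills a hung request
```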

To improve our scaling, we started off by determining which metric was best to scale on. We examined the Beanstalk monitors for the various metrics. Handling the IoT device requests was neither computationally complicated nor memory-heavy - so the network metric was the one most aligned with the load trend, and we set our scaling trigger on NetworkIn bytes. We also set the scale-up action to launch multiple instances (e.g. 4 at a time) to compensate for the time it takes to launch an EC2 instance; this way, capacity grew considerably when a load peak started, and descended more slowly (1 instance at a time).
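
One way to express such a trigger is through the Elastic Beanstalk option settings, for example via boto3. The sketch below assumes a hypothetical environment name, and the thresholds are placeholders rather than our actual numbers:

```python
# Sketch: configure the Beanstalk auto-scaling trigger on NetworkIn bytes,
# scaling up by 4 instances on a breach and down by 1 at a time.
import boto3

eb = boto3.client("elasticbeanstalk")
TRIGGER = "aws:autoscaling:trigger"

eb.update_environment(
    EnvironmentName="iot-ingest-prod",   # hypothetical environment name
    OptionSettings=[
        {"Namespace": TRIGGER, "OptionName": "MeasureName", "Value": "NetworkIn"},
        {"Namespace": TRIGGER, "OptionName": "Statistic", "Value": "Average"},
        {"Namespace": TRIGGER, "OptionName": "Unit", "Value": "Bytes"},
        # Scale up aggressively when inbound traffic spikes...
        {"Namespace": TRIGGER, "OptionName": "UpperThreshold", "Value": "6000000"},
        {"Namespace": TRIGGER, "OptionName": "UpperBreachScaleIncrement", "Value": "4"},
        # ...and scale down one instance at a time when it subsides.
        {"Namespace": TRIGGER, "OptionName": "LowerThreshold", "Value": "2000000"},
        {"Namespace": TRIGGER, "OptionName": "LowerBreachScaleIncrement", "Value": "-1"},
    ],
)
```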


Prolonged processing of requests - some of the IoT device requests were synchronous in nature (i.e. the device expected a meaningful response), while others were asynchronous (the device didn’t care about the response beyond knowing it succeeded). The synchronous requests required some processing, but the async ones should have returned as soon as we could guarantee the device payload would not be lost - however, our code was first processing the payload and only then sending the response to the device.

To improve our response time, we changed our code flow handling async device requests. We minimized the flow to do one thing: store the payload by sending it to an AWS SQS queue or by saving it as an S3 file (depending on the payload type). Once the payload was saved, we returned a 200 OK status code with an empty body to the device, which could then continue its wake-up cycle.
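
A minimal sketch of that “store first, process later” handler is below. The route, header, queue URL and bucket names are placeholders, and the real payload handling is more involved:

```python
# Sketch: persist the async device payload (SQS or S3), then acknowledge immediately.
import uuid

import boto3
from flask import Flask, request

app = Flask(__name__)
sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/device-samples"  # placeholder
BUCKET = "device-payloads"                                                      # placeholder

@app.route("/samples", methods=["POST"])
def ingest_samples():
    payload = request.get_data()

    # Route by payload type (the header name here is hypothetical): sample batches
    # go to an SQS queue, other payload types are persisted as S3 objects.
    if request.headers.get("X-Payload-Type", "samples") == "samples":
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=payload.decode("utf-8"))
    else:
        s3.put_object(Bucket=BUCKET, Key=f"incoming/{uuid.uuid4()}", Body=payload)

    # The payload is now durable - respond right away so the device can go back to sleep.
    return "", 200
```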

Conclusion

Our first milestone was improving the IoT device cycle and the IoT server to support a higher volume of processing:

* IoT device - create a ‘lab’ infrastructure that can monitor device logs and network, and recreate failure conditions to identify root causes

* IoT device fleet - work to get device hardware and software to the latest versions as much as possible - supporting multiple combinations of hardware and software versions in the field compounds IoT issues

* IoT server - auto-scale the AWS Beanstalk based on monitored metrics; ramp-up quickly and by a meaningful amount, ramp-down slowly

* IoT server - reduce processing-before-saving to a minimum; respond to the device as fast as possible; separate processing into async components


Next - scaling the raw data processing, but that’s for another post…