From November 15th to 23rd, the Helium Network experienced a number of disruptions for Hotspots and devices running on the Network. The goal of this report is to inform the community of the events that took place, why they occurred, and what has been done to address the issues and ultimately improve the Helium Network so it can continue to scale for the long term.
Throughout the disruptions, there was no risk to HNT holders, and while some devices were unable to use the Network for data transfer, the majority of devices were unaffected. For a full technical overview of the events, make sure to read the postmortem published in the Engineering Blog.
The Helium Blockchain had been running without serious interruption for over two years and more than 1 million blocks. On November 15th at approximately 20:00 UTC, the blockchain came to a full halt due to an extremely large block, 145 MB in size (versus a typical block size of under 1 MB). Validators could not process this block, so the community and core developers began identifying the root cause and applying fixes to get the blockchain running smoothly again.
The root cause was a bug in the Router codebase (Router is the Network server that receives data transfer packets). Specifically, the bug manifested in the community-hosted Discovery Mode Router instance, which propagated many State Channel dispute transactions. Validators processed these transactions and tried to form the large block that halted the chain. All blockchain activity remained halted until a fix was implemented on November 17th at 02:00 UTC, returning the Network to more normal operations. A new Hotspot firmware release and a new Validator release were pushed during the subsequent 24 hours to fix the Router transaction manager and improve block synchronization and handling.
Following the first disruption, a second blockchain halt occurred on November 20th at 18:05 UTC, caused by another large block resulting from another flood of dispute transactions. Even though some optimizations had been made after the prior outage, the block was still too large (approximately 20 MB) for Validators and Hotspots to consume. This halt lasted approximately 12 hours, during which a new Validator version was released, a Rescue Block was issued, and new Hotspot firmware was issued to all Hotspot manufacturers. After these releases, which included further improvements to limiting block size, processing blocks, and processing snapshots, activity resumed for Hotspots and Validators.
The final disruption came on November 23rd at 00:45 UTC, when a new chain variable was activated that caused some miners to stop syncing at block 1,107,995. The root cause was a “nonce disagreement” that arose when miners consumed a snapshot that included the Rescue Block from the second outage. During this outage, a subset of Hotspots and Validators was unable to sync the blockchain for approximately 4 hours. The blockchain continued moving during this time, and there was no impact on Proof-of-Coverage or packet transfer for those Hotspots that remained in sync.
Moving Forward From the Outages
During and following the Network outages, changes were rapidly developed and deployed by the Helium Community and core developers, the full details of which can be found on the Engineering Blog.
In short, these changes include significant performance enhancements, a cap on the size of non-reward blocks (initial default limit of 50 MB, adjustable through a chain variable), a 1 MB cap on each consensus member's transaction proposals, and improved Snapshot loading that requires dramatically less memory on Hotspots.
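To illustrate how the two size caps described above might interact, here is a minimal Python sketch. It is not the actual Helium implementation: the names `Txn`, `cap_proposal`, and `build_block` are hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption. Only the 50 MB and 1 MB limits come from this report.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: binary megabytes

# Limits quoted in the report; every name below is hypothetical.
MAX_BLOCK_BYTES = 50 * MB     # initial default cap for non-reward blocks
MAX_PROPOSAL_BYTES = 1 * MB   # cap per consensus member transaction proposal


@dataclass
class Txn:
    payload: bytes

    @property
    def size(self) -> int:
        return len(self.payload)


def cap_proposal(txns, limit=MAX_PROPOSAL_BYTES):
    """Return the longest prefix of txns whose total size fits within limit."""
    picked, total = [], 0
    for txn in txns:
        if total + txn.size > limit:
            break
        picked.append(txn)
        total += txn.size
    return picked


def build_block(proposals, block_limit=MAX_BLOCK_BYTES,
                proposal_limit=MAX_PROPOSAL_BYTES):
    """Merge capped proposals in order, stopping before block_limit is exceeded."""
    block, total = [], 0
    for proposal in proposals:
        for txn in cap_proposal(proposal, proposal_limit):
            if total + txn.size > block_limit:
                return block
            block.append(txn)
            total += txn.size
    return block
```

With limits like these, a flood of dispute transactions from a single source would be truncated at the proposal stage, and the assembled block could not grow past the configured maximum regardless of how many proposals arrive.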
In addition to these changes, a number of new firmware releases were made available to all Hotspot Makers to ensure their Hotspot fleets could sync, participate in Proof-of-Coverage, and transfer device data. New releases were also made available to Validators to ensure that the Consensus Group would run smoothly.
Overall, the performance improvements developed during the past few weeks and deployed by Hotspot manufacturers and Validator operators will benefit the Network well beyond resolution of these disruptions as it continues its unprecedented growth.
The core developers remain committed to acting quickly when issues arise and to supporting all currently approved Hotspot hardware until the Light Hotspot software upgrade is available for manufacturers to implement and migrate their fleets. Once the Light Hotspot software is released, Hotspots will no longer need to sync the blockchain, which will prevent many of the recent issues from recurring.
Thank You to the Community
Thank you to everyone in the Helium Community for their contributions and patience throughout the past couple of weeks. These early days of building the Network are certainly challenging.
We have learned a lot very quickly over the last two years of scaling the largest people-owned Network and are proud of the progress that has been made. We believe the latest changes are significant enhancements that will help keep everything running smoothly in the near future and make for a stronger, more resilient Network for years to come.