Earlier this year, Flow’s consensus team rolled out a major upgrade to our consensus algorithm. The upgrade reduced transaction finalization time by 30 percent and enabled faster, automatic recovery from network partitions, attacks and upgrades.
As part of this upgrade, the research team laid the technical foundation for an array of improvements to the Follower mode of the consensus algorithm, improving attack resilience and processing speed for all non-consensus nodes, including observer nodes, which anyone can operate without staking. We're excited to share that these improvements will be rolling out in the June 2023 network upgrade. In this article, we explain our improvements in more detail and their implications for the Flow network.
Flow processes transactions in a pipelined manner, where five specialized node types — access, collection, consensus, execution, and verification nodes (all staked) — work together. In addition, there are non-staked observer nodes. In a nutshell, consensus nodes schedule transactions for execution by proposing new blocks and voting on them. The other nodes track the chain as it advances and perform the work as orchestrated by the consensus nodes.
Flow's consensus protocol supports two modes: Participant and Follower. A Participant is a node that actively contributes to advancing consensus via proposing and voting on new blocks. A central responsibility of Participants is to enforce block validity, by only voting on valid blocks and raising slashing challenges on invalid blocks. Consensus nodes operate in the Participant mode for Flow's main consensus [1].
The Consensus Follower is the module which allows any non-consensus node to track consensus progress trustlessly. The follower independently applies consensus rules and locally asserts block finality. Essentially, it shields the higher-level processing logic within the node from any malicious nodes sending invalid blocks. Unlike Participants, Followers do not need to validate block payloads in full — instead they can validate block headers only, utilizing the verification work already done by Participants.
The new Follower implementation leverages the architectural improvements of our recent consensus upgrade by hardening nodes against various attacks, enabling the Follower to detect consensus violations, and lays the groundwork for extremely efficient light clients in the future.
The majority of improvements were unlocked by redesigning how and when Flow nodes validate Quorum Certificates (QCs). A QC is an aggregate of votes from a supermajority of consensus participants. While proposing or voting for an invalid block is a slashable protocol violation, the protocol still has to handle this scenario as individual consensus participants can be purposefully malicious (in technical terms ‘Byzantine’).
Flow employs Jolteon as its BFT consensus algorithm, which is an optimized flavor of HotStuff. In HotStuff, the QC for block B is included in B’s child block(s). A valid QC attests that a supermajority of consensus participants consider the block a valid extension of the chain. We say that a block B is certified, if and only if a valid QC for block B is known.
The Follower upgrade focuses on performance and resilience to Byzantine attacks. To mitigate attacks via invalid blocks, we re-designed the Follower’s internal processing of incoming blocks to:
In the previous version of the Follower, it was necessary to know a block's entire ancestry before we could validate the block’s header. For performance reasons, it is important to temporarily cache blocks, even if some ancestors are still missing.
The necessity to cache disconnected blocks could have been exploited as a vector for memory exhaustion attacks on the previous follower implementation:
In the depicted example, Evil Edward — a Byzantine consensus participant — is making up a long fork of blocks. As these blocks are fake, they haven’t been voted on by the consensus committee and no valid QCs exist for them. Therefore, Edward includes invalid QCs in his fork. The previous Follower cannot assess whether the QCs in Edward’s fork are valid since the Follower does not know the full ancestry for the fork.
With enough blocks and time, this attack could cause the Follower to crash with an out of memory exception. Even without a crash, this memory exhaustion attack can harm the Follower by flooding its cache with fake blocks, which delay the processing of valid blocks. In tests, we have seen that mounting a memory exhaustion attack on a node’s Follower caused the node to fall behind and fail to fulfill its function within the network in a timely manner.
The upgraded Follower can now validate blocks within the current epoch (one week) before their ancestors are known. Therefore, the new consensus follower would detect Edward’s fake blocks as invalid, discard them and raise a slashing challenge against Edward. Blocks outside the current epoch are immediately discarded.
The Follower uses a cache to store blocks which have not yet been incorporated into the blockchain. This cache is now separated into distinct sections for certified and uncertified blocks.
In the example above, the malicious primary has created 1,001 conflicting blocks. Consensus rules guarantee that only one of them will get certified (in our example block B). However, it is not clear in advance which one will eventually be certified to a Follower receiving the blocks. Therefore, the old Follower would store all blocks on disk, which is a vector for storage exhaustion attacks.
In comparison, the new Follower only stores certified blocks on disk. As only one of the many conflicting blocks can be certified, certified blocks pose little risk of storage exhaustion attacks.
Uncertified blocks have a valid proposer signature, but may nevertheless pose a threat in the case of a malicious leader. Therefore these blocks are stored separately in a size-limited cache which mitigates memory exhaustion attacks.
Flow's core consensus algorithm runs in a single-threaded event loop, meaning that one event is processed at a time. This is desirable for the simplicity of this mission-critical component, but limits performance.
A focus of recent architectural changes to Flow's consensus has been shifting work out of this event loop where possible. The team implemented parallel vote processing in early 2022, parallel timeout processing in the recent Jolteon upgrade, and we are extending this to include parallel QC validation in the Follower engine as part of Follower V2.
This change is particularly beneficial when a node which has fallen behind is catching up, because it can process more blocks in parallel.
When a block is certified, a supermajority of Consensus Nodes have agreed that the block is a valid extension of the state and voted for it. As a blockchain where each block builds upon its parent, a block is only valid if it is extending a valid block. In other words, if a block is certified, then all its ancestors are certified, too.
The new Follower can leverage this fact to avoid unnecessary work. When processing a sequential range of blocks, it is sufficient to ensure that the block at the end of the range is certified (and to check that the block hashes match). Hence, we only perform one expensive signature validation operation for the entire batch, instead of one for every block in the batch.
When a node is catching up, it receives blocks in sequential ranges by default, and can make use of this validation improvement to improve syncing speed.
When envisioning light clients running on embedded hardware, including smartphones, there are three main challenges to solve:
As discussed earlier in this article, the new Follower allows validating QCs for blocks without knowing their entire ancestry as long as those blocks are within a known epoch. For Flow, an epoch is typically one week long. Within this week, the set of authorized nodes remains unchanged. If a node is slashed, the network might revoke their right to propose blocks, but no new nodes are added mid epoch.
At the epoch transition, the nodes authorized to run Epoch N hand the control of the blockchain over to the nodes for Epoch N+1. Slightly simplified, this works as follows:
Note that any Follower that has the Identity table for Epoch N only needs a few blocks to construct the Identity Table for Epoch N+1. The figure below illustrates this process:
In summary, the Consensus Follower only needs to retrieve two chain segments of 10-20 blocks each to locally construct the identity table for the following epoch. The validity of the blocks is asserted through validating the tailing QC at the end of each batch and re-hashing the received blocks.
While roughly 460,000 blocks are finalized during an epoch, a consensus Follower only needs to retrieve 0.005 percent of those blocks. Thereby, the Follower can fast-forward through months or even years of blocks consuming negligible computational and networking resources.
Furthermore, consider an Epoch (past or current) for which the Follower knows the identity table. The Follower can retrieve a short chain segment at any height belonging to the epoch and assert its validity and finality by checking the QCs. Hence, the Follower can trustlessly obtain the state root hash for nearly any block of the epoch, as state root hashes are also contained in the blocks [2]. When knowing the state root hash at some block, it becomes possible to selectively retrieve part of the state in a trustless manner utilizing Merkle proofs.
Interested in building on Flow? Visit our developer portal to get started, and keep in touch with us on Twitter or Discord.