Consensus

How Distributed Systems Reach Agreement

Nov 21, 2024

Hello! For the last day of the week, let’s explore an essential concept in databases: consensus.

One fundamental problem in the context of distributed systems is the need to achieve consensus. In essence, consensus is about getting a set of processes to agree on something in a fault-tolerant way.

To be specific to the domain of databases, it means that a set of nodes—potentially distributed across different locations—have to agree on a value or a decision, even in the presence of failures or network partitions.

Why is consensus important? In a distributed database with multiple nodes, achieving strong consistency requires all nodes to agree on the same value for a given key, regardless of which node is queried. Consensus ensures that each node reflects the same data, guaranteeing consistent responses across the system.

Consider two conflicting queries: one reaches node 1 to set foo=42, while another reaches node 2 at the same time to set foo=200. Without consensus, these nodes might return different values. Consensus ensures all nodes agree on one value: either foo=42 or foo=200.

Consensus is used on many occasions in the context of databases, including:

Leader election: Ensures all nodes agree on a single leader.
Data replication: Guarantees that all nodes have the same state of data, avoiding discrepancies between replicas.
Data recovery: Ensures that during recovery from node failures, all nodes agree on the correct state of the database.
Distributed query execution: For complex queries that require interacting with multiple nodes, consensus ensures that all nodes agree on the execution plan and the results.

Consensus algorithms must satisfy the following properties:

Termination: The consensus process is not stuck forever.
Integrity: Once a decision is reached, no node will change its mind later.
Agreement: All the nodes agree on the same outcome.

As we discussed in Safety and Liveness, consensus algorithms should respect both safety and liveness properties. For example, multiple nodes reaching a consensus to agree on a given value must satisfy:

Safety: Ensures that the nodes won’t agree on conflicting values, e.g., some nodes agreeing on one value and others on another value.
Liveness: Ensures that the nodes eventually find an agreement, even if some node fails or there are network issues.

Also, let’s mention two widely-used consensus algorithms1:

Paxos: A foundational consensus algorithm used in systems like Apache Zookeeper.
Raft: An algorithm designed to be easier to implement and used in systems like etcd and CockroachDB.

In conclusion, consensus ensures that distributed nodes agree on a single state or value, which is essential for enforcing strong consistency models.

We will explore these two algorithms in more detail in future issues.

Consensus

How Distributed Systems Reach Agreement

Comments