Page 45 - MSDN Magazine, October 2019
the theory of distributed systems in how we deal with trust. In the next section, we drill down further into distributed systems and their implementation using the well-known state machine model in computer science.
Distributed Systems and the State Machine Approach
Distributed systems share a core set of special characteristics, as explored in the field of theoretical computer science. These include:

Concurrency: Multiple activities across the distributed system may be executed simultaneously and independently of each other. This implies that there’s a need for coordination across the different flows of execution.

Independent failure modes: Multiple components across the distributed system may fail independently of each other.

No global time: Multiple flows of execution may be aligned with spatially independent local clocks. Even in the event that these clocks are initially synchronized, clock drift will eventually result. This implies that time and event ordering is a core challenge in distributed systems.

Communications delay: There’s an inherent lag in how events and their side effects propagate through the distributed system.

Inconsistent state: Concurrency, independent failure modes, and communications delays together imply that the view of any state will not be consistent throughout the distributed system.

Collectively, these characteristics require that distributed systems be designed to be fault-tolerant, in order to continue to operate in the event of one or more faults (or the complete failure) of one or more subsystems.
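The "no global time" challenge is typically addressed with logical clocks. As an illustrative sketch (not drawn from this article), the following implements Lamport timestamps, which order events causally without any shared physical clock:

```python
# Minimal sketch of Lamport logical clocks (illustrative example only).
class Process:
    def __init__(self, name):
        self.name = name
        self.clock = 0  # local logical clock; no global time is assumed

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        # A message carries the sender's current timestamp.
        self.clock += 1
        return self.clock

    def receive(self, msg_ts):
        # On receipt, advance past both the local clock and the message's timestamp,
        # so the receive event is ordered after the send event.
        self.clock = max(self.clock, msg_ts) + 1
        return self.clock

a, b = Process("A"), Process("B")
a.local_event()          # A's clock: 1
ts = a.send()            # A's clock: 2; message stamped 2
b.receive(ts)            # B's clock: max(0, 2) + 1 = 3
print(a.clock, b.clock)  # prints: 2 3
```

Even though B performed fewer events, its clock jumps ahead of the message timestamp, preserving the causal order of send before receive.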
The state machine approach is a general method for implementing a fault-tolerant distributed system by replicating services and coordinating client interactions across service replicas. A replicated state machine consists of a set of state variables that encode its state, together with transactions that can cause the machine to transition from one valid state to the next. Each transaction is executed deterministically and atomically. Essentially, a replicated state machine is a distributed set of services where all the services start with the same initial state and then agree (that is, reach consensus) on each of the subsequent state transitions.
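The idea can be made concrete with a minimal sketch (the key-value machine and all names here are hypothetical, and consensus on the log ordering is simply assumed):

```python
# Sketch: replicas that apply the same ordered transaction log stay identical.
class KVStateMachine:
    """A deterministic key-value state machine (hypothetical example)."""
    def __init__(self):
        self.state = {}

    def apply(self, txn):
        # Each transaction is applied deterministically and atomically.
        op, key, value = txn
        if op == "set":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

# Three replicas start with the same initial state...
replicas = [KVStateMachine() for _ in range(3)]

# ...and agree on one ordered log of transactions (consensus assumed here).
log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None)]
for replica in replicas:
    for txn in log:
        replica.apply(txn)

# Determinism plus identical ordering yields identical state on every replica.
assert all(r.state == {"y": 2} for r in replicas)
```

All the difficulty in practice lies in the part this sketch assumes away: getting every replica to agree on the same ordered log.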
Consensus Across Replicated State Machines
Formally, the goal of a consensus algorithm is to satisfy three key properties. These are:
Termination: All non-faulty services in the system eventually decide on some output value. This is often referred to as liveness.
Integrity: If all of the non-faulty services propose the same out- put value, then any non-faulty service must decide on the same output value. A weaker form of integrity is one where the output value must equal a value that was proposed by some non-faulty service (not necessarily all of them).
Agreement: All non-faulty services in the system eventually decide on the same output value. This is often referred to as safety.

Distributed systems theory has made tremendous leaps in the understanding of consensus algorithms, but consensus in a completely asynchronous distributed system has proven impossible to achieve in the presence of even a single faulty service. This is called the FLP impossibility, named after the researchers (Michael J. Fischer, Nancy Lynch and Michael S. Paterson) who proved this definitive limit on what’s possible to achieve with distributed processes in an asynchronous environment.
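As a toy illustration of the three properties (assuming a synchronous round with reliable broadcast, exactly the kind of assumption FLP shows can’t simply be dropped), consider services that each broadcast a proposal and then decide the minimum value received:

```python
# Toy consensus: every non-faulty service reliably broadcasts its proposal,
# then each decides the minimum proposal received. This sketch assumes a
# synchronous round with reliable broadcast -- precisely the kind of
# assumption the FLP result shows cannot be dispensed with.
def toy_consensus(proposals, faulty):
    """proposals: service -> proposed value; faulty: set of crashed services."""
    visible = {s: v for s, v in proposals.items() if s not in faulty}
    decision = min(visible.values())        # same deterministic rule everywhere
    return {s: decision for s in visible}   # crashed services decide nothing

proposals = {"s1": 7, "s2": 3, "s3": 9, "s4": 3}
decisions = toy_consensus(proposals, faulty={"s3"})

# Termination: every non-faulty service decided some value.
assert set(decisions) == {"s1", "s2", "s4"}
# Agreement: all non-faulty services decided the same value.
assert len(set(decisions.values())) == 1
# Weak integrity: the decided value was proposed by some non-faulty service.
assert 3 in [proposals[s] for s in decisions]
```

The hard part, which this toy hides, is guaranteeing "reliably broadcasts" when messages can be delayed indefinitely.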
The FLP impossibility has spawned research spanning two innovative approaches. One set of algorithms relies on the so-called Nakamoto consensus. It takes an unconventional approach, relying on non-determinism to address the inherent scale challenges of generating consensus in a distributed system. The brilliance of the Nakamoto consensus is that rather than every service agreeing on a value, the algorithm focuses on all of the services agreeing on the probability of the value being correct. However, this results in probabilistic agreement: the lack of a deterministically finalized value at every state transition means there’s no guarantee of true finality. This leads to the so-called forking scenario with respect to the distributed system. For this reason, we’ll set the Nakamoto consensus aside for the remainder of this article.
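Probabilistic finality can be made concrete with a short sketch based on the attacker catch-up calculation from the Bitcoin whitepaper (an assumption of this example, not something derived in this article): the probability that a block is reversed falls with each confirmation but never reaches exactly zero:

```python
import math

# Sketch of the probabilistic-finality calculation from the Bitcoin
# whitepaper: the probability that an attacker controlling fraction q of
# the hash power ever overtakes the honest chain once a block has z
# confirmations (valid for q < 0.5).
def attacker_success_probability(q, z):
    p = 1.0 - q
    lam = z * (q / p)  # expected attacker progress while z honest blocks arrive
    prob = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        prob -= poisson * (1.0 - (q / p) ** (z - k))
    return prob

# Finality is only probabilistic: reversal probability shrinks with each
# confirmation but never reaches zero.
for z in (0, 2, 6):
    print(z, attacker_success_probability(0.1, z))
```

With zero confirmations the reversal probability is 1.0; it then drops off sharply as confirmations accumulate, which is why clients wait for several blocks before treating a transaction as settled.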
A second set of practical fault-tolerant consensus algorithms makes some level of synchrony assumptions in order to make progress. What this means is that some protocols are designed to work in unreliable networks (such as, say, the Internet) that drop messages and may introduce arbitrary delay, while other protocols are optimized for highly reliable network channels. These protocols are said to operate under different sets of synchrony assumptions. Synchrony assumptions may be explicit, or implicit through reliance on leader election algorithms, for instance. Consensus algorithms that are based on leader election are in the Paxos family of algorithms.
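As an illustrative sketch of implicit synchrony via leader election (a generic majority-vote round, not any specific published protocol; all names are hypothetical):

```python
# Illustrative sketch: a single round of majority-vote leader election.
# Synchrony is assumed implicitly -- every vote arrives within the round.
def elect_leader(service_ids, votes):
    """votes: voter -> candidate. A candidate with a strict majority wins."""
    tally = {}
    for candidate in votes.values():
        tally[candidate] = tally.get(candidate, 0) + 1
    majority = len(service_ids) // 2 + 1
    for candidate, count in tally.items():
        if count >= majority:
            return candidate
    return None  # no majority: this round fails and must be retried

services = ["s1", "s2", "s3", "s4", "s5"]
print(elect_leader(services, {s: "s3" for s in services}))   # prints: s3
print(elect_leader(services, {"s1": "s1", "s2": "s2", "s3": "s3",
                              "s4": "s4", "s5": "s5"}))      # prints: None
```

The `None` case is where the synchrony assumption hides: a split vote forces a retry, and without some timing assumption there is no bound on how many retries liveness may require.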
Byzantine Fault Tolerant Consensus
Byzantine failures pose a challenge for leader-based consensus algorithms. These failures occur when components or sub-components of a distributed system fail, and there’s imperfect information about whether a component (or sub-component) has actually failed. Algorithmic proofs exist to demonstrate that a malicious leader can’t cause inconsistency, but distributed systems theory has yet to demonstrate that a malicious leader can’t prevent progress.
The so-called practical BFT (pBFT) algorithm by Castro and Liskov was the first attempt to describe an algorithm by which the system can detect lack of progress and choose a new leader. pBFT was devised to address the twin flaws in previous attempts—either the algorithm was too slow to be of practical use or synchrony had to be assumed to satisfy the “agreement” property.
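The arithmetic behind BFT quorums can be sketched as follows (a hedged illustration of the standard n ≥ 3f + 1 bound that pBFT relies on, not Castro and Liskov’s actual implementation):

```python
# Sketch of the standard BFT fault-tolerance bounds (not the pBFT protocol
# itself): tolerating f Byzantine replicas requires n >= 3f + 1 replicas,
# and progress requires a quorum of 2f + 1 matching messages.
def max_byzantine_faults(n):
    """Largest f such that n >= 3f + 1."""
    return (n - 1) // 3

def quorum_size(n):
    """Messages needed so any two quorums overlap in a correct replica."""
    f = max_byzantine_faults(n)
    return 2 * f + 1

for n in (4, 7, 10):
    f = max_byzantine_faults(n)
    q = quorum_size(n)
    # Any two quorums of size 2f+1 among n replicas intersect in at least
    # f+1 replicas, so at least one honest replica is in both quorums.
    assert 2 * q - n >= f + 1
    print(f"n={n}: tolerates f={f}, quorum={q}")
```

The quorum-intersection property is what lets the system detect a stalled or lying leader and safely switch to a new one: any two decisions share an honest witness.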