The Docker Swarm architecture
The architecture of a Docker Swarm from a 30,000-foot view consists of two main parts: a Raft consensus group of an odd number of manager nodes, and a group of worker nodes that communicate with each other over a gossip network, also called the control plane. The following diagram illustrates this architecture:
Figure 14.1 – High-level architecture of a Docker Swarm
The manager nodes manage the Swarm while the worker nodes execute the applications deployed into the Swarm. Each manager has a complete copy of the full state of the Swarm in its local Raft store. The managers communicate with each other synchronously, and their Raft stores are always in sync.
The workers, on the other hand, communicate with each other asynchronously for scalability reasons. There can be hundreds if not thousands of worker nodes in a Swarm.
Now that we have a high-level overview of what a Docker Swarm is, let’s describe all of the individual elements of a Docker Swarm in more detail.
Swarm nodes
A Swarm is a collection of nodes. A node can be either a physical computer or a Virtual Machine (VM). Physical computers these days are often referred to as bare metal; people say we’re running on bare metal to distinguish that from running on a VM.
When we install Docker on such a node, we call this node a Docker host. The following diagram illustrates a bit better what a node and a Docker host are:
Figure 14.2 – Bare-metal and VM types of Docker Swarm nodes
To become a member of a Docker Swarm, a node must be a Docker host. A node in a Docker Swarm can have one of two roles: it can be a manager or it can be a worker. Manager nodes do what their name implies; they manage the Swarm. The worker nodes, in turn, execute the application workload.
Technically, a manager node can also be a worker node and hence run the application workload—although that is not recommended, especially if the Swarm is a production system running mission-critical applications.
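If you want to inspect a Swarm’s nodes programmatically rather than with the docker CLI, you can use the Docker SDK for Python. The following sketch is purely illustrative: it assumes a running Swarm, the docker Python package installed, and a hypothetical manager with the hostname mgr1, whose availability it sets to drain so that it no longer runs application workloads:

```python
import docker

# Assumes the local Docker daemon is a Swarm manager and the
# `docker` package (Docker SDK for Python) is installed.
client = docker.from_env()

# List every node with its role and availability.
for node in client.nodes.list():
    spec = node.attrs["Spec"]
    hostname = node.attrs["Description"]["Hostname"]
    print(f"{hostname}: role={spec['Role']}, availability={spec['Availability']}")

# Drain a manager (hypothetical hostname "mgr1") so it keeps managing
# the Swarm but does not receive application workloads.
for node in client.nodes.list(filters={"role": "manager"}):
    if node.attrs["Description"]["Hostname"] == "mgr1":
        spec = node.attrs["Spec"]
        spec["Availability"] = "drain"   # keep the rest of the spec unchanged
        node.update(spec)
```

On the command line, the equivalent would be docker node ls followed by docker node update --availability drain mgr1.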
Swarm managers
Each Docker Swarm needs to include at least one manager node. For high-availability reasons, we should have more than one manager node in a Swarm. This is especially true for production or production-like environments. If we have more than one manager node, then these nodes work together using the Raft consensus protocol. The Raft consensus protocol is a standard protocol that is often used when multiple entities need to work together and always need to agree with each other as to which activity to execute next.
To work well, the Raft consensus protocol asks for an odd number of members in what is called the consensus group. Hence, we should always have an odd number of manager nodes: 1, 3, 5, 7, and so on. In such a consensus group, there is always a leader. In the case of Docker Swarm, the first node that starts the Swarm initially becomes the leader. If the leader goes away, then the remaining manager nodes elect a new leader. The other nodes in the consensus group are called followers.
Raft leader election
Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in the follower state as long as it receives valid Remote Procedure Calls (RPCs) from a leader or candidate. Leaders send periodic heartbeats to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, it assumes there is no viable leader and begins an election to choose a new one. Each server uses a randomly chosen election timeout; when that timer fires, the server turns itself from a follower into a candidate. At the same time, it increments the term, votes for itself, sends messages to all its peers asking for their vote, and waits for the responses.
In the context of the Raft consensus algorithm, a “term” corresponds to a round of election and serves as a logical clock for the system, allowing Raft to detect obsolete information such as stale leaders. Every time an election is initiated, the term value is incremented.
When a server receives a vote request, it casts its vote only if the candidate’s term is equal to or higher than its own; otherwise, the vote request is rejected. A server can vote for only one candidate per term, but when it receives another vote request with a higher term than the one it voted in, it discards its previous vote.
In the context of Raft and many other distributed systems, “logs” refer to the state machine logs or operation logs, not to be confused with traditional application logs.
If the candidate doesn’t receive enough votes before its next timer fires, the current election is void and the candidate starts a new election with a higher term. Once the candidate receives votes from the majority of its peers, it turns itself from a candidate into the leader and immediately broadcasts heartbeats to assert its authority and prevent other servers from starting a new election. The leader keeps broadcasting these heartbeats periodically. Now, let’s assume that we shut down the current leader node for maintenance reasons. The remaining manager nodes will elect a new leader. When the previous leader node comes back online, it becomes a follower; the new leader remains the leader.
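To make these election mechanics a bit more tangible, the following Python snippet sketches a single election round. It is a deliberately simplified model, not Docker’s or Raft’s actual implementation: it ignores networking, heartbeats, and log comparison, and only illustrates randomized timeouts, terms, self-voting, and the one-vote-per-term rule:

```python
import random

class Server:
    def __init__(self, name):
        self.name = name
        self.state = "follower"
        self.term = 0
        self.voted_for = {}  # term -> name of the candidate we voted for
        # Randomized election timeout (arbitrary time units).
        self.timeout = random.uniform(150, 300)

    def request_vote(self, candidate_name, candidate_term):
        """Grant the vote if the candidate's term is not older than ours
        and we have not already voted for someone else in that term."""
        if candidate_term < self.term:
            return False
        self.term = candidate_term
        if self.voted_for.get(candidate_term, candidate_name) == candidate_name:
            self.voted_for[candidate_term] = candidate_name
            return True
        return False

def run_election(servers):
    # The server whose election timeout fires first becomes a candidate.
    candidate = min(servers, key=lambda s: s.timeout)
    candidate.state = "candidate"
    candidate.term += 1                                    # start a new term
    candidate.voted_for[candidate.term] = candidate.name   # vote for itself
    votes = 1
    for peer in servers:
        if peer is not candidate and peer.request_vote(candidate.name, candidate.term):
            votes += 1
    # A majority of the consensus group is required to win.
    if votes > len(servers) // 2:
        candidate.state = "leader"
    return candidate

servers = [Server(f"manager-{i}") for i in range(1, 4)]
leader = run_election(servers)
print(f"{leader.name} won term {leader.term} and is now in state '{leader.state}'")
```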
All of the members of the consensus group communicate synchronously with each other. Whenever the consensus group needs to make a decision, the leader asks all followers for agreement. If the majority of the manager nodes gives a positive answer, then the leader executes the task. That means if we have three manager nodes, then at least one of the followers has to agree with the leader. If we have five manager nodes, then at least two followers have to agree with the leader.
Since all follower manager nodes have to communicate synchronously with the leader node to make a decision in the cluster, the decision-making process gets slower the more manager nodes form the consensus group. Docker’s recommendation is to use one manager for development, demo, or test environments; three manager nodes in small to medium-sized Swarms; and five managers in large to extra-large Swarms. Using more than five managers in a Swarm is hardly ever justified.
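The majority arithmetic behind these recommendations is easy to check. The following illustrative Python snippet prints the quorum size and the number of manager failures that can be tolerated for typical manager counts:

```python
# Quorum (majority) and fault tolerance for different manager counts.
for managers in (1, 3, 5, 7):
    quorum = managers // 2 + 1       # votes needed to make a decision
    tolerated = managers - quorum    # managers that may fail
    print(f"{managers} managers: quorum={quorum}, tolerates {tolerated} failure(s)")
```

Note that an even number of managers buys nothing extra: four managers still need a quorum of three and tolerate only one failure, just like three managers do, which is another reason why odd counts are recommended.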
The manager nodes are not only responsible for managing the Swarm but also for maintaining the state of the Swarm. What do we mean by that? When we talk about the state of the Swarm, we mean all the information about it—for example, how many nodes are in the Swarm and what the properties of each node are, such as the name or IP address. We also mean what containers are running on which node in the Swarm and more. What, on the other hand, is not included in the state of the Swarm is data produced by the application services running in containers on the Swarm. This is called application data and is definitely not part of the state that is managed by the manager nodes:
Figure 14.3 – A Swarm manager consensus group
The complete Swarm state is stored in a high-performance key-value store (kv-store) on each manager node. That’s right, each manager node stores a complete replica of the whole Swarm state. This redundancy makes the Swarm highly available. If a manager node goes down, the remaining managers all have the complete state at hand.
If a new manager joins the consensus group, then it synchronizes the Swarm state with the existing members of the group until it has a complete replica. This replication is usually pretty fast in typical Swarms but can take a while if the Swarm is big and many applications are running on it.
Swarm workers
As we mentioned earlier, a Swarm worker node is meant to host and run containers that contain the actual application services we’re interested in running on our cluster. They are the workhorses of the Swarm. In theory, a manager node can also be a worker, but, as we already said, this is not recommended on a production system; there, we should let managers be managers.
Worker nodes communicate with each other over the so-called control plane. They use the gossip protocol for their communication. This communication is asynchronous, which means that, at any given time, it is likely that not all worker nodes are in perfect sync.
Now, you might ask: what information do worker nodes exchange? It is mostly information that is needed for service discovery and routing, that is, information about which containers are running on which nodes, and more:
Figure 14.4 – Worker nodes communicating with each other
In the preceding diagram, you can see how workers communicate with each other. To make sure the gossiping scales well in a large Swarm, each worker node only synchronizes its own state with three random neighbors. For those who are familiar with Big O notation, that means the synchronization effort of each individual worker node stays constant, that is, O(1), no matter how large the Swarm grows.
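To see why the per-node effort stays constant, consider the following toy Python simulation of gossip. It is not the actual SwarmKit implementation; each node simply merges its knowledge with three randomly chosen peers per round, independent of the cluster size:

```python
import random

def gossip_round(states, fanout=3):
    """One gossip round: every node exchanges state with `fanout`
    randomly chosen peers, and both sides merge what they know."""
    node_ids = list(states)
    for node in node_ids:
        peers = random.sample([n for n in node_ids if n != node],
                              k=min(fanout, len(node_ids) - 1))
        for peer in peers:
            merged = states[node] | states[peer]   # union of known facts
            states[node] = states[peer] = merged

# 100 worker nodes; node 0 learns a new fact (e.g., a container placement).
states = {i: set() for i in range(100)}
states[0].add("task-x-runs-on-node-0")

rounds = 0
while not all("task-x-runs-on-node-0" in s for s in states.values()):
    gossip_round(states)
    rounds += 1
print(f"Information reached all 100 nodes after {rounds} round(s)")
```

Each node does a fixed amount of work per round, yet the information typically reaches the whole cluster after only a handful of rounds.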
Big O notation explained
Big O notation is a way to describe the speed or complexity of a given algorithm. It tells you the number of operations an algorithm will make. It’s used to communicate how fast an algorithm is, which can be important when evaluating other people’s algorithms, and when evaluating your own.
For example, let’s say you have a list of numbers and you want to find a specific number in the list. There are different algorithms you can use to do this, such as simple search or binary search. Simple search checks each number in the list one by one until it finds the number you’re looking for. Binary search, on the other hand, repeatedly divides the list in half until it finds the number you’re looking for.
Now, let’s say you have a list of 100 numbers. With simple search, in the worst case, you’ll have to check all 100 numbers, so it takes 100 operations. With binary search, in the worst case, you’ll only have to check about 7 numbers (because log2(100) is roughly 7), so it takes 7 operations.
In this example, binary search is faster than simple search. But what if you have a list of 1 billion numbers? Simple search would take 1 billion operations, while binary search would take only about 30 operations (because log2(1 billion) is roughly 30). So, as the list gets bigger, binary search becomes much faster than simple search.
Big O notation is used to describe this difference in speed between algorithms. In Big O notation, simple search is described as O(n), which means that the number of operations grows linearly with the size of the list (n). Binary search is described as O(log n), which means that the number of operations grows logarithmically with the size of the list.
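The following small Python example, included only to make the numbers above concrete, counts the comparisons each approach needs in the worst case on a sorted list of 100 numbers:

```python
def simple_search(items, target):
    """Linear search: check items one by one; O(n) comparisons."""
    comparisons = 0
    for value in items:
        comparisons += 1
        if value == target:
            return comparisons
    return comparisons

def binary_search(items, target):
    """Binary search on a sorted list: halve the range each step; O(log n)."""
    comparisons = 0
    low, high = 0, len(items) - 1
    while low <= high:
        comparisons += 1
        mid = (low + high) // 2
        if items[mid] == target:
            return comparisons
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return comparisons

numbers = list(range(1, 101))  # a sorted list of 100 numbers
print("simple search:", simple_search(numbers, 100), "comparisons")  # 100
print("binary search:", binary_search(numbers, 100), "comparisons")  # 7
```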
Worker nodes are kind of passive. They never actively do anything other than run the workloads they get assigned by the manager nodes. The worker makes sure, though, that it runs these workloads to the best of its capabilities. Later on in this chapter, we will get to know more about exactly what workloads the worker nodes are assigned by the manager nodes.
Now that we know what manager and worker nodes in a Docker Swarm are, we are going to introduce stacks, services, and tasks next.