Let's see whether we can solve the problem with PersistentVolumes through a StatefulSet. As a reminder, our goal (for now) is for each MongoDB instance to get a separate volume.
The updated definition is in the sts/go-demo-3-sts.yml file.
cat sts/go-demo-3-sts.yml
Most of the new definition is the same as the one we used before, so we'll comment only on the differences. The first in line is the StatefulSet that replaces the db Deployment. It is as follows.
apiVersion: apps/v1beta2
kind: StatefulSet
metadata:
  name: db
  namespace: go-demo-3
spec:
  serviceName: db
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: db
        image: mongo:3.3
        command:
        - mongod
        - "--replSet"
        - rs0
        - "--smallfiles"
        - "--noprealloc"
        ports:
        - containerPort: 27017
        resources:
          limits:
            memory: "100Mi"
            cpu: 0.1
          requests:
            memory: "50Mi"
            cpu: 0.01
        volumeMounts:
        - name: mongo-data
          mountPath: /data/db
  volumeClaimTemplates:
  - metadata:
      name: mongo-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 2Gi
As you already saw with Jenkins, StatefulSet definitions are almost the same as those of Deployments. The only important difference is that we are not defining a PersistentVolumeClaim as a separate resource. Instead, we are letting the StatefulSet take care of it through the specification set inside the volumeClaimTemplates entry. We'll see it in action soon.
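To illustrate, the claim generated for the first replica should look roughly like the sketch that follows. This is only an approximation; the actual object would also carry labels and other metadata added by the StatefulSet controller.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongo-data-db-0 # [TEMPLATE_NAME]-[STATEFULSET_NAME]-[INDEX]
  namespace: go-demo-3
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi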
We also used this opportunity to tweak the mongod process by specifying a command for the db container that starts it with the --replSet rs0 argument, preparing it to join a MongoDB replica set named rs0. Please note that this replica set is specific to MongoDB and is in no way related to the Kubernetes ReplicaSet controller. Creating a MongoDB replica set is the base for some of the things we'll do later on.
Another difference is in the db Service. It is as follows.
apiVersion: v1
kind: Service
metadata:
  name: db
  namespace: go-demo-3
spec:
  ports:
  - port: 27017
  clusterIP: None
  selector:
    app: db
This time we set clusterIP to None. That will create a Headless Service. A Headless Service is a Service for which we need neither load-balancing nor a single Service IP.
Everything else in this YAML file is the same as in the one that used Deployment controller to run MongoDB.
To summarize, we changed the db Deployment into a StatefulSet, we added a command that prepares MongoDB to form a replica set named rs0, and we set the db Service to be Headless. We'll explore the reasons and the effects of those changes soon. For now, we'll create the resources defined in the sts/go-demo-3-sts.yml file.
kubectl apply \
    -f sts/go-demo-3-sts.yml \
    --record

kubectl -n go-demo-3 get pods
We created the resources and retrieved the Pods. The output of the latter command is as follows.
NAME    READY STATUS            RESTARTS AGE
api-... 0/1   Running           0        4s
api-... 0/1   Running           0        4s
api-... 0/1   Running           0        4s
db-0    0/1   ContainerCreating 0        5s
We can see that all three replicas of the api Pods are running or, at least, that's how it seems so far. The situation with db Pods is different. Kubernetes is creating only one replica, even though we specified three.
Let's wait for a bit and retrieve the Pods again.
kubectl -n go-demo-3 get pods
Forty seconds later, the output is as follows.
NAME    READY STATUS            RESTARTS AGE
api-... 0/1   CrashLoopBackOff  1        44s
api-... 0/1   CrashLoopBackOff  1        44s
api-... 0/1   Running           2        44s
db-0    1/1   Running           0        45s
db-1    0/1   ContainerCreating 0        9s
We can see that the first db Pod is running and that the creation of the second has started. At the same time, our api Pods are crashing. We'll ignore them for now and concentrate on the db Pods.
Let's wait a bit more and observe what happens next.
kubectl -n go-demo-3 get pods
A minute later, the output is as follows.
NAME    READY STATUS            RESTARTS AGE
api-... 0/1   CrashLoopBackOff  4        1m
api-... 0/1   Running           4        1m
api-... 0/1   CrashLoopBackOff  4        1m
db-0    1/1   Running           0        2m
db-1    1/1   Running           0        1m
db-2    0/1   ContainerCreating 0        34s
The second db Pod started running, and the system is creating the third one. It seems that our progress with the database is going in the right direction.
Let's wait a while longer before we retrieve the Pods one more time.
kubectl -n go-demo-3 get pods
The output is as follows.
NAME    READY STATUS           RESTARTS AGE
api-... 0/1   CrashLoopBackOff 4        3m
api-... 0/1   CrashLoopBackOff 4        3m
api-... 0/1   CrashLoopBackOff 4        3m
db-0    1/1   Running          0        3m
db-1    1/1   Running          0        2m
db-2    1/1   Running          0        1m
Another minute later, the third db Pod is also running but our api Pods are still failing. We'll deal with that problem soon.
What we just observed is an essential difference between Deployments and StatefulSets. Replicas of the latter are created sequentially. Only after the first replica was running did the StatefulSet start creating the second. Similarly, the creation of the third began only after the second was running.
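As a side note, that sequential behavior is the default (podManagementPolicy set to OrderedReady). If ordering is not important for a particular application, we could tell a StatefulSet to create and delete its Pods in parallel. A minimal sketch of such a change (not something we need for MongoDB) follows.

spec:
  podManagementPolicy: Parallel # the default is OrderedReady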
Moreover, we can see that the names of the Pods created through the StatefulSet are predictable. Unlike Deployments, which generate a random suffix for each Pod, StatefulSets create Pods with indexed suffixes based on integer ordinals.
The name of the first Pod will always be suffixed with -0, the second with -1, and so on. That naming will be maintained forever. If we initiated a rolling update, Kubernetes would replace the Pods of the db StatefulSet, but the names would remain the same.
The sequential creation of Pods and the predictable formatting of their names provide predictability that is often paramount with stateful applications. We can think of StatefulSet replicas as separate Pods with guaranteed ordering, uniqueness, and predictability.
How about PersistentVolumes? The fact that the db Pods did not fail means that the MongoDB instances managed to acquire their locks. In other words, they are either not sharing the same PersistentVolume, or they are using different directories within the same volume.
Let's take a look at the PersistentVolumes created in the cluster.
kubectl get pv
The output is as follows.
NAME    CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM                     STORAGECLASS REASON AGE
pvc-... 2Gi      RWO          Delete         Bound  go-demo-3/mongo-data-db-0 gp2                 9m
pvc-... 2Gi      RWO          Delete         Bound  go-demo-3/mongo-data-db-1 gp2                 8m
pvc-... 2Gi      RWO          Delete         Bound  go-demo-3/mongo-data-db-2 gp2                 7m
Now we can observe the reasoning behind using the volumeClaimTemplates spec inside the definition of the StatefulSet. The StatefulSet used the template to create a claim for each replica. We specified that there should be three replicas, so it created three Pods, as well as three separate volume claims. The result is three PersistentVolumes.
Moreover, we can see that the claims also follow a specific naming convention. The format is a combination of the name of the claim template (mongo-data), the name of the StatefulSet (db), and the index (0, 1, and 2).
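If we'd like to see the claims themselves, and not only the volumes they are bound to, we can list them as well.

kubectl -n go-demo-3 get pvc

The output should list three claims named mongo-data-db-0, mongo-data-db-1, and mongo-data-db-2, each bound to one of the PersistentVolumes we just saw.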
Judging by the age of the claims, we can see that they followed the same pattern as the Pods. They are approximately a minute apart. The StatefulSet created the first Pod and used the claim template to create a PersistentVolumeClaim, which was bound to a dynamically provisioned PersistentVolume and attached to the Pod. Later on, it moved on to the second Pod and its claim, and after that to the third. Pods are created sequentially, and each generated a new PersistentVolumeClaim.
If a Pod is (re)scheduled due to a failure or a rolling update, it'll continue using the same PersistentVolumeClaim and, as a result, it will keep using the same PersistentVolume, making Pods and volumes inseparable.
Given that each Pod in a StatefulSet has a unique and predictable name, we can assume that the same applies to the hostnames inside those Pods. Let's check it out.
kubectl -n go-demo-3 \
    exec -it db-0 -- hostname
We executed the hostname command inside one of the replicas of the StatefulSet. The output is as follows.
db-0
Just like the names of the Pods created by the StatefulSet, the hostnames are predictable as well. They follow the same pattern as the Pod names. Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet and the ordinal of the Pod. The pattern for the constructed hostname is [STATEFULSET_NAME]-[INDEX].
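We could repeat the same check against the other replicas. Assuming the pattern holds, the outputs should be db-1 and db-2 respectively.

kubectl -n go-demo-3 \
    exec -it db-1 -- hostname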
Let's move on to the Service related to the StatefulSet. If we take another look at the db Service defined in sts/go-demo-3-sts.yml, we'll notice that it has clusterIP set to None. As a result, the Service is headless.
In most cases we want Services to handle load-balancing and forward requests to one of the replicas. Load balancing is often "round-robin", even though it can be changed to other algorithms. However, sometimes we don't need the Service to do load-balancing, nor do we want it to provide a single IP for the Service. That is certainly true for MongoDB. If we are to convert its instances into a replica set, we need a separate and stable address for each. So, we disabled the Service's load-balancing by setting spec.clusterIP to None. That converted it into a Headless Service and let the StatefulSet use it to control the domain of its Pods.
We'll explore the effect of combining StatefulSets with Headless Services by creating a new Pod from which we can execute nslookup commands.
kubectl -n go-demo-3 \
    run -it \
    --image busybox dns-test \
    --restart=Never \
    --rm sh
We created a new Pod based on busybox inside the go-demo-3 Namespace. We specified sh as the command, together with the -it arguments that allocated a TTY and standard input (stdin). As a result, we are inside the container created through the dns-test Pod, and we can execute our first nslookup query.
nslookup db
The output is as follows.
Server:    100.64.0.10
Address 1: 100.64.0.10 kube-dns.kube-system.svc.cluster.local

Name:      db
Address 1: 100.96.2.14 db-0.db.go-demo-3.svc.cluster.local
Address 2: 100.96.2.15 db-2.db.go-demo-3.svc.cluster.local
Address 3: 100.96.3.8  db-1.db.go-demo-3.svc.cluster.local
We can see that the request was picked up by the kube-dns server and that it returned three addresses, one for each Pod in the StatefulSet.
The StatefulSet is using the Headless Service to control the domain of its Pods.
The domain managed by this Service takes the form of [SERVICE_NAME].[NAMESPACE].svc.cluster.local, where cluster.local is the cluster domain. However, we used a short syntax in our nslookup query that requires only the name of the service (db). Since the service is in the same Namespace, we did not need to specify go-demo-3. The Namespace is required only if we'd like to establish communication from one Namespace to another.
When we executed nslookup, a request was sent to the CNAME of the Headless Service (db). It, in turn, returned SRV records associated with it. Those records point to A record entries that contain the Pods' IP addresses, one for each of the Pods managed by the StatefulSet.
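If we'd rather use the fully qualified name, the query would go through the same records, and the result should be the same as before.

nslookup db.go-demo-3.svc.cluster.local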
Let's do nslookup of one of the Pods managed by the StatefulSet.
The Pods can be accessed with a combination of the Pod name (for example, db-0) and the name of the Service. If the Pods are in a different Namespace, we need to add it as a suffix. Finally, if we want to use the full CNAME, we can add svc.cluster.local as well. We can see the full address in the previous output (for example, db-0.db.go-demo-3.svc.cluster.local). All in all, we can access the Pod with the index 0 as db-0.db, db-0.db.go-demo-3, or db-0.db.go-demo-3.svc.cluster.local. Any of the three combinations should work since we are inside a Pod running in the same Namespace. So, we'll use the shortest version.
nslookup db-0.db
The output is as follows.
Server:    100.64.0.10
Address 1: 100.64.0.10 kube-dns.kube-system.svc.cluster.local

Name:      db-0.db
Address 1: 100.96.2.14 db-0.db.go-demo-3.svc.cluster.local
We can see that the output matches part of the output of the previous nslookup query. The only difference is that this time it is limited to the particular Pod.
What we got with the combination of a StatefulSet and a Headless Service is a stable network identity. Unless we change the number of replicas of this StatefulSet, the CNAME records are permanent. Unlike Deployments, StatefulSets maintain a sticky identity for each of their Pods. These Pods are created from the same spec, but they are not interchangeable. Each has a persistent identifier that is maintained across any rescheduling.
Pod ordinals, hostnames, SRV records, and A records never change. However, the same cannot be said for the IP addresses associated with them. They might change. That is why it is crucial not to configure applications to connect to Pods in a StatefulSet by IP address.
Now that we know that the Pods managed by a StatefulSet have a stable network identity, we can proceed and configure the MongoDB replica set.
exit

kubectl -n go-demo-3 \
    exec -it db-0 -- sh
We exited the dns-test Pod and entered one of the MongoDB containers created by the StatefulSet.
mongo

rs.initiate( {
   _id : "rs0",
   members: [
     {_id: 0, host: "db-0.db:27017"},
     {_id: 1, host: "db-1.db:27017"},
     {_id: 2, host: "db-2.db:27017"}
   ]
})
We entered the mongo shell and initiated a MongoDB replica set (rs.initiate). The members of the replica set are the addresses of the three Pods combined with the default MongoDB port 27017.
The output is { "ok" : 1 }, thus confirming that we (probably) configured the replica set correctly.
Remember that our goal is not to go deep into MongoDB configuration, but only to explore some of the benefits behind StatefulSets.
If we used a Deployment, we would not get a stable network identity. Any update would create new Pods with new identities. With a StatefulSet, on the other hand, we know that there will always be a db-[INDEX].db address, no matter how often we update it. Such a feature is mandatory when applications need to form an internal cluster (or a replica set) and were not designed to discover each other dynamically. That is indeed the case with MongoDB.
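As an illustration only (the go-demo-3 api configures its own connection, and the options shown here are placeholders), a client could rely on those stable names through a standard MongoDB connection string that lists all the members of the replica set.

mongodb://db-0.db:27017,db-1.db:27017,db-2.db:27017/?replicaSet=rs0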
We'll confirm that the MongoDB replica set was created correctly by outputting its status.
rs.status()
The output, limited to the relevant parts, is as follows.
... "members" : [ { "_id" : 0, ... "stateStr" : "PRIMARY", ... }, { "_id" : 1, ... "stateStr" : "SECONDARY", ... "syncingTo" : "db-0.db:27017", ... }, { "_id" : 2, ... "stateStr" : "SECONDARY", ... "syncingTo" : "db-0.db:27017", ... } ], "ok" : 1 }
We can see that all three MongoDB Pods are members of the replica set. One of them is the primary. If it fails, Kubernetes will reschedule it and, since it's managed by the StatefulSet, it'll maintain the same stable network identity. The secondary members are all syncing with the primary, which is reachable through the db-0.db:27017 address.
Now that the database is finally operational, we should confirm that the api Pods are running.
exit

exit

kubectl -n go-demo-3 get pods
We exited the MongoDB shell and the db-0 container, and we listed the Pods in the go-demo-3 Namespace.
The output of the latter command is as follows.
NAME    READY STATUS  RESTARTS AGE
api-... 1/1   Running 8        17m
api-... 1/1   Running 8        17m
api-... 1/1   Running 8        17m
db-0    1/1   Running 0        17m
db-1    1/1   Running 0        17m
db-2    1/1   Running 0        16m
If, in your case, api Pods are still not running, please wait for a few moments until Kubernetes restarts them.
Now that the MongoDB replica set is operational, the api Pods were able to connect to it, and Kubernetes changed their statuses to Running. The whole application is operational.
There is one more StatefulSet-specific feature we should discuss.
Let's see what happens if, for example, we update the image of the db container.
The updated definition is in sts/go-demo-3-sts-upd.yml.
diff sts/go-demo-3-sts.yml \
    sts/go-demo-3-sts-upd.yml
As you can see from the diff, the only change is the image. We'll update the mongo version from 3.3 to 3.4.
kubectl apply \
    -f sts/go-demo-3-sts-upd.yml \
    --record

kubectl -n go-demo-3 get pods
We applied the new definition and retrieved the list of Pods inside the Namespace.
The output is as follows.
NAME    READY STATUS            RESTARTS AGE
api-... 1/1   Running           6        14m
api-... 1/1   Running           6        14m
api-... 1/1   Running           6        14m
db-0    1/1   Running           0        6m
db-1    1/1   Running           0        6m
db-2    0/1   ContainerCreating 0        14s
We can see that the StatefulSet chose to update only one of its Pods. Moreover, it picked the one with the highest index.
Let's see the output of the same command half a minute later.
NAME    READY STATUS            RESTARTS AGE
api-... 1/1   Running           6        15m
api-... 1/1   Running           6        15m
api-... 1/1   Running           6        15m
db-0    1/1   Running           0        7m
db-1    0/1   ContainerCreating 0        5s
db-2    1/1   Running           0        32s
The StatefulSet finished updating the db-2 Pod and moved to the one before it.
And so on, and so forth, all the way until all the Pods that form the StatefulSet were updated.
The Pods in the StatefulSet were updated in reverse ordinal order. The StatefulSet terminated one of the Pods, and it waited for its status to become Running before it moved to the next one.
All in all, when a StatefulSet is created, it generates Pods sequentially, starting with the index 0 and moving upwards. Updates to StatefulSets follow the same logic, except that the StatefulSet begins with the Pod with the highest index and flows downwards.
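If we need finer control over that process, a StatefulSet's updateStrategy can be tuned. As a sketch only (not part of our definition), a partitioned rolling update would restrict the update to Pods with an ordinal greater than or equal to the partition value.

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2 # only db-2 would be updated; db-0 and db-1 would keep the previous spec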
We did manage to get the MongoDB replica set running, but the cost was too high. Creating a Mongo replica set manually is not a good option. It should not be an option at all.
We'll remove the go-demo-3 Namespace (and everything inside it) and try to improve our process for deploying MongoDB.
kubectl delete ns go-demo-3