Diving into Kubernetes architecture in depth
Kubernetes has very ambitious goals. It aims to simplify the orchestration, deployment, and management of distributed systems across a wide range of environments and cloud providers. It provides many capabilities and services that must work across all these diverse environments and use cases, while evolving and remaining simple enough for mere mortals to use. This is a tall order. Kubernetes achieves it by following a crystal-clear, high-level design and a well-thought-out architecture that promotes extensibility and pluggability.
Kubernetes originally had many hard-coded or environment-aware components, but the trend is to refactor them into plugins and keep the core small, generic, and abstract.
In this section, we will peel Kubernetes like an onion, starting with various distributed systems design patterns and how Kubernetes supports them, then go over the surface of Kubernetes, which is its set of APIs, and then take a look at the actual components that comprise Kubernetes. Finally, we will take a quick tour of the source-code tree to gain even better insight into the structure of Kubernetes itself.
At the end of this section, you will have a solid understanding of Kubernetes architecture and implementation, and why certain design decisions were made.
Distributed systems design patterns
All happy (working) distributed systems are alike, to paraphrase Tolstoy in Anna Karenina. That means that, to function properly, all well-designed distributed systems must follow certain best practices and principles. Kubernetes doesn’t want to be just a management system; it wants to support and enable these best practices and provide high-level services to developers and administrators. Let’s look at some of these best practices, described as design patterns. We will start with single-node patterns such as sidecar, ambassador, and adapter, and then discuss multi-node patterns.
Sidecar pattern
The sidecar pattern is about co-locating another container in a pod in addition to the main application container. The application container is unaware of the sidecar container and just goes about its business. A great example is a central logging agent. Your main container can just log to stdout, but the sidecar container will send all logs to a central logging service where they will be aggregated with the logs from the entire system. The benefits of using a sidecar container versus adding central logging to the main application container are enormous. First, applications are not burdened anymore with central logging, which could be a nuisance. If you want to upgrade or change your central logging policy or switch to a totally new provider, you just need to update the sidecar container and deploy it. None of your application containers change, so you can’t break them by accident. The Istio service mesh uses the sidecar pattern to inject its proxies into each pod.
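As a minimal sketch, here is how such a pod could be defined with the Kubernetes Go API types (the image names and the log-shipper sidecar are hypothetical, not a specific product):
package patterns

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// appWithLoggingSidecar builds a pod that runs the main application container
// alongside a logging sidecar; both containers share the pod's network and volumes.
func appWithLoggingSidecar() *v1.Pod {
    return &v1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "app-with-logging-sidecar"},
        Spec: v1.PodSpec{
            Containers: []v1.Container{
                // The main container just logs to stdout and is unaware of the sidecar.
                {Name: "app", Image: "example.com/app:1.0"},
                // The sidecar ships the logs to a central logging service.
                {Name: "log-shipper", Image: "example.com/log-shipper:1.0"},
            },
        },
    }
}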
Ambassador pattern
The ambassador pattern is about representing a remote service as if it were local and possibly enforcing some policy. A good example of the ambassador pattern is if you have a Redis cluster with one master for writes and many replicas for reads. A local ambassador container can serve as a proxy and expose Redis to the main application container on localhost. The main application container simply connects to Redis on localhost:6379 (Redis’s default port), but it actually connects to the ambassador running in the same pod, which filters the requests, sending write requests to the real Redis master and read requests randomly to one of the read replicas. Just like with the sidecar pattern, the main application has no idea what’s going on. That can help a lot when testing against a real local Redis cluster. Also, if the Redis cluster configuration changes, only the ambassador needs to be modified; the main application remains blissfully unaware.
Adapter pattern
The adapter pattern is about standardizing the output of the main application container. Consider the case of a service that is being rolled out incrementally: it may generate reports in a format that doesn’t conform to the previous version. Other services and applications that consume that output haven’t been upgraded yet. An adapter container can be deployed in the same pod as the new application container and massage its output to match the old version until all consumers have been upgraded. The adapter container shares the filesystem with the main application container, so it can watch the local filesystem, and whenever the new application writes something, it immediately adapts it.
Multi-node patterns
The single-node patterns described earlier are all supported directly by Kubernetes via pods scheduled on a single node. Multi-node patterns involve pods scheduled on multiple nodes. Multi-node patterns such as leader election, work queues, and scatter-gather are not supported directly, but composing pods with standard interfaces to accomplish them is a viable approach with Kubernetes.
Level-triggered infrastructure and reconciliation
Kubernetes is all about control loops. It keeps watching itself and correcting issues. Level-triggered infrastructure means that Kubernetes has a desired state and constantly strives toward it. For example, if a ReplicaSet has a desired state of 3 replicas and the count drops to 2, the ReplicaSet controller (part of Kubernetes) will notice and work to get back to 3 replicas. The alternative approach, edge-triggering, is event-based: when the number of replicas drops from 3 to 2, create a new replica. This approach is very brittle and has many edge cases, especially in distributed systems where events like replicas coming and going can happen simultaneously.
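Here is a minimal, self-contained sketch of a level-triggered reconciliation loop; the desired and observed counts are stand-ins for real cluster state, not Kubernetes APIs:
package main

import (
    "fmt"
    "time"
)

// reconcile compares the observed state to the desired state and returns the
// corrective delta, regardless of which events (if any) led to the drift.
func reconcile(desired, observed int) int {
    return desired - observed // positive: create replicas, negative: remove replicas
}

func main() {
    desired := 3
    observed := 2
    for i := 0; i < 3; i++ {
        if delta := reconcile(desired, observed); delta != 0 {
            fmt.Printf("observed %d, desired %d, correcting by %+d\n", observed, desired, delta)
            observed += delta // stand-in for actually creating or deleting pods
        }
        time.Sleep(10 * time.Millisecond) // a real controller watches and resyncs instead of sleeping
    }
}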
Having covered the distributed systems design patterns that Kubernetes supports, let’s study the Kubernetes APIs.
The Kubernetes APIs
If you want to understand the capabilities of a system and what it provides, you must pay a lot of attention to its API. The API provides a comprehensive view of what you can do with the system as a user. Kubernetes exposes several sets of REST APIs for different purposes and audiences via API groups. Some APIs are used primarily by tools and some can be used directly by developers. An important aspect of the APIs is that they are under constant development. The Kubernetes developers keep this manageable by extending the APIs (adding new objects and new fields to existing objects) while avoiding renaming or dropping existing objects and fields. In addition, all API endpoints are versioned, and often carry an alpha or beta notation too. For example:
/api/v1
/api/v2alpha1
You can access the API through the kubectl CLI, via client libraries, or directly through REST API calls. There are elaborate authentication and authorization mechanisms we will explore in Chapter 4, Securing Kubernetes. If you have the right permissions you can list, view, create, update, and delete various Kubernetes objects. At this point, let’s get a glimpse into the surface area of the APIs.
The best way to explore the API is via API groups. Some API groups are enabled by default. Other groups can be enabled/disabled via flags. For example, to disable the autoscaling/v1 group and enable the autoscaling/v2beta2 group, you can set the --runtime-config flag when running the API server as follows:
--runtime-config=autoscaling/v1=false,autoscaling/v2beta2=true
Note that managed Kubernetes clusters in the cloud don’t let you specify flags for the API server (as they manage it).
Resource categories
In addition to API groups, another useful classification of available APIs is by functionality. The Kubernetes API is huge and breaking it down into categories helps a lot when you’re trying to find your way around. Kubernetes defines the following resource categories:
- Workloads: Objects you use to manage and run containers on the cluster.
- Discovery and load balancing: Objects you use to expose your workloads to the world as externally accessible, load-balanced services.
- Config and storage: Objects you use to initialize and configure your applications, and to persist data that is outside the container.
- Cluster: Objects that define how the cluster itself is configured; these are typically used only by cluster operators.
- Metadata: Objects you use to configure the behavior of other resources within the cluster, such as HorizontalPodAutoscaler for scaling workloads.
In the following subsections, we’ll list the resources that belong to each category along with their API group. We will not specify versions here because APIs move rapidly from alpha to beta to GA (general availability), and from v1 to v2, and so on.
Workloads resource category
The workloads category contains the following resources with their corresponding API groups:
- Container: core
- CronJob: batch
- ControllerRevision: apps
- DaemonSet: apps
- Deployment: apps
- HorizontalPodAutoscaler: autoscaling
- Job: batch
- Pod: core
- PodTemplate: core
- PriorityClass: scheduling.k8s.io
- ReplicaSet: apps
- ReplicationController: core
- StatefulSet: apps
Controllers create pods, which in turn run the containers and provide their necessary dependencies, such as shared or persistent storage volumes, as well as configuration or secret data injected into the containers.
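To make the workloads category concrete, here is a minimal sketch of a Deployment built with the Go API types (the names, labels, and image are hypothetical):
package workloads

import (
    appsv1 "k8s.io/api/apps/v1"
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// webDeployment asks the Deployment controller (apps group) to keep
// three replicas of a single-container pod running at all times.
func webDeployment() *appsv1.Deployment {
    replicas := int32(3)
    labels := map[string]string{"app": "web"}
    return &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{Name: "web"},
        Spec: appsv1.DeploymentSpec{
            Replicas: &replicas,
            Selector: &metav1.LabelSelector{MatchLabels: labels},
            Template: v1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{Labels: labels},
                Spec: v1.PodSpec{
                    Containers: []v1.Container{{Name: "web", Image: "nginx:1.25"}},
                },
            },
        },
    }
}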
Here is a detailed description of one of the most common operations, which gets a list of all the pods across all namespaces via the REST API:
GET /api/v1/pods
It accepts various query parameters (all optional); a client-go sketch follows the list:
- fieldSelector: Specifies a selector to narrow down the returned objects based on their fields. The default behavior includes all objects.
- labelSelector: Defines a selector to filter the returned objects based on their labels. By default, all objects are included.
- limit/continue: The limit parameter specifies the maximum number of responses to be returned in a list call. If there are more items available, the server sets the continue field in the list metadata. This value can be passed to a follow-up query to fetch the next set of results.
- pretty: When set to 'true', the output is formatted in a human-readable manner.
- resourceVersion: Sets a constraint on the acceptable resource versions that can be served by the request. If not specified, it defaults to unset.
- resourceVersionMatch: Determines how the resourceVersion constraint is applied in list calls. If not specified, it defaults to unset.
- timeoutSeconds: Specifies a timeout duration for the list/watch call. This limits the duration of the call, regardless of any activity or inactivity.
- watch: Enables the monitoring of changes to the described resources and returns a continuous stream of notifications for additions, updates, and removals. The resourceVersion parameter must be specified.
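The same parameters are exposed by the official Go client (client-go) as ListOptions. Here is a minimal sketch that lists running pods labeled app=web across all namespaces, assuming a kubeconfig at the default location; the selector values are hypothetical:
package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from the local kubeconfig (~/.kube/config by default).
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }
    // Equivalent to GET /api/v1/pods with labelSelector, fieldSelector, and limit.
    pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
        LabelSelector: "app=web",
        FieldSelector: "status.phase=Running",
        Limit:         50,
    })
    if err != nil {
        panic(err)
    }
    for _, pod := range pods.Items {
        fmt.Println(pod.Namespace, pod.Name)
    }
}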
Discovery and load balancing
Workloads in a cluster are only accessible within the cluster by default. To make them accessible externally, either a LoadBalancer or a NodePort Service needs to be used. However, for development purposes, internally accessible workloads can be accessed through the API server using the kubectl proxy command. This category contains the following resources with their corresponding API groups (a minimal Service sketch follows the list):
- Endpoints: core
- EndpointSlice: discovery.k8s.io
- Ingress: networking.k8s.io
- IngressClass: networking.k8s.io
- Service: core
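As a minimal sketch, here is a NodePort Service that exposes pods labeled app=web outside the cluster (the names, labels, and ports are hypothetical):
package discovery

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// webService exposes the pods selected by app=web on port 80 inside the
// cluster and on an automatically allocated node port outside of it.
func webService() *v1.Service {
    return &v1.Service{
        ObjectMeta: metav1.ObjectMeta{Name: "web"},
        Spec: v1.ServiceSpec{
            Type:     v1.ServiceTypeNodePort,
            Selector: map[string]string{"app": "web"},
            Ports: []v1.ServicePort{
                {Port: 80, TargetPort: intstr.FromInt(8080)},
            },
        },
    }
}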
Config and storage
Dynamic configuration without redeployment and secret management are cornerstones of Kubernetes and of running complex distributed applications on your Kubernetes cluster. Secrets and configuration are not baked into container images; they are stored in the Kubernetes state store (usually etcd). Kubernetes also provides a lot of abstractions for managing arbitrary storage. Here are some of the primary resources (a ConfigMap sketch follows the list):
- ConfigMap: core
- CSIDriver: storage.k8s.io
- CSINode: storage.k8s.io
- CSIStorageCapacity: storage.k8s.io
- Secret: core
- PersistentVolumeClaim: core
- StorageClass: storage.k8s.io
- Volume: core
- VolumeAttachment: storage.k8s.io
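As a minimal sketch, here is a ConfigMap and a pod that consumes one of its keys as an environment variable (all names and values are hypothetical):
package config

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// appConfig keeps configuration out of the container image; it lives in the
// cluster state store and can be updated without rebuilding the image.
func appConfig() *v1.ConfigMap {
    return &v1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{Name: "app-config"},
        Data:       map[string]string{"LOG_LEVEL": "debug"},
    }
}

// appPod injects the LOG_LEVEL key from the ConfigMap into the container's environment.
func appPod() *v1.Pod {
    return &v1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "app"},
        Spec: v1.PodSpec{
            Containers: []v1.Container{{
                Name:  "app",
                Image: "example.com/app:1.0",
                Env: []v1.EnvVar{{
                    Name: "LOG_LEVEL",
                    ValueFrom: &v1.EnvVarSource{
                        ConfigMapKeyRef: &v1.ConfigMapKeySelector{
                            LocalObjectReference: v1.LocalObjectReference{Name: "app-config"},
                            Key:                  "LOG_LEVEL",
                        },
                    },
                }},
            }},
        },
    }
}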
Metadata
The metadata resources typically show up as sub-resources of the resources they configure. For example, a limit range is defined at the namespace level and can specify (a LimitRange sketch follows the list):
- The range of compute resource usage (minimum and maximum) for pods or containers within a namespace.
- The range of storage requests (minimum and maximum) per PersistentVolumeClaim within a namespace.
- The ratio between the resource request and limit for a specific resource within a namespace.
- The default request/limit for compute resources within a namespace, which are automatically injected into containers at runtime.
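As a minimal sketch, here is a LimitRange that caps container resources and injects defaults in its namespace (the values are hypothetical):
package metadata

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// containerLimits constrains container resource usage in its namespace and
// provides defaults for containers that don't specify requests or limits.
func containerLimits() *v1.LimitRange {
    return &v1.LimitRange{
        ObjectMeta: metav1.ObjectMeta{Name: "container-limits"},
        Spec: v1.LimitRangeSpec{
            Limits: []v1.LimitRangeItem{{
                Type: v1.LimitTypeContainer,
                Max: v1.ResourceList{
                    v1.ResourceCPU:    resource.MustParse("2"),
                    v1.ResourceMemory: resource.MustParse("1Gi"),
                },
                DefaultRequest: v1.ResourceList{
                    v1.ResourceCPU:    resource.MustParse("100m"),
                    v1.ResourceMemory: resource.MustParse("128Mi"),
                },
                Default: v1.ResourceList{
                    v1.ResourceCPU:    resource.MustParse("500m"),
                    v1.ResourceMemory: resource.MustParse("256Mi"),
                },
            }},
        },
    }
}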
You will not interact with these objects directly most of the time. There are many metadata resources. You can find the complete list here: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#-strong-metadata-apis-strong-.
Cluster
The resources in the cluster category are designed for use by cluster operators as opposed to developers. There are many resources in this category as well. Here are some of the most important resources:
- Namespace: core
- Node: core
- PersistentVolume: core
- ResourceQuota: core
- Role: rbac.authorization.k8s.io
- RoleBinding: rbac.authorization.k8s.io
- ClusterRole: rbac.authorization.k8s.io
- ClusterRoleBinding: rbac.authorization.k8s.io
- NetworkPolicy: networking.k8s.io
Now that we understand how Kubernetes organizes and exposes its capabilities via API groups and resource categories, let’s see how it manages the physical infrastructure and keeps it in line with the desired state of the cluster.
Kubernetes components
A Kubernetes cluster has several control plane components used to control the cluster, as well as node components that run on each worker node. Let’s get to know all these components and how they work together.
Control plane components
The control plane components can all run on one node, but in a highly available setup or a very large cluster, they may be spread across multiple nodes.
API server
The Kubernetes API server exposes the Kubernetes REST API. It can easily scale horizontally as it is stateless and stores all the data in the etcd cluster (or another data store in Kubernetes distributions like k3s). The API server is the embodiment of the Kubernetes control plane.
etcd
etcd is a highly reliable distributed data store. Kubernetes uses it to store the entire cluster state. In small, transient clusters a single instance of etcd can run on the same node with all the other control plane components. But, for more substantial clusters, it is typical to have a 3-node or even 5-node etcd cluster for redundancy and high availability.
Kube controller manager
The Kube controller manager is a collection of various managers rolled up into one binary. It contains the replica set controller, the pod controller, the service controller, the endpoints controller, and others. All these managers watch over the state of the cluster via the API, and their job is to steer the cluster into the desired state.
Cloud controller manager
When running in the cloud, Kubernetes allows cloud providers to integrate their platform for the purpose of managing nodes, routes, services, and volumes. The cloud provider code interacts with Kubernetes code. It replaces some of the functionality of the Kube controller manager. When running Kubernetes with a cloud controller manager, you must set the Kube controller manager flag --cloud-provider to external. This will disable the control loops that the cloud controller manager is taking over.
The cloud controller manager was introduced in Kubernetes 1.6, and it’s being used by multiple cloud providers already such as:
- GCP
- AWS
- Azure
- BaiduCloud
- Digital Ocean
- Oracle
- Linode
OK. Let’s look at some code. The specific code is not that important. The goal is just to give you a taste of what Kubernetes code looks like. Kubernetes is implemented in Go. A quick note about Go to help you parse the code: the method name comes first, followed by the method’s parameters in parentheses. Each parameter is a pair, consisting of a name followed by its type. Finally, the return values are specified. Go allows multiple return values. It is very common to return an error object in addition to the actual result. If everything is OK, the error object will be nil.
Here is the main interface of the cloudprovider package:
package cloudprovider

import (
    "context"
    "errors"
    "fmt"
    "strings"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/informers"
    clientset "k8s.io/client-go/kubernetes"
    restclient "k8s.io/client-go/rest"
)

// Interface is an abstract, pluggable interface for cloud providers.
type Interface interface {
    Initialize(clientBuilder ControllerClientBuilder, stop <-chan struct{})
    LoadBalancer() (LoadBalancer, bool)
    Instances() (Instances, bool)
    InstancesV2() (InstancesV2, bool)
    Zones() (Zones, bool)
    Clusters() (Clusters, bool)
    Routes() (Routes, bool)
    ProviderName() string
    HasClusterID() bool
}
Most of the methods return other interfaces with their own methods. For example, here is the LoadBalancer interface:
type LoadBalancer interface {
    GetLoadBalancer(ctx context.Context, clusterName string, service *v1.Service) (status *v1.LoadBalancerStatus, exists bool, err error)
    GetLoadBalancerName(ctx context.Context, clusterName string, service *v1.Service) string
    EnsureLoadBalancer(ctx context.Context, clusterName string, service *v1.Service, nodes []*v1.Node) (*v1.LoadBalancerStatus, error)
    UpdateLoadBalancer(ctx context.Context, clusterName string, service *v1.Service, nodes []*v1.Node) error
    EnsureLoadBalancerDeleted(ctx context.Context, clusterName string, service *v1.Service) error
}
Kube scheduler
The kube-scheduler is responsible for scheduling pods onto nodes. This is a very complicated task as it needs to consider multiple interacting factors, such as:
- Resource requirements
- Service requirements
- Hardware/software policy constraints
- Node affinity and anti-affinity specifications
- Pod affinity and anti-affinity specifications
- Taints and tolerations
- Local storage requirements
- Data locality
- Deadlines
If you need some special scheduling logic not covered by the default Kube scheduler, you can replace it with your own custom scheduler. You can also run your custom scheduler side by side with the default scheduler and have your custom scheduler schedule only a subset of the pods.
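As a minimal sketch, a pod opts into a custom scheduler by naming it in its spec; my-custom-scheduler is a hypothetical scheduler name:
package scheduling

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// customScheduledPod is ignored by the default scheduler and picked up
// by whichever scheduler registered itself as "my-custom-scheduler".
func customScheduledPod() *v1.Pod {
    return &v1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "special-workload"},
        Spec: v1.PodSpec{
            SchedulerName: "my-custom-scheduler",
            Containers:    []v1.Container{{Name: "app", Image: "example.com/app:1.0"}},
        },
    }
}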
DNS
Starting with Kubernetes 1.3, a DNS service is part of the standard Kubernetes cluster. It is scheduled as a regular pod. Every service (except headless services) receives a DNS name. Pods can receive a DNS name too. This is very useful for automatic discovery.
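For example, with the default cluster domain, a Service named my-service in the my-namespace namespace is typically resolvable inside the cluster at my-service.my-namespace.svc.cluster.local (the service and namespace names here are hypothetical).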
We covered all the control plane components. Let’s look at the Kubernetes components running on each node.
Node components
Nodes in the cluster need a couple of components to interact with the API server, receive workloads to execute, and update the API server regarding their status.
Proxy
The kube-proxy does low-level network housekeeping on each node. It reflects the Kubernetes services locally and can do TCP and UDP forwarding. It finds cluster IPs via environment variables or DNS.
kubelet
The kubelet is the Kubernetes representative on the node. It is in charge of communicating with the API server and managing the running pods. That includes the following:
- Receive pod specs
- Download pod secrets from the API server
- Mount volumes
- Run the pod’s containers (via the configured container runtime)
- Report the status of the node and each pod
- Run container liveness, readiness, and startup probes
In this section, we dug into the guts of Kubernetes and explored its architecture, from the high-level vision and the design patterns it supports, through its APIs, to the components used to control and manage the cluster. In the next section, we will take a quick look at the various runtimes that Kubernetes supports.