Understanding Kubernetes infrastructure design considerations
When it comes to Kubernetes infrastructure design, there are a few important considerations to take into account. Almost every cloud infrastructure architecture shares the same set of considerations; however, we will discuss them from a Kubernetes perspective and shed some light on them.
Scaling and elasticity
Public cloud infrastructure, such as AWS, Azure, and GCP, introduced scaling and elasticity capabilities at unprecedented levels. Kubernetes and containerization technologies arrived to build upon these capabilities and extend them further.
When you design a Kubernetes cluster infrastructure, you should ensure that your architecture covers the following two areas:
- Scalable Kubernetes infrastructure
- Scalable workloads deployed to the Kubernetes clusters
To achieve the first requirement, there are parts that depend on the underlying infrastructure, either public cloud or on-premises, and other parts that depend on the Kubernetes cluster itself.
The first part is usually solved when you choose to use a managed Kubernetes service such as EKS, AKS, or GKE, as the cluster's control plane and worker nodes will be scalable and supported by other layers of scalable infrastructure.
However, in some use cases, you may need to deploy a self-managed Kubernetes cluster, either on-premises or in the cloud, and in this case, you need to consider how to support scaling and elasticity to enable your Kubernetes clusters to operate at their full capacity.
All public cloud providers offer the concept of compute autoscaling groups, and Kubernetes clusters are built on top of them. However, because of the nature of the workloads running on Kubernetes, scaling actions need to be synchronized with the cluster's scheduling decisions. This is where the Kubernetes cluster autoscaler comes to our aid.
Cluster autoscaler (CAS) is a Kubernetes cluster add-on that you optionally deploy to your cluster, and it automatically scales the number of worker nodes up and down based on the set of conditions and configurations that you specify. Basically, it triggers cluster upscaling when there is a pod that cannot be scheduled due to insufficient compute resources, and it triggers cluster downscaling when there are underutilized nodes whose pods can be rescheduled onto other nodes. You should take into consideration the time a cloud provider needs to launch a new node, as this can be a problem for time-sensitive apps; in this case, you may consider a CAS configuration that enables node overprovisioning (a minimal sketch of this pattern follows below).
For more information about CAS, refer to the following link: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler.
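Node overprovisioning is commonly implemented with a low-priority placeholder deployment that reserves spare capacity, so that CAS keeps warm nodes around and real pods do not have to wait for node launch times. The following is a minimal sketch of that pattern; the names, replica count, and resource requests are illustrative assumptions, not values mandated by CAS:

```yaml
# A negative-priority class makes the placeholder pods the first to be
# preempted when real workloads need room.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning        # illustrative name
value: -10                      # lower than any real workload
globalDefault: false
description: "Placeholder pods that reserve spare cluster capacity"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-overprovisioner # illustrative name
  namespace: kube-system
spec:
  replicas: 2                   # tune to the amount of headroom you want
  selector:
    matchLabels:
      app: cluster-overprovisioner
  template:
    metadata:
      labels:
        app: cluster-overprovisioner
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 500m           # capacity each placeholder reserves
            memory: 512Mi
```

When a real pod becomes unschedulable, the scheduler preempts the placeholder pods; they go back to pending, which in turn triggers CAS to add nodes and keep the buffer warm.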
To achieve the second scaling requirement, Kubernetes provides two solutions for autoscaling pods:
- Horizontal Pod Autoscaler (HPA): This works similarly to cloud autoscaling groups, but at the pod deployment level. Think of the pod as the VM instance. HPA scales the number of pods based on a specific metric threshold; this can be CPU or memory utilization, or a custom metric that you define (a minimal manifest follows this list). To understand how HPA works, you can continue reading about it here: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/.
- Vertical Pod Autoscaler (VPA): This scales the pod vertically by increasing its CPU and memory requests and limits according to the pod's usage metrics. Think of VPA as upscaling/downscaling a VM instance by changing its type in the public cloud. VPA can affect CAS and trigger upscaling events, so you should align the CAS and VPA configurations to avoid any unpredictable scaling behavior. To understand how VPA works, you can continue reading about it here: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler.
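As an illustration, here is a minimal sketch of both autoscalers targeting a hypothetical web Deployment; the Deployment name, replica bounds, and thresholds are assumptions made for the example, and the VPA object requires the VPA add-on to be installed in the cluster:

```yaml
# HPA (autoscaling/v2): scale the "web" Deployment on average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # assumed Deployment name
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add pods when average CPU exceeds 70%
---
# VPA: recommend right-sized CPU/memory requests for the same Deployment.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"            # recommendation-only, so it does not fight the HPA
```

Note that letting HPA and VPA act on the same CPU or memory metric can produce conflicting scaling decisions, which is why the VPA above runs in recommendation-only mode.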
We highly recommend using HPA and VPA for your production deployments (they are not essential for non-production environments). We will give examples of how to use both of them when deploying production-grade apps and services in Chapter 8, Deploying Seamless and Reliable Applications.
High availability and reliability
Uptime is a measure of reliability and is usually the top metric that infrastructure teams measure and target for enhancement. Uptime drives the service-level objectives (SLOs) for your services and the service-level agreements (SLAs) with your customers, and it also indicates how stable and reliable your systems and Software as a Service (SaaS) products are. High availability is the key to increasing uptime, and when it comes to Kubernetes cluster infrastructure, the same rules still apply. This is why designing a highly available cluster and workloads is an essential requirement for a production-grade Kubernetes cluster.
You can architect a highly available Kubernetes infrastructure on different levels of availability as follows:
- A cluster in a single public cloud zone (single data center): This is the easiest architecture to implement, but it carries the highest risk. We do not recommend this solution.
- A cluster in multiple zones (multiple data centers) but in a single cloud region: This is still easy to implement, provides a higher level of availability, and is a common architecture for Kubernetes clusters. However, when your cloud provider has a full region outage, your cluster will be entirely unavailable. Such full region outages rarely happen, but you still need to be prepared for this scenario.
- Multiple clusters across regions, but within the same cloud provider: In this architecture, you usually run multiple federated Kubernetes clusters to serve your production workloads. This is usually the preferred solution for high availability, but it comes at a cost that makes it hard to implement and operate, notably potentially poor cross-region network performance and the difficulty of sharing storage for stateful applications. We do not recommend this architecture since, for the majority of SaaS products, it is enough to deploy Kubernetes in a single region across multiple zones. However, if multi-region is a requirement for a reason other than high availability, you may consider multi-region federated Kubernetes clusters as a solution.
- Multiple clusters across a multi-cloud deployment: This architecture is still unpopular due to incompatibilities across cloud providers, inter-cluster network complexity, and the higher costs of cross-provider network traffic, implementation, and operations. However, it is worth mentioning the increasing number of multi-cloud management solutions that are endeavoring to tackle these challenges, and you may wish to consider a multi-cluster management solution such as Anthos from Google. You can learn more about it here: https://cloud.google.com/anthos.
As you may notice, Kubernetes has different architectural flavors when it comes to high availability, and having different choices makes Kubernetes more powerful for different use cases. The second option is the most common one as of now, as it strikes a balance between ease of implementation and operation and the level of high availability. We are optimistically looking forward to a time when we can reach the fourth level, where we can easily deploy Kubernetes clusters across cloud providers and gain all the high availability benefits without the burden of tough operations and increased costs.
As for the availability of the cluster itself, it goes without saying that the Kubernetes control plane components should run in a highly available mode, that is, with three or more control plane nodes, or preferably by letting the cloud manage the control plane for you, as in EKS, AKS, or GKE. As for workers, you should run one or more autoscaling groups or node groups/pools spread across zones to ensure high availability.
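As one concrete illustration, if you use a managed service such as EKS and provision it with eksctl, a cluster with a node group spanning three zones can be declared as follows; the cluster name, region, zones, and instance sizes are placeholder assumptions:

```yaml
# A minimal sketch of a multi-zone EKS cluster definition for eksctl.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod-cluster             # placeholder name
  region: us-east-1              # placeholder region
availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]
managedNodeGroups:
- name: workers
  instanceType: m5.large         # placeholder instance type
  minSize: 3
  maxSize: 9
  desiredCapacity: 3             # one node per zone to start with
```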
The other area where you need to consider high availability is the pods and workloads that you will deploy to your cluster. Although application architecture is beyond the scope of this book, it is still worth mentioning that developing new applications and services, or modernizing your existing ones so that they can run in a highly available mode, is the only way to make use of the raft of capabilities provided by the powerful Kubernetes infrastructure underneath them. Otherwise, you will end up with a very powerful cluster but with monolithic apps that can only run as a single instance!
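At the workload level, spreading replicas across zones and protecting them during voluntary disruptions (node drains and upgrades) goes a long way. The following is a minimal sketch for a hypothetical web Deployment; the names, replica counts, and image are assumptions for the example:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone   # spread replicas evenly across zones
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: nginx:1.25                          # placeholder image
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
---
# Keep a minimum number of replicas running during drains and rolling upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: web
```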
Security and compliance
Kubernetes infrastructure security spans all levels of your cluster, starting from the network layer, going through the OS level, up to cluster services and workloads. Luckily, Kubernetes has strong support for security, encryption, authentication, and authorization. We will learn about security in Chapter 6, Securing Kubernetes Effectively, of this book. However, during the design of the cluster infrastructure, you should give attention to important security decisions, such as securing the Kubernetes API server endpoint, as well as the cluster network design, security groups, firewalls, and network policies between the control plane components, worker nodes, and the public internet.
You will also need to plan ahead in terms of the infrastructure components or integrations between your cluster and identity management providers. This usually depends on your organization's security policies, which you need to align with your IT and security teams.
Another aspect to consider is the auditing and compliance of your cluster. Most organizations have cloud governance policies and compliance requirements, which you need to be aware of before you proceed with deploying your production on Kubernetes.
If you decide to use a multi-tenant cluster, the security requirements can be more challenging, and setting clear boundaries between the cluster tenants, as well as between cluster users from different internal teams, may result in decisions such as deploying a service mesh, hardening cluster network policies, and implementing a stricter Role-Based Access Control (RBAC) mechanism. All of this will impact your decisions while architecting the infrastructure of your first production cluster.
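As an illustration of hardened network policies in a multi-tenant cluster, the following minimal sketch denies all ingress traffic into a tenant's namespace and then allows only traffic originating from pods in that same namespace. The namespace name is an assumption, and enforcement requires a CNI plugin that supports NetworkPolicy:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: tenant-a             # assumed tenant namespace
spec:
  podSelector: {}                 # applies to every pod in the namespace
  policyTypes:
  - Ingress                       # with no ingress rules, all inbound traffic is denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}             # any pod in tenant-a; traffic from other namespaces stays blocked
```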
The Kubernetes community is keen on compliance and quality, and for that there are multiple tools and tests to ensure that your cluster achieves an acceptable level of security and compliance. We will learn about these tools and tests in Chapter 6, Securing Kubernetes Effectively.
Cost management and optimization
Cloud cost management is an important factor for all organizations adopting cloud technology, both for those just starting and those that are already in the cloud. Adding Kubernetes to your cloud infrastructure is expected to bring cost savings, as containerization enables you to utilize your compute resources at a scale that was never possible with VMs. Some organizations have achieved cost savings of up to 90% after moving to containers and Kubernetes.
However, without proper cost control, costs can rise again, and you can end up with a lot of wasted infrastructure spend on uncontrolled Kubernetes clusters. There are many tools and best practices to consider in relation to cost management, but we mainly want to focus on the actions and the technical decisions that you need to consider during infrastructure design.
We believe that there are two important aspects that require decisions, and these decisions will definitely affect your cluster infrastructure architecture:
- Running a single, but multi-tenant, cluster versus multiple clusters (that is, a single cluster per tenant)
- The cluster capacity: whether to run a few large worker nodes, many small worker nodes, or a mix of the two
There are no definitive correct decisions, but we will try to explore the choices in the next section, and how we can reach a decision.
There are other cost optimization considerations where an early decision can be made:
- Using spot/preemptible instances: These have proven to achieve huge cost savings; however, they come at a price! There is the risk of losing your workloads at any time, which affects your product's uptime and reliability. There are options for overcoming this, such as using spot instances for non-production workloads (development environments or CI/CD pipelines, for example), or for production workloads that can survive a disruption, such as batch data processing.
We highly recommend using spot instances for worker nodes; you can run them in their own node group/pool and assign to them the types of workloads that can tolerate being disrupted, as shown in the sketch after this list.
- Kubernetes cost observability: Most cloud platforms provide cost visibility and analytics for all cloud resources. However, having cost visibility at the deployment/service level of the cluster is essential, and this needs to be planned ahead: isolate workloads, teams, users, and environments using namespaces, and assign resource quotas to them. By doing that, you ensure that a cost reporting tool can relate usage back to a specific service or to cluster operations. This is essential for further decision making regarding cost reductions.
- Kubernetes cluster management: When you run a single-tenant cluster, or one cluster per environment for development, you usually end up with tons of clusters sprawled across your account, which can lead to increased cloud costs. The solution to this situation is to set up a cluster management solution from day one. This solution can be as simple as a script that scales down worker nodes during periods of inactivity, or it can be full automation with dashboards and a master cluster that manages the rest of the clusters.
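The following minimal sketch illustrates the spot instance and cost observability points above: a disruption-tolerant batch Job that is steered onto a dedicated spot node group via a node label and taint, and a per-namespace ResourceQuota that makes usage attributable to a team. The label key, taint, namespace, and quota values are assumptions for the example; the labels and taints that your cloud provider applies to spot nodes vary by platform:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-batch             # assumed disruption-tolerant workload
spec:
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        node-lifecycle: spot      # illustrative label on the spot node group
      tolerations:
      - key: node-lifecycle       # illustrative taint keeping other workloads off spot nodes
        operator: Equal
        value: spot
        effect: NoSchedule
      containers:
      - name: batch
        image: busybox:1.36
        command: ["sh", "-c", "echo processing && sleep 60"]
---
# A per-namespace quota so cost reports can be tied back to a specific team.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a               # assumed team namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
```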
In Chapter 9, Monitoring, Logging, and Observability, and Chapter 10, Operating and Maintaining Efficient Kubernetes Clusters, we will learn about cost observability and cluster operations.
Manageability and operational efficiency
Usually, when an organization starts building a Kubernetes infrastructure, it invests most of its time, effort, and focus in the urgent and critical demands of infrastructure design and deployment, which we usually call Day 0 and Day 1. It is unlikely that an organization will devote the same attention to the operational and manageability concerns that it will face in the future (Day 2).
This is usually justified by a lack of experience with Kubernetes and its operational challenges, or by a focus on the benefits of Kubernetes that mainly relate to development, such as increasing developer productivity and agility, and automating releases and deployments.
All of this leads to organizations and teams being less prepared for Day 2. In this book, we try to maintain a balance between design, implementation, and operations, shed some light on the important aspects of operations, and learn how to plan for them from Day 0, especially in relation to reliability, availability, security, and observability.
Operational challenges with Kubernetes
These are the common operational and manageability challenges that most teams face after deploying Kubernetes in production. This is where you need to rethink and consider solutions beforehand in order to handle these challenges properly:
- Reliability and scaling: When your infrastructure scales up, you could end up with tens or hundreds of clusters, or clusters with hundreds or thousands of nodes, and tons of configurations for different environment types. This makes it harder to manage the SLAs/SLOs of your applications, as well as the uptime goals, and even diagnosing a cluster issue could be very problematic. Teams need to develop their Kubernetes knowledge and troubleshooting skills.
- Observability: No doubt Kubernetes is complex, and this makes monitoring and logging a must-have service once your cluster is serving production; otherwise, you will have a very tough time identifying issues and problems. Deploying monitoring and logging tools, in addition to defining the basic observability metrics and thresholds, is what you need to take care of in this regard.
- Updateability and cluster management: Updating Kubernetes components, such as the API server, kubelet, etcd, kube-proxy, Docker images, and the configuration of cluster add-ons, becomes challenging to manage over the cluster life cycle. This requires the correct tools to be in place from the outset. Automation and IaC tools, such as Terraform, Ansible, and Helm, are commonly used to help in this regard.
- Disaster recovery: What happens when you have a partial or complete cluster failure? What is the recovery plan? How do you mitigate this risk and decrease the mean time to recover your clusters and workloads? This requires deploying the correct tools and writing the playbooks for backups, recovery, and crisis management.
- Security and governance: You need to ensure that security best practices and governance policies are applied and enforced in relation to production clusters and workloads. This becomes challenging due to the complex nature of Kubernetes and its soft isolation techniques, its agility, and the rapid pace it brings to the development and release life cycles.
There are other operational challenges. However, we found that most of these can be mitigated if we stick to the following infrastructure best practices and standards:
- Infrastructure as Code (IaC): This is the default practice for modern infrastructure and DevOps teams. It is also recommended to use declarative IaC tools and technologies over their imperative counterparts.
- Automation: We live in the age of software automation; we tend to automate everything because it is more efficient and easier to manage and scale, but with Kubernetes we need to take automation to another level. Kubernetes comes with the ability to automate the life cycle of containers, and it also brings advanced automation concepts, such as operators and GitOps, which are efficient and can literally automate automations (a minimal GitOps sketch follows this list).
- Standardization: Having a set of standards helps to reduce teams' struggles with aligning and working together, eases the scaling of the processes, improves the overall quality, and increases productivity. This becomes essential for companies and teams that are planning to use Kubernetes in production, as this involves integrating with different infrastructure parts, migrating services from on-premises to the cloud, and many further complexities.
Defining your set of standards covers processes such as operations runbooks and playbooks, as well as technology standardization – using Docker, Kubernetes, and standard tools across teams. These tools should have specific characteristics: open source but battle-tested in production, able to support the other principles, such as IaC, immutability, and being cloud-agnostic, and simple to use and deploy with a minimum of infrastructure.
- Single source of truth: Having a single source of truth is a cornerstone of and enabler for modern infrastructure management and configuration. Source code control systems such as Git have become the standard choice for storing and versioning infrastructure code, and having a single, dedicated source code repository for infrastructure is the recommended practice to follow.
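Combining the GitOps and single-source-of-truth principles, the following minimal sketch shows one common way to implement them with Argo CD (assuming you use it as your GitOps operator): an Application object that continuously syncs the cluster from a Git repository. The repository URL, path, and names are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-addons
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/k8s-infrastructure.git  # placeholder repo
    targetRevision: main
    path: clusters/production/addons         # placeholder path in the repo
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: kube-system
  syncPolicy:
    automated:
      prune: true        # remove resources that were deleted from Git
      selfHeal: true     # revert manual drift back to the state declared in Git
```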
Managing Kubernetes infrastructure is largely about managing complexity. Hence, having a solid infrastructure design, applying best practices and standards, and increasing the team's Kubernetes-specific skills and expertise will all result in a smooth operations and manageability journey.