This article is an excerpt from the book, Mastering Prometheus, by William Hegedus. Become a Prometheus master with this guide that takes you from the fundamentals to advanced deployment in no time. Equipped with practical knowledge of Prometheus and its ecosystem, you’ll learn when, why, and how to scale it to meet your needs.
In this article, readers will dive into techniques for optimizing Prometheus, a powerful open-source monitoring tool, by implementing sharding. As data volumes increase, so do the challenges associated with high cardinality, often resulting in strained single-instance setups. Instead of purging data to reduce load, sharding offers a viable solution by distributing scrape jobs across multiple Prometheus instances. This article explores two primary sharding methods: by service, which segments data by use case or team, and by dynamic relabeling, which provides a more flexible, albeit complex, approach to distributing data. By examining each method’s setup and trade-offs, the article offers practical insights for scaling Prometheus while maintaining efficient access to critical metrics across instances.
Chances are that if you’re looking to improve your Prometheus architecture through sharding, you’re hitting one of the limitations we talked about and it’s probably cardinality. You have a Prometheus instance that’s just got too much data in it, but… you don’t want to get rid of any data. So, the logical answer is… run another Prometheus instance!
When you split data across Prometheus instances like this, it’s referred to as sharding. If you’re familiar with other database designs, it probably isn’t sharding in the traditional sense. As previously established, Prometheus TSDBs do not talk to each other, so it’s not as if they’re coordinating to shard data across instances. Instead, you predetermine where data will be placed by how you configure the scrape jobs on each instance. So, it’s more like sharding scrape jobs than sharding the data. There are two main ways to accomplish this: sharding by service and sharding via relabeling.
This is arguably the simpler of the two ways to shard data across your Prometheus instances. Essentially, you just separate your Prometheus instances by use case. This could be a Prometheus instance per team, where you have multiple Prometheus instances and each one covers services owned by a specific team so that each team still has a centralized location to see most of the data they care about. Or, you could arbitrarily shard it by some other criteria, such as one Prometheus instance for virtualized infrastructure, one for bare-metal, and one for containerized infrastructure.
Regardless of the criteria, the idea is that you segment your Prometheus instances based on use case so that there is at least some unification and consistency in which Prometheus gets which scrape targets. This makes it at least a little easier for other engineers and developers to reason about where the metrics they care about are located.
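To make the idea concrete, here is a minimal Python sketch of a per-team split. The team names, instance names, and scrape job names are all hypothetical and only illustrate that the assignment is a fixed, human-readable mapping rather than something computed per target.

```python
# Hypothetical mapping of owning team -> Prometheus instance (all names are made up).
TEAM_TO_PROMETHEUS = {
    "payments": "prometheus-payments",
    "platform": "prometheus-platform",
}

# Hypothetical scrape jobs, each tagged with the team that owns the service.
SCRAPE_JOBS = [
    {"job_name": "checkout-api", "team": "payments"},
    {"job_name": "ledger", "team": "payments"},
    {"job_name": "ingress-nginx", "team": "platform"},
]


def jobs_by_instance(jobs, team_to_instance):
    """Group scrape job names under the Prometheus instance that owns them."""
    grouped = {instance: [] for instance in team_to_instance.values()}
    for job in jobs:
        grouped[team_to_instance[job["team"]]].append(job["job_name"])
    return grouped


assignment = jobs_by_instance(SCRAPE_JOBS, TEAM_TO_PROMETHEUS)
for instance, names in assignment.items():
    print(instance, names)
```

Because the mapping is written down explicitly, anyone who knows which team owns a service also knows which Prometheus instance to query.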
From there, it’s fairly self-explanatory to get set up. It only entails setting up your scrape job in different locations. So, let’s take a look at the other, slightly more involved way of sharding your Prometheus instances.
Sharding via relabeling is a much more dynamic way of handling the sharding of your Prometheus scrape targets. However, it does have some trade-offs. The biggest one is the added complexity of not necessarily knowing which Prometheus instance your scrape targets will end up on. As opposed to the sharding by service/team/domain example we already discussed, sharding via relabeling does not shard scrape jobs in a way that is predictable to users.
Now, just because sharding is unpredictable to humans does not mean that it is not deterministic. It is consistent, just not in a way that makes it clear to users which Prometheus instance holds the metrics they want to see. There are ways to work around this with tools such as Thanos (which we’ll discuss later in this book) or federation (which we’ll discuss later in this chapter).
The key to sharding via relabeling is the hashmod function, which is available during relabeling in Prometheus. The hashmod function works by taking a list of one or more source labels, concatenating their values, producing an MD5 hash of the result, and then applying a modulus to it. The output is stored in a label, and in the next relabeling step you keep or drop targets that have a specific hashmod value.
What’s relabeling again?
For a refresher on relabeling in Prometheus, consult Chapter 4’s section on it. For this chapter, the type of relabeling we’re doing is standard relabeling (as opposed to metric relabeling) – it happens before a scrape occurs.
Let’s look at an example of how this works logically before diving into implementing it in our kube-prometheus stack. We’ll just use the Python REPL to keep it quick:
>>> from hashlib import md5
>>> SEPARATOR = ";"
>>> MOD = 2
>>> targetA = ["app=nginx", "instance=node2"]
>>> targetB = ["app=nginx", "instance=node23"]
>>> hashA = int(md5(SEPARATOR.join(targetA).encode("utf-8")).hexdigest(), 16)
>>> hashA
>>> hashB = int(md5(SEPARATOR.join(targetB).encode("utf-8")).hexdigest(), 16)
>>> hashB
>>> print(f"{targetA} % {MOD} = ", hashA % MOD)
['app=nginx', 'instance=node2'] % 2 = 0
>>> print(f"{targetB} % {MOD} = ", hashB % MOD)
['app=nginx', 'instance=node23'] % 2 = 1
As you can see, the hash of the app and instance labels has a modulus of 2 applied to it. For node2, the result is 0. For node23, the result is 1. Since the modulus is 2, those are the only possible values. Therefore, if we had two Prometheus instances, we would configure one to only keep targets where the result is 0, and the other would only keep targets where the result is 1 – that’s how we would shard our scrape jobs.
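The same logic can be wrapped in a small helper that works for any number of shards. This is only a sketch under assumptions: the function names are hypothetical, the inputs are plain label values, and the use of only the last 8 bytes of the MD5 digest reflects my understanding of how Prometheus computes hashmod rather than anything stated above. For a modulus of 2 it gives the same parity as hashing the full digest.

```python
from hashlib import md5


def hashmod(values, modulus, separator=";"):
    """Join label values, MD5-hash the result, and reduce it modulo `modulus`."""
    joined = separator.join(values).encode("utf-8")
    digest = md5(joined).digest()
    # Assumption: mirror Prometheus by using only the last 8 bytes of the digest.
    return int.from_bytes(digest[8:], "big") % modulus


def should_keep(values, shard, total_shards):
    """True if the shard numbered `shard` (starting at 0) owns this target."""
    return hashmod(values, total_shards) == shard


# Hypothetical targets described by their (app, instance) label values.
targets = [["nginx", "node2"], ["nginx", "node23"], ["nginx", "node7"]]
for target in targets:
    owners = [s for s in range(3) if should_keep(target, s, 3)]
    print(target, "-> shard", owners)
```

Whatever the modulus, each target lands on exactly one shard, which is what makes the scheme deterministic even though the placement is not obvious to a human.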
The modulus value that you choose should generally correspond to the number of Prometheus instances that you wish to shard your scrape jobs across. Let’s look at how we can accomplish this type of sharding across two Prometheus instances using kube-prometheus.
Luckily for us, kube-prometheus has built-in support for sharding Prometheus instances using relabeling, courtesy of the Prometheus Operator, which exposes it as an option on Prometheus CRD objects.
Enabling it is as simple as updating our prometheusSpec in our Helm values to specify the number of shards.
Additionally, we’ll need to clean up the names of our Prometheus instances; otherwise, Kubernetes won’t allow the new Pod to start due to character constraints. We can tell kube-prometheus to stop including kube-prometheus in the names of our resources, which will shorten the names. To do this, we’ll set cleanPrometheusOperatorObjectNames: true.
The new values being added to our Helm values file from Chapter 2 look like this:
prometheus:
  prometheusSpec:
    shards: 2
cleanPrometheusOperatorObjectNames: true
The full values file is available in this GitHub repository, which was linked at the beginning of this chapter.
With that out of the way, we can apply these new values to get an additional Prometheus instance running to shard our scrape jobs across the two. The helm command to accomplish this is as follows:
$ helm upgrade --namespace prometheus \
--version 47.0.0 \
--values ch6/values.yaml \
mastering-prometheus \
Once that command completes, you should see a new pod named prometheus-mastering-prometheus-kube-shard-1-0 in the output of kubectl get pods.
Now, we can see the relabeling that’s taking place behind the scenes so that we can understand how it works and how to implement it in Prometheus instances not running via the Prometheus Operator.
Port-forward to either of the two Prometheus instances (I chose the new one) and we can examine the configuration in our browsers at http://localhost:9090/config:
$ kubectl port-forward \
pod/prometheus-mastering-prometheus-kube-shard-1-0 \
The relevant section we’re looking for is the sequential parts of relabel_configs, where hashmod is applied and then a keep action is applied based on the output of hashmod and the shard number of the Prometheus instance.
It should look like this:
[ . . . ]
- source_labels: [__address__]
  separator: ;
  regex: (.*)
  modulus: 2
  target_label: __tmp_hash
  replacement: $1
  action: hashmod
- source_labels: [__tmp_hash]
  separator: ;
  regex: "1"
  replacement: $1
  action: keep
As we can see, for each scrape job, a modulus of 2 is taken from the hash of the __address__ label, and its result is stored in a new label called __tmp_hash. You can store the result in whatever you want to name your label – there’s nothing special about __tmp_hash. Additionally, you can choose any one or more source labels you wish – it doesn’t have to be __address__. However, it’s recommended that you choose labels that will be unique per target – so instance and __address__ tend to be your best options.
After calculating the modulus of the hash, the next step is the crucial one that determines which scrape targets the Prometheus shard will scrape. It takes the value of the __tmp_hash label and matches it against its shard number (shard numbers start at 0), and keeps only targets that match.
The Prometheus Operator does the heavy lifting of automatically applying these two relabeling steps to all configured scrape jobs, but if you’re managing your own Prometheus configuration directly, then you will need to add them to every scrape job that you want to shard across Prometheus instances – there is currently no way to do it globally.
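If you manage the configuration yourself, one way to avoid hand-editing every scrape job is to generate the rule pair programmatically. The following is a rough sketch, not the Operator's implementation: the helper names and the two-job configuration are hypothetical, and the configuration is assumed to be already parsed into Python dictionaries. The rules mirror the hashmod and keep steps shown above, leaving out the fields that were at their displayed defaults.

```python
import copy
import json


def shard_relabel_configs(shard, total_shards, source_labels=("__address__",)):
    """Return the hashmod + keep rule pair for one shard (numbered from 0)."""
    return [
        {
            "source_labels": list(source_labels),
            "modulus": total_shards,
            "target_label": "__tmp_hash",
            "action": "hashmod",
        },
        {
            "source_labels": ["__tmp_hash"],
            "regex": str(shard),
            "action": "keep",
        },
    ]


def add_sharding(config, shard, total_shards):
    """Return a copy of `config` with the shard rules appended to every scrape job."""
    sharded = copy.deepcopy(config)
    for job in sharded.get("scrape_configs", []):
        job.setdefault("relabel_configs", []).extend(
            shard_relabel_configs(shard, total_shards)
        )
    return sharded


# Hypothetical two-job configuration, already parsed into Python data.
base_config = {
    "scrape_configs": [
        {"job_name": "node", "static_configs": [{"targets": ["node1:9100"]}]},
        {"job_name": "nginx", "static_configs": [{"targets": ["web1:9113"]}]},
    ]
}

shard_1 = add_sharding(base_config, shard=1, total_shards=2)
print(json.dumps(shard_1, indent=2))
```

Running the same function with shard=0 produces the configuration for the other instance, so the only difference between shards is the regex in the keep rule.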
It’s worth mentioning that sharding in this way does not guarantee that your scrape targets will be evenly spread across your shards. We can port-forward to the other Prometheus instance and run a quick PromQL query to see that they’re not evenly distributed across my two shards. I’ll forward it to port 9091 on localhost so that I can open both instances simultaneously:
$ kubectl port-forward \
pod/prometheus-mastering-prometheus-kube-0 \
Then, we can run this simple query to see how many scrape targets are assigned to each Prometheus instance:
In my setup, there are 8 scrape targets on shard 0 and 16 on shard 1. You can attempt to micro-optimize scrape target sharding by including more unique labels in the source_labels values for the hashmod operation, but it may not be worth the effort – as you add more unique scrape targets, they’ll begin to even out.
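The evening-out effect can be explored with a quick simulation. This sketch hashes synthetic, made-up addresses the same way as the earlier REPL example (again assuming that only the last 8 bytes of the MD5 digest are used) and counts how many land on each of two shards as the number of targets grows. The exact counts depend on the addresses, so no particular output is claimed here.

```python
from collections import Counter
from hashlib import md5


def shard_of(address, total_shards):
    """Shard number for one target address (hashmod of a single label value)."""
    digest = md5(address.encode("utf-8")).digest()
    # Assumption: only the last 8 bytes of the digest feed the modulus.
    return int.from_bytes(digest[8:], "big") % total_shards


def distribution(addresses, total_shards):
    """Number of targets assigned to each shard, indexed by shard number."""
    counts = Counter(shard_of(address, total_shards) for address in addresses)
    return [counts.get(shard, 0) for shard in range(total_shards)]


for n in (24, 240, 2400):
    # Synthetic addresses; real targets would come from service discovery.
    addresses = [f"10.0.{i // 250}.{i % 250}:9100" for i in range(n)]
    print(n, "targets ->", distribution(addresses, 2))
```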
One practical pain point you may have noticed already with sharding is having to navigate to multiple Prometheus instances to run queries. One way to make this easier is to federate our Prometheus instances.
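To illustrate the pain point, the sketch below sends the same instant query to every shard through the Prometheus HTTP API (/api/v1/query) and merges the results by hand. The helper names are hypothetical, error handling is minimal, and it assumes each shard is reachable at its own base URL, such as the two port-forwards used earlier.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url, promql):
    """Build the instant-query URL for one Prometheus instance."""
    params = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"


def fetch(base_url, promql, timeout=5):
    """Run one instant query and return the parsed JSON response."""
    with urlopen(query_url(base_url, promql), timeout=timeout) as response:
        return json.load(response)


def merge_vectors(responses):
    """Flatten per-shard responses into one list, tagging each sample with its shard."""
    merged = []
    for base_url, body in responses.items():
        if body.get("status") != "success":
            continue  # skip shards that returned an error
        for sample in body["data"]["result"]:
            merged.append({"shard": base_url, **sample})
    return merged


def query_all(base_urls, promql):
    """Send the same query to every shard and merge the results."""
    return merge_vectors({url: fetch(url, promql) for url in base_urls})
```

With both port-forwards running, a call such as query_all(["http://localhost:9090", "http://localhost:9091"], "count(up)") would return one sample per shard; count(up) is just an example query chosen for illustration. Federation and Thanos exist precisely so that this kind of fan-out does not have to be done by hand.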
In conclusion, sharding Prometheus is an effective way to manage the challenges posed by data volume and cardinality in your system. Whether you opt for sharding by service or through dynamic relabeling, both approaches offer ways to distribute scrape jobs across multiple Prometheus instances. While sharding via relabeling introduces more complexity, it also provides flexibility and scalability. However, it is important to consider the trade-offs, such as uneven distribution of scrape jobs and the need for tools like Thanos or federation to simplify querying across instances. By applying these strategies, you can ensure a more efficient and scalable Prometheus architecture.
Will Hegedus has worked in tech for over a decade in a variety of roles, most recently in Site Reliability Engineering. After becoming the first SRE at Linode, an independent cloud provider, he came to Akamai Technologies by way of an acquisition.
Now, Will manages a team of SREs focused on building an internal observability platform for Akamai’s Connected Cloud. His team's responsibilities include managing a global fleet of Prometheus servers ingesting millions of data points every second.
Will is an open-source advocate with contributions to Prometheus, Thanos, and other CNCF projects related to Kubernetes and observability. He lives in central Virginia with his wonderful wife, 4 kids, 3 cats, 2 dogs, and bearded dragon.