With growth come more teams, more infrastructure, more applications. With time, running a single Prometheus server can start to become infeasible: changes in recording/alerting rules and scrape jobs become more frequent (thus requiring reloads which, depending on the configured scrape intervals, can take up to a couple of minutes), missed scrapes can start to happen as Prometheus becomes overwhelmed, or the person or team responsible for that instance may simply become a bottleneck in terms of organizational process. When this happens, we need to rethink the architecture of our solution so that is scales accordingly. Thankfully, this is something the community has tackled time and time again, and so there are some recommendations on how to approach this problem. These recommendations revolve around sharding.
In this context, sharding means splitting...