Yesterday, Uber open-sourced M3, its robust and scalable metrics infrastructure, for Prometheus, a popular monitoring and alerting solution. Uber has long used M3 to access metrics on its backend systems. By open sourcing M3 as a remote storage backend for Prometheus, Uber wants the broader community to benefit from its metrics platform.
Prior to releasing M3, Uber released M3DB, the scalable storage backend for M3. M3DB is a distributed time series database that can be used for storing real-time metrics at long retention periods.
Along with M3, Uber also open sourced M3 Coordinator, a bridge that users can deploy to access the benefits of M3DB and Prometheus. The M3 Coordinator performs downsampling, ad hoc retention, and aggregation of metrics using retention and rollup rules, which makes it possible to apply specific retention policies and aggregations to subsets of metrics on the fly. These rules are stored in etcd, which runs embedded in the binary of an M3DB seed node.
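To illustrate what downsampling means in practice, here is a minimal, hypothetical Go sketch that rolls raw datapoints up into fixed-size windows and emits one averaged point per window. The types and function names are invented for illustration; this is a conceptual sketch of the kind of aggregation a rollup rule might specify, not M3's actual API or implementation.

```go
package main

import (
	"fmt"
	"time"
)

// Datapoint is a single raw metric sample (hypothetical type, not M3's API).
type Datapoint struct {
	Timestamp time.Time
	Value     float64
}

// downsampleMean buckets raw datapoints into fixed-size windows and emits
// one averaged point per window.
func downsampleMean(points []Datapoint, window time.Duration) []Datapoint {
	buckets := make(map[int64][]float64)
	for _, p := range points {
		// Truncate each timestamp to the start of its window.
		bucket := p.Timestamp.Truncate(window).UnixNano()
		buckets[bucket] = append(buckets[bucket], p.Value)
	}
	var out []Datapoint
	for ts, values := range buckets {
		var sum float64
		for _, v := range values {
			sum += v
		}
		out = append(out, Datapoint{
			Timestamp: time.Unix(0, ts),
			Value:     sum / float64(len(values)),
		})
	}
	return out
}

func main() {
	now := time.Now().Truncate(time.Minute)
	raw := []Datapoint{
		{now, 1.0},
		{now.Add(10 * time.Second), 3.0},
		{now.Add(70 * time.Second), 5.0},
	}
	// The first two samples fall into one 1-minute window, the third into the next.
	for _, p := range downsampleMean(raw, time.Minute) {
		fmt.Println(p.Timestamp, p.Value)
	}
}
```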
Although Prometheus is a popular monitoring and alerting solution, its scalability and durability are limited to what a single node can provide. The M3 metrics platform provides a turnkey, scalable, and configurable multi-tenant store for Prometheus metrics.
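As a rough illustration of how a remote storage backend receives data, the following Go sketch decodes the payload Prometheus ships over its remote write protocol: a snappy-compressed, protobuf-encoded prompb.WriteRequest. It is a minimal stand-in for what an ingest endpoint such as the M3 Coordinator's does, following the pattern of Prometheus's own remote storage adapter example; the port and path here are assumptions for the sketch, not M3's documented configuration.

```go
package main

import (
	"io/ioutil"
	"log"
	"net/http"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// handleWrite decodes a Prometheus remote write request. A real backend
// would persist the decoded samples; here we just log a summary.
func handleWrite(w http.ResponseWriter, r *http.Request) {
	compressed, err := ioutil.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	// The body is snappy-compressed protobuf.
	buf, err := snappy.Decode(nil, compressed)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	var req prompb.WriteRequest
	if err := proto.Unmarshal(buf, &req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	for _, ts := range req.Timeseries {
		log.Printf("received series with %d labels and %d samples",
			len(ts.Labels), len(ts.Samples))
	}
}

func main() {
	// Example path and port only; Prometheus's remote_write url would
	// point at this endpoint.
	http.HandleFunc("/write", handleWrite)
	log.Fatal(http.ListenAndServe(":7201", nil))
}
```

Prometheus is then configured with a remote_write URL pointing at such an endpoint; the M3 Coordinator exposes an equivalent write endpoint for exactly this purpose.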
Source: Uber Engineering
Before M3, Uber emitted metrics to a Graphite stack, which stored them using the Whisper file format in a sharded Carbon cluster. Uber then used Grafana for dashboarding and Nagios for alerting, issuing Graphite threshold checks via source-controlled scripts. However, expanding the Carbon cluster required a manual resharding process and, due to a lack of replication, any single node's disk failure caused permanent loss of its associated metrics. As Uber kept expanding, this solution was no longer sustainable.
This led them to build M3, a system that provides fault-tolerant metrics ingestion, storage, and querying as a managed platform. Released in 2015, M3 now houses over 6.6 billion time series.
With such a global view, upstream consumers need not navigate routing, which makes metrics discovery simpler overall. For workloads that fail over between regions or are sharded across regions, the single global view makes it much easier to sum and query metrics across all regions in a single query. This lets users see all operations of a specific type globally and look at longer retention to view historical trends in a single place.
To achieve this single-pane-of-glass view, metrics are written in M3 to local regional M3DB instances. In this setup, replication is local to a region and can be configured to be isolated by availability zone or rack. Queries fan out to both the local region's M3DB instances and coordinators in remote regions where metrics are stored, returning compressed M3TSZ blocks for matched time series wherever possible.
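The fan-out pattern itself is straightforward to sketch. The hypothetical Go snippet below issues the same query to the local and remote regions concurrently and merges whatever each returns; all names are invented for illustration, and none of this is M3's actual query code, which would operate on compressed M3TSZ blocks rather than plain floats.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Series is a placeholder for a decoded time series result (hypothetical).
type Series struct {
	Region string
	Values []float64
}

// queryRegion stands in for a call to a regional M3DB cluster or a remote
// coordinator. Here it just simulates latency and a canned result.
func queryRegion(ctx context.Context, region, query string) ([]Series, error) {
	select {
	case <-time.After(50 * time.Millisecond):
		return []Series{{Region: region, Values: []float64{1, 2, 3}}}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// fanOutQuery queries every region concurrently and merges the results,
// mirroring how a query can fan out to the local region's storage nodes
// and to coordinators in remote regions.
func fanOutQuery(ctx context.Context, regions []string, query string) []Series {
	var (
		mu     sync.Mutex
		merged []Series
		wg     sync.WaitGroup
	)
	for _, region := range regions {
		wg.Add(1)
		go func(region string) {
			defer wg.Done()
			series, err := queryRegion(ctx, region, query)
			if err != nil {
				return // a real system would surface partial failures
			}
			mu.Lock()
			merged = append(merged, series...)
			mu.Unlock()
		}(region)
	}
	wg.Wait()
	return merged
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	results := fanOutQuery(ctx, []string{"us-east", "us-west", "eu-west"}, "sum(requests)")
	for _, s := range results {
		fmt.Println(s.Region, s.Values)
	}
}
```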
Uber engineers plan to further upgrade M3 by pushing query aggregations down to remote regions, and wherever possible down to the local M3DB storage nodes, so that they execute before results are returned.
Read more about M3 in detail in the official Uber Engineering blog post.