Prometheus has included a new capability in the 2.32.0 release to optimize the single pane of glass approach

From the new upcoming release of Prometheus v2.32.0, we will have an important new feature at our disposal: the Agent Mode. And there is a fantastic blog post announcing this feature from our of the rockstar from the Prometheus team: Bartlomiej Plotka, that I recommend reading. I will add a reference section at the end of the article. I will try to summarise some of the most relevant points here.
Another post about Prometheus, the most critical monitoring system in nowadays cloud-native architectures, has its inception in the Borgmon monitoring system created by Google in ancient times (around the 2010–2014 period).
Based on this importance, its usage has been growing incredibly and making its relationship with the Kubernetes ecosystem stronger. We have reached a point that Prometheus is the default option for monitoring in pretty much any scenario that has a Kubernetes workload related to it; some examples are the ones shown below:
- Prometheus is the default option, including the Openshift Monitoring System
- Prometheus has an Amazon Managed Service at your disposal to be used for your workloads.
- Prometheus is included in the Reference Architecture for Cloud-Native Azure Deployments.
Because of this popularity and growth, many different use-cases have raised some improvements that can be done. Some of them are related to specific use-cases such as edge deployment or providing a global view, or a single pane of glass.
Until now, if I have several Prometheus deployments, monitor a specific subset of your workloads because of their resides on different networks or because there are various clusters, you can rely on the remote write capability to aggregate that into a global view approach.
Remote Write is a capability that has existed in Prometheus since its inception. The metrics that Prometheus is scraping can be sent automatically to a different system using their integrations. This can be configured for all the metrics or just a subset. But even with all of these, they are jumping ahead on this capability, which is why they are introducing the Agent mode.
Agent Mode optimizes the remote write use case configuring the Prometheus instance in a specific mode to do this job in an optimized way. That model implies the following configuration:
- Disable querying and alerting.
- Switch the local storage with a customized TSDB WAL
And the remarkable thing is that everything else is the same, so we will still use the same API, discover capabilities, and related configuration. And what all of this will provide to you? Let’s take a look at the benefits you will get of doing so:
- Efficiency: Customised TSDB WAL will keep only the data that could not be sent to the target location; as soon as it succeeds, it will remove that piece of data.
- Scalability: It will improve scalability, enabling easier horizontal scalability for ingestion. This is because this agent mode disables some of the reasons auto-stability is complex in normal server-mode Prometheus. A stateful workload makes scalability complex, especially in scale-down scenarios. So this mode will lead to a “more-stateless” workload that will simplify this scenario and be close to the dream of an auto-scalability metric ingestion system.
This feature is available as an experimental flag in the new release, but this was already tested with Grafana Labs’ works, especially on the performance side.
If you want to take a look at more details about this feature, I would recommend taking a look at the following article: https://prometheus.io/blog/2021/11/16/agent/