System monitoring for a new generation with Prometheus
Big Watcher
Legacy monitoring solutions are fine for small-to-medium-sized networks, but complex environments benefit from a different approach. Prometheus is an interesting alternative to classic tools like Nagios.
Where monitoring is required, alerting and trending are never far away. Alerting plays a major role in practically any monitoring environment; the idea is to draw the administrator's attention to failures. Trending is just as important because it helps the admin detect potential bottlenecks at an early stage.
A quick look at the available monitoring solutions shows why Monitoring, Alerting, and Trending (MAT) are still an issue for many networks, particularly large and complex networks. Nagios, which has dominated the monitoring market for a long time, is a behemoth of complexity and comes with some inherent weaknesses.
Nagios alternatives such as Icinga have attempted to address some of the issues, but their scalability is limited. The ballast of compatibility with Nagios and its plugins aggravates the situation. A state-of-the-art feature like trending was never designed into legacy Nagios. PNP4Nagios [1], a performance-tracking Nagios add-on, is one of the few options for useful trending with Nagios (Figure 1).
SoundCloud as the Precursor
Berlin-based SoundCloud was confronted with the challenge of implementing a monitoring solution. The company operates a streaming service along the lines of Spotify or Apple Music. The real challenge from the outset was to build a MAT system that would work reliably with thousands of nodes. Instead of combining existing components to create a better-than-nothing solution, SoundCloud decided to explore unknown territory. The company chose to develop its own monitoring system, and the result was Prometheus [2].
Compared with established solutions like Nagios, Prometheus has one very special feature: It comes with its own storage system to manage the data acquired from the network. Prometheus' internal database is a time series database, and Prometheus tends to think in terms of complete metrics rather than individual alerts. To understand what that means, I will take a short detour into the storage universe.
How MAT Systems Manage Data
Classic monitoring systems, such as Nagios, do not have very sophisticated data management, and they don't actually need it. The important thing with monitoring is whether a service is running properly right now. When you add the topic of trending, things start to become more difficult: Trending means you need long-term records relating to the availability of the service or the load on the existing infrastructure.
PNP4Nagios, for example, supports a database such as MySQL in the background in order to store the required values for a long period. MySQL is actually not designed for this kind of use, which can lead to problems. The volume of data you need to manage will grow extremely quickly in any large installation. The persistent storage on which all your trending data resides thus needs to scale just as easily as the entire platform. This is particularly true of the storage, but it also applies to the way in which the database handles a continuously increasing volume of data.
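To get a feel for how quickly trending data accumulates, a back-of-the-envelope estimate helps. The following sketch is only an illustration: The node count, the number of metrics per node, the collection interval, and the bytes per data point are all made-up assumptions, not measurements from a real installation.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(nodes: int, metrics_per_node: int, interval_seconds: int) -> int:
    """Data points collected per day across the whole platform."""
    collections_per_day = SECONDS_PER_DAY // interval_seconds
    return nodes * metrics_per_node * collections_per_day


def storage_per_day_bytes(samples: int, bytes_per_sample: int) -> int:
    """Raw storage needed per day, ignoring indexes and compression."""
    return samples * bytes_per_sample


# Hypothetical installation: 1,000 nodes, 200 metrics each, one reading every 15 seconds.
daily_samples = samples_per_day(nodes=1000, metrics_per_node=200, interval_seconds=15)

# Assumed 16 bytes per data point (8-byte timestamp plus 8-byte value).
daily_bytes = storage_per_day_bytes(daily_samples, bytes_per_sample=16)

print(f"{daily_samples:,} samples per day")
print(f"{daily_bytes / 1e9:.1f} GB per day before compression")
```

Under these assumptions, the platform produces more than a billion data points per day, which is why the storage back end has to scale along with the platform.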
Preparing the data is also a challenge: The data reaches the MAT system sorted in order of time, but at the other end, you'll need to output the data by service. For example, the MAT system is regularly supplied with data points from its target systems for various services in consecutive order, such as "9AM: CPU load 1, RAM utilization 30 percent, and disk space usage 15 percent." However, administrators will typically want to know what the CPU load looked like in a specific period, for example, between 9AM today and the same time the previous morning.
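The reshaping problem can be sketched in a few lines of Python. This is a minimal illustration rather than the way any particular MAT system stores its data; the metric names, timestamps, and the functions group_by_metric and query_range are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[int, str, float]   # (timestamp, metric name, value)
Point = Tuple[int, float]         # (timestamp, value)


def group_by_metric(samples: Iterable[Sample]) -> Dict[str, List[Point]]:
    """Turn a time-ordered stream of mixed samples into one series per metric."""
    series: Dict[str, List[Point]] = defaultdict(list)
    for timestamp, metric, value in samples:
        series[metric].append((timestamp, value))
    for points in series.values():
        points.sort()  # keep each series ordered by timestamp
    return dict(series)


def query_range(series: Dict[str, List[Point]], metric: str,
                start: int, end: int) -> List[Point]:
    """Return the points of one metric between start and end (inclusive)."""
    return [(t, v) for t, v in series.get(metric, []) if start <= t <= end]


# Samples arrive in order of time, with all services mixed together.
arrivals = [
    (900, "cpu_load", 1.0), (900, "ram_percent", 30.0), (900, "disk_percent", 15.0),
    (1000, "cpu_load", 1.4), (1000, "ram_percent", 31.0), (1000, "disk_percent", 15.0),
    (1100, "cpu_load", 0.8), (1100, "ram_percent", 29.0), (1100, "disk_percent", 16.0),
]

series = group_by_metric(arrivals)
print(query_range(series, "cpu_load", 900, 1000))
```

The admin's question ("what did the CPU load look like in this period?") becomes a lookup in a single series instead of a scan across every record that arrived in that time.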
Storing and manipulating large amounts of data in a database is an extremely resource-hungry process, and MySQL, in particular, loves taking its time with queries from tools like PNP4Nagios. A time-series database, such as the database used with Prometheus, offers an alternative approach.
Basically, a time series database is no more than a database that is designed for storing values together with the time at which they were recorded. (See the box titled "Not the First, But the Best.") The data is converted by algorithms directly in the database. Thanks to its data model, Prometheus is thus better equipped to take on a complex task such as trending.
Not the First, But the Best
Prometheus is not the first attempt to apply the time-series database model to network monitoring. Graphite [3] was around long before Prometheus, but its data model is not as mature. InfluxDB [4], which is typically combined with a frontend such as Sensu, is even younger than Prometheus, but it addresses a different user group and, according to our tests, doesn't scale as well as Prometheus when faced with large volumes of data. And then there is OpenTSDB [5], the Open Time Series Database, which is fundamentally very similar to Prometheus but requires external add-on components such as Hadoop. The fact that these external constraints do not apply to Prometheus is something that many admins really appreciate about the product.
Typical monitoring and alerting is then no more than a side product: If no results are received for a specific metric over a period of time, the system assumes the service is not running correctly and sounds the alarm.
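The idea of alerting as a side product of the stored metrics can be sketched as a simple staleness check. This is an illustration of the principle only, not the way alerts are actually defined in Prometheus; the function stale_metrics, the metric names, and the threshold are assumptions.

```python
from typing import Dict, List


def stale_metrics(last_seen: Dict[str, float], now: float,
                  max_age_seconds: float) -> List[str]:
    """Return the metrics for which no result has arrived within max_age_seconds."""
    return sorted(
        name for name, timestamp in last_seen.items()
        if now - timestamp > max_age_seconds
    )


# Hypothetical bookkeeping: time of the most recent result for each metric.
last_seen = {
    "web01_cpu_load": 1000.0,
    "web02_cpu_load": 700.0,   # nothing received for a while
    "db01_cpu_load": 990.0,
}

for metric in stale_metrics(last_seen, now=1010.0, max_age_seconds=60.0):
    print(f"ALERT: no data for {metric}")
```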
Prometheus Modular Architecture
Under the hood, Prometheus relies on a modular architecture. The core of the application – that is, the time series database – is programmed in Go, just like most of the applications in the Prometheus distribution. The database comes with its own web interface and a separate tool for alert management (the Alert Manager). Exporters for the target host are important – exporter is basically another word for agent: The node exporter, for example, logs various data for metrics such as CPU load or RAM usage on the host on which it is running, giving the Prometheus database the ability to pull this data when needed. If the service needs to push its data to the MAT system, you can deploy the push gateway, which fields the data from the services and stages it for the database.
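The pull principle can be illustrated with a toy exporter: a small HTTP service that publishes its current readings as plain text so the monitoring server can fetch them whenever it likes. The sketch below is written in Python rather than Go for brevity and is a simplified stand-in, not the real node exporter; the metric names, the values in read_metrics, the port, and the reduced text format (one type line and one value line per metric) are assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict


def read_metrics() -> Dict[str, float]:
    """Stand-in for real measurements; names and values are made up."""
    return {"demo_cpu_load": 1.0, "demo_ram_used_percent": 30.0}


def render_metrics(metrics: Dict[str, float]) -> str:
    """Render the metrics as plain text, one gauge per metric."""
    lines = []
    for name in sorted(metrics):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {metrics[name]}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Only the /metrics path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(read_metrics()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever and answer pull requests from the monitoring server."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the toy exporter; the monitoring server would then fetch /metrics at regular intervals. Services that cannot be polled in this way are the use case for the push gateway.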
At the heart of the system is the Prometheus server (Figure 2). The server handles many tasks, the most important of which is storing the measurement data acquired in the cloud. Although Prometheus comes from the cloud camp, the service is lagging behind in scalability: You can easily run any number of Prometheus instances within the same setup, but in contrast to many other solutions, Prometheus does not rely on shared storage on the back end.
The Prometheus developers cite complexity as a reason for avoiding shared storage. They mention their competitor OpenTSDB as a negative example. Many admins would love to deploy OpenTSDB, but they are put off by the enormous overhead of running a complete Hadoop cluster.
Instead, Prometheus relies on the sharding principle. You can configure multiple instances of the Prometheus server to cover overlapping data areas. Before performing a search, the database determines the shard in which the data in question must reside and looks only there.
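The lookup step can be sketched as follows. The article does not say how data is assigned to shards, so this example substitutes a common technique, hashing the target name, and simplifies matters by placing each target in exactly one shard. The host names and the function shard_for are hypothetical.

```python
import hashlib
from typing import List


def shard_for(target: str, shards: List[str]) -> str:
    """Pick the shard responsible for a target by hashing the target's name."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# Hypothetical server instances, each holding part of the data.
shards = ["prom-a.example.com", "prom-b.example.com", "prom-c.example.com"]

for target in ["web01", "web02", "db01"]:
    # A query for this target only needs to contact the one shard that holds it.
    print(target, "->", shard_for(target, shards))
```

Because the hash is stable, the same target always maps to the same shard, so a query never has to search the other instances.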
At this level, you can replicate by letting logical pairs of servers collect the data from the same agent on the network. A record is thus available multiple times and still usable in scenarios in which one of the two nodes has failed.
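From the point of view of a client, a replicated pair might be used as in the following sketch: Ask the first server and fall back to its partner if the first one is unreachable. This is an assumption about how a consumer could exploit the redundancy, not a description of a built-in Prometheus feature; the function names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Points = List[Tuple[int, float]]


def query_with_failover(replicas: Sequence[Callable[[], Points]]) -> Points:
    """Return the result of the first replica that answers."""
    last_error: Optional[Exception] = None
    for fetch in replicas:
        try:
            return fetch()
        except ConnectionError as error:
            last_error = error  # remember the failure and try the partner
    raise ConnectionError("all replicas failed") from last_error


def failed_node() -> Points:
    """Simulates a server that is down."""
    raise ConnectionError("node unreachable")


def healthy_node() -> Points:
    """Simulates the partner server, which holds the same records."""
    return [(900, 1.0), (1000, 1.4)]


print(query_with_failover([failed_node, healthy_node]))
```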
The Prometheus developers are aware that there is a problem with this lack of a shared storage alternative. Right now, they are working on a solution that generates a superordinate instance for a cluster of Prometheus installations; the instance, in turn, picks up the data from the Prometheus shards.
This approach gives users centralized administration. In the long term, the intent is for Prometheus to store data in OpenTSDB and thus leverage its replication capabilities.