Why users are changing their monitoring solution

Change Track

Article from Issue 177/2015

Many enterprises use the free Nagios monitoring solution; some would prefer to change to something else. We talked to people who switched to find out how they fared.

The open source world has many free monitoring systems, but if you need a comprehensive and versatile solution, you will quickly end up in the Nagios camp. In the more than 15 years of its existence, Nagios [1] has gained a reach that is unrivaled, largely because of its plugin concept and an engaged and productive community.

But is everyone happy with Nagios? The free version, at least, has a reputation for suffering from performance issues as the number of checks increases, and many people view it as difficult to configure. The commercial version does not offer onsite support outside of the United States and, even within the US, support agreements are not exactly cheap.

More than a few enterprise users are looking into alternatives or have already made the move. Some opt for solutions derived from Nagios, including Icinga [2] (which no longer includes any Nagios code in its Icinga 2 embodiment) or Naemon [3]; others rely on new implementations such as Shinken [4] and commercial derivatives such as op5 [5], NetEye [6], GroundWork [7], or SNAG-View [8].

We interviewed some users who had shopped around for a different solution to learn more about the market for Nagios alternatives. We talked to European users who had only the email and forum support that is included with the Nagios license, although phone support is available for an additional cost [9]. We asked the users a few simple questions about their environments, their motivation for migrating, and how satisfied they currently are after making the move.

Two Problems

The criticism leveled at Nagios, and the desire for a change that follows from it, relates mainly to two points, both of which grow with the size of the environment to be monitored. The first issue is the complex configuration in large text files: Many thousands of lines of text with hundreds of dependencies are just too difficult to manage manually. The second complaint relates to performance and typically affects large-scale environments. Because the classic Nagios does not support multithreading and performs checks in series, the resulting run times make lengthy check intervals necessary.

If you do not extend the intervals, then successive checks tend to start before their predecessors finish. As a result, the legacy architecture scales poorly without any parallelization options. We can't say anything about the commercial Nagios XI version at this point, because none of the respondents had used it.
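The arithmetic behind this can be sketched in a few lines of Python. The function name and all figures below are illustrative assumptions, not measurements from any respondent. The model is deliberately simple: If checks run strictly one after another, a full pass takes the number of checks multiplied by the average check run time, and the interval cannot be shorter than that without passes overlapping.

```python
import math


def min_interval_seconds(num_checks: int, avg_check_seconds: float, workers: int = 1) -> float:
    """Shortest interval at which one full pass finishes before the next begins.

    Simplified model (an assumption): every check takes the same time and
    the checks are spread evenly across `workers` parallel executors.
    """
    if num_checks < 0 or avg_check_seconds < 0 or workers < 1:
        raise ValueError("counts and durations must be non-negative, workers >= 1")
    # The busiest worker determines how long a full pass takes.
    checks_per_worker = math.ceil(num_checks / workers)
    return checks_per_worker * avg_check_seconds


# Hypothetical figures: 10,000 checks that take 0.5 seconds each.
serial = min_interval_seconds(10_000, 0.5)                 # 5000.0 seconds
parallel = min_interval_seconds(10_000, 0.5, workers=100)  # 50.0 seconds
```

Under these assumed numbers, a strictly serial pass would need well over an hour, whereas 100 parallel executors would fit the same checks into a 60-second cycle.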

Both the maintainability problem and the performance problem have existed for a long time, and the approaches to solving them are correspondingly numerous. Modules such as Mod_Gearman [10], which distributes the checks across various queues that are processed in parallel by several computers, are a useful way to improve scalability, and availability benefits in many cases, too.

Shinken also supports distributable worker processes. Icinga implements distributed monitoring, Check_MK [11] takes several different approaches to overcoming performance bottlenecks, op5 relies on the Merlin load balancer module, and so on.
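The common idea behind these approaches, handing checks to several workers that process them in parallel, can be illustrated with a minimal Python sketch. This is not the API of Mod_Gearman, Shinken, or any other tool named here: A thread pool on a single machine stands in for workers that would normally run on separate hosts, and the function name, check names, and status strings are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# A check is any callable that returns a status string; a real check would
# run a plugin or query a device instead.
Check = Callable[[], str]


def run_checks(checks: Dict[str, Check], workers: int = 4) -> Dict[str, str]:
    """Run all checks on a pool of workers and collect one status per check."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit everything first so that the workers can run concurrently.
        futures = {name: pool.submit(check) for name, check in checks.items()}
        results: Dict[str, str] = {}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                # A failing check must not take down the whole pass.
                results[name] = f"UNKNOWN: {exc}"
        return results


def failing_check() -> str:
    raise RuntimeError("plugin timed out")


# Hypothetical checks for illustration.
example_checks: Dict[str, Check] = {
    "web01/http": lambda: "OK",
    "db01/disk": lambda: "WARNING",
    "sw01/ping": failing_check,
}
statuses = run_checks(example_checks, workers=2)
```

Because each check reports through its own future, a slow or failing check delays only its own result rather than every check queued behind it.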

Despite this diversity, most respondents migrated to one of just two other solutions: the kindred Check_MK or a totally different beast in the form of Zabbix [12]. All respondents were more satisfied than before.

Everything on a Single Console

Given the scale involved, it is no surprise that system architect Hubert Bösl had performance problems with Nagios. He currently monitors 5,714 systems with 63,684 services at Munich airport (Figure 1). The range of devices is broad, including network components from Cisco; storage systems from EMC and NetApp; UPSs; computers with Windows, Linux, and Solaris operating systems; Fujitsu and HP servers; virtualized systems running ESX and Hyper-V; web servers; SAP and database servers (Oracle, MS SQL, MySQL); load balancers; Veritas clusters (Symantec HA); CCTV cameras; self-written applications; and much more.

Figure 1: Munich airport monitors a very large IT landscape today with Check_MK.

The decision in favor of Check_MK was influenced by a desire to group all error messages on a central console. That was not possible with off-the-shelf software. Developer Mathias Kettner presented the most compelling offer for programming such a console, which is why Check_MK was awarded the contract. At the start of 2014, the airport migrated from Nagios Core to Check_MK with Micro Core; since then, the monitoring server load has dropped by no less than 80 percent for the same number of services.

The second problem, configuration, can also be improved in this way. Bösl said: "To start with, I was not a fan of automatic inventory, as provided by Check_MK's agents. With Nagios, I was also used to defining each service that was to be monitored. However, in the last two years we have learned to appreciate this feature. We only fix one set of rules that determines what is monitored and what is not. Ninety percent of this ruleset is derived from our CMDB or network management tool. The Check_MK SNMP agents are very much what we need here. In some cases, you just need to specify the IP address and address the system via SNMP. You can then choose the services to be monitored."
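The principle Bösl describes, a single ruleset that decides which discovered services are monitored, can be sketched as follows. This is not Check_MK's rule format or API. The `Rule` class, the shell-style patterns, the host and service names, and the first-match-wins evaluation are all assumptions made for illustration; in practice, the rules would be generated from a source such as a CMDB.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Rule:
    host_pattern: str     # shell-style pattern, e.g. "sw-*"
    service_pattern: str  # shell-style pattern, e.g. "Interface *"
    monitor: bool         # True: monitor matching services, False: ignore them


def select_services(
    discovered: Iterable[Tuple[str, str]], rules: List[Rule]
) -> List[Tuple[str, str]]:
    """Return the (host, service) pairs that the ruleset says to monitor.

    The first matching rule decides; services that match no rule are
    not monitored.
    """
    selected = []
    for host, service in discovered:
        for rule in rules:
            if fnmatchcase(host, rule.host_pattern) and fnmatchcase(
                service, rule.service_pattern
            ):
                if rule.monitor:
                    selected.append((host, service))
                break
    return selected


# Hypothetical ruleset and inventory results.
rules = [
    Rule("*", "Interface lo", False),   # never monitor loopback interfaces
    Rule("sw-*", "Interface *", True),  # monitor all other switch interfaces
    Rule("*", "CPU load", True),        # monitor CPU load everywhere
]
discovered = [
    ("sw-core1", "Interface 1"),
    ("sw-core1", "Interface lo"),
    ("web01", "CPU load"),
    ("web01", "Interface eth0"),
]
monitored = select_services(discovered, rules)
```

With a ruleset like this, adding a host requires no per-service definitions: Whatever the inventory finds is either picked up or ignored according to rules that already exist.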

Performance Problem

Things were similar for Thorsten Kohlhepp from DENIC, the central registry authority for all domains under the .de top-level domain. The registry, with data centers in Frankfurt and Amsterdam and multiple server locations worldwide, monitors around 1,000 systems, which were already distributed across multiple Nagios environments with a maximum of 530 systems each (versions 3.00 to 3.2.2). The biggest problem here, too, was performance. The overhead this setup required was also a problem.

At DENIC, this led to a decision in favor of Zabbix (Figure 2). Above all, users expected to have less overhead with a monolithic solution than by integrating multiple tools such as Graphite, Collectd, Nagios, or Mod_Gearman.

Figure 2: Zabbix is popular because of its flexible visualization options.

"Another positive was the automation of monitoring through auto-activation in Zabbix," reports Kohlhepp. "We currently allow newly installed virtual machines to register with Zabbix. The machines then receive their respective checks through templates via an automatic system. There is therefore no work involved in adding new hosts. We are very much satisfied with this, particularly because we are already performing more checks in Zabbix than we did in Nagios, and the monitoring environment performance is not showing any adverse effects."
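The workflow Kohlhepp describes, in which a new machine registers itself and receives its checks through templates, might look roughly like the following Python sketch. It does not use the Zabbix API. The class, the role attribute, and the template names are hypothetical and serve only to show the mechanism.

```python
from typing import Dict, List

# Hypothetical role-to-template mapping; the names are made up for this sketch.
TEMPLATES_BY_ROLE: Dict[str, List[str]] = {
    "web": ["base-linux", "http"],
    "db": ["base-linux", "database"],
}
# Hosts with an unknown role still receive a basic set of checks.
FALLBACK_TEMPLATES: List[str] = ["base-linux"]


class MonitoringRegistry:
    """Keeps track of registered hosts and the templates assigned to them."""

    def __init__(self) -> None:
        self.hosts: Dict[str, List[str]] = {}

    def register(self, hostname: str, role: str) -> List[str]:
        """Called by a newly installed machine; returns its assigned templates."""
        # Copy the list so that later changes to one host do not affect others.
        templates = list(TEMPLATES_BY_ROLE.get(role, FALLBACK_TEMPLATES))
        self.hosts[hostname] = templates
        return templates


registry = MonitoringRegistry()
registry.register("vm-web-17", "web")
registry.register("vm-misc-03", "unknown-role")
```

The point of the design is that checks are maintained once per template rather than once per host, so a new host costs the administrator no additional work.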

A health insurance medical service has a similar view. Here too, the problems were primarily related to scaling, after monitoring some 3,000 services on approximately 400 hosts with Nagios Core 3.5.1. The institution remained with Nagios but used the broker module Mod_Gearman for load distribution at the same time. The users were satisfied with the results: A significant reduction in load was achieved through distribution to satellite systems, and the necessary configuration changes were kept manageable.

Hendrik Santos from drugstore chain Dirk Rossmann GmbH's IT department also faced performance problems. He had Icinga 1 monitoring 350 server systems with about 10,000 services. This resulted in the described involuntary extension of the check intervals, eventually forcing a move to Check_MK. Rossmann uses this today to monitor 900 network and server systems with around 30,000 services in 60-second cycles.

The passive checks by Check_MK have eliminated the speed bottlenecks, and Santos has also benefited from other features: the large number of predefined checks; convenient administration via the WATO web GUI [13], which can also be operated by non-specialists; and the effective coordination of all components concerned.

Santos said: "We have had very good experiences in production from the outset with Check_MK as the agent. In the evaluation, commercial solutions turned out to be poor alternatives because of a lack of real benefits and the related strategy change (vendor lock-in). Other solutions in the Nagios environment did not pass our practical test. Using Check_MK with other add-ons (Multisite [14], WATO, PNP4Nagios [15], NagVis) worked straightaway in practice and helped us cover almost all of our requirements."

The classic problems with performance and configuration were also crucial for the IT group leader at a church institution that tried to keep track of 85 hosts and 1,600 services with Nagios 3.5. The performance problems became drastically worse, particularly when virtual systems were also targeted for inclusion in monitoring, and prompted a switch to Check_MK.

The ability to migrate gradually over a period of time was another point in favor of this solution. The users are also satisfied with the numerous plugins, the clear-cut dashboard, and the performance, which is barely affected by the number of monitored systems.
