Sensu — A powerful and scalable monitoring solution

Client Installation

The client systems only communicate with RabbitMQ, which makes it very easy to install the Sensu client on the computers you want to monitor. We use the same Sensu package and the same config.json as on the server but only enable the sensu-client service.

When the client first starts, it automatically registers with the server. From this moment on, the Sensu server expects regular signs of life in the form of keepalive messages. In the default configuration, Sensu raises an alarm if a client has not phoned home within the past three minutes.
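The keepalive rule can be pictured with a few lines of Ruby. This is only a sketch: the method name, the constant, and the timestamps are hypothetical and not taken from Sensu itself; the 180-second threshold simply mirrors the three minutes mentioned above.

```ruby
# Hypothetical helper, not part of Sensu: decide whether a client is overdue.
KEEPALIVE_THRESHOLD = 180 # seconds; the three-minute default described in the text

# last_keepalive and now are Unix timestamps (seconds).
def keepalive_overdue?(last_keepalive, now = Time.now.to_i, threshold = KEEPALIVE_THRESHOLD)
  (now - last_keepalive) > threshold
end

now = 1_365_270_842
puts keepalive_overdue?(now - 60, now)   # false: last keepalive one minute ago
puts keepalive_overdue?(now - 240, now)  # true: last keepalive four minutes ago
```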

Like any monitoring system, Sensu performs checks to verify the status of certain system components. Unlike Nagios, Sensu does not support host-based checks. Checks are always performed by a Sensu client. The client fields the check output and then dumps it on the central message bus for processing by the server. You can develop Sensu checks in any programming language that can output text to stdout.

If you are starting out with Sensu, you will probably want to begin with status checks that reflect the current state of the system. Sensu distinguishes among the following:

  • Passive checks requested by the Sensu server
  • Active checks that the Sensu client performs without a request
  • External events, which separate applications transmit to the Sensu client

Sensu expects the results of status checks in Nagios format. Sensu's support for Nagios format makes it extremely simple for users who are familiar with Nagios to start writing their own checks. Also, Nagios support means a huge number of ready-to-use checks are available from the outset. In fact, I was able to relieve my overtaxed Nagios system by passing many critical checks to Sensu, thus finally enjoying up-to-date monitoring results once again; #monitoringlove flooded the team.
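To give an idea of what a check in Nagios format looks like, here is a minimal Ruby sketch. It relies on the Nagios plugin convention of one status line on stdout plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The method names, the thresholds, and the hard-coded measurement are illustrative assumptions, not part of any existing plugin.

```ruby
#!/usr/bin/env ruby
# Minimal Nagios-style check sketch: one status line plus an exit code of 0..3.
STATUS_NAMES = { 0 => 'OK', 1 => 'WARNING', 2 => 'CRITICAL', 3 => 'UNKNOWN' }.freeze

# Map a free-space percentage to an exit code (thresholds are illustrative).
def disk_status(free_percent, warn = 10, crit = 5)
  return 3 unless free_percent.is_a?(Numeric)
  return 2 if free_percent < crit
  return 1 if free_percent < warn
  0
end

# Return the status line and the exit code for a given measurement.
def run_check(free_percent)
  code = disk_status(free_percent)
  ["DISK #{STATUS_NAMES.fetch(code)}: #{free_percent}% free", code]
end

line, code = run_check(42) # hypothetical value; a real check would query the filesystem
puts line
# A real check script would finish with: exit code
```

The Sensu client only needs the printed line and the exit status, which is why existing Nagios plugins can be reused unchanged.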

The Sensu server initiates most of the checks. Sensu always addresses the prompt for a check to a group of subscribers. Listing 2 shows a simple configuration that connects the client to the all and test groups. Thanks to this publish/subscribe process, a single request to the server is all it takes to perform a routine task on a massive scale, such as querying the free disk space on several hundred clients.

Listing 2

/etc/sensu/conf.d/client.json

```json
{
  "client": {
    "name": "client1.example.com",
    "address": "10.0.10.1",
    "subscriptions": [ "all", "test" ],
    "disk_warn": "10%",
    "disk_crit": "5%"
  }
}
```
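The subscription matching behind the publish/subscribe process can be pictured with a small Ruby model. This is purely illustrative: Sensu routes check requests through RabbitMQ rather than looping over clients, and the second and third client names are made up for the example.

```ruby
# Illustrative model only: which clients would react to a check request?
clients = {
  'client1.example.com' => %w[all test],
  'client2.example.com' => %w[all],
  'client3.example.com' => %w[production]
}

# Return the names of clients subscribed to at least one of the check's subscribers.
def recipients(clients, subscribers)
  clients.select { |_name, subscriptions| (subscriptions & subscribers).any? }.keys
end

puts recipients(clients, %w[all]).inspect   # client1 and client2
puts recipients(clients, %w[test]).inspect  # client1 only
```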

Listing 3 shows the configuration for a typical check. Each client that has subscribed to receive all group messages will, when prompted by the server, perform the check defined in command (at 60-second intervals) and return its output to the server via RabbitMQ.

Listing 3

Disk Check

~~~json
{
  "checks": {
    "disk_free": {
      "type": "status",
      "subscribers": [ "all" ],
      "handlers": [ "default" ],
      "command": "/usr/lib/nagios/plugins/check_disk -w :::disk_warn::: -c :::disk_crit::: -A -x /dev/shm -X nfs -i /boot",
      "interval": 60
    }
  }
}
~~~

The sample check works with variables that can take specific values for the respective client. A name with three colons on the left and right serves as a placeholder for a variable. The Sensu client takes its local value from the client.json file (Listing 2).
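The substitution itself is easy to picture. The following Ruby sketch replaces each `:::name:::` placeholder with the matching client attribute; it illustrates the idea and is not Sensu's actual implementation. The method name and the error handling for missing attributes are assumptions.

```ruby
# Sketch of ":::name:::" placeholder substitution; not Sensu's actual code.
def substitute_tokens(command, client)
  command.gsub(/:::(\w+):::/) do
    key = Regexp.last_match(1)
    # Fail loudly if the client definition lacks the attribute (assumed behavior).
    client.fetch(key) { raise KeyError, "client has no attribute '#{key}'" }
  end
end

client = { 'name' => 'client1.example.com', 'disk_warn' => '10%', 'disk_crit' => '5%' }
command = '/usr/lib/nagios/plugins/check_disk -w :::disk_warn::: -c :::disk_crit:::'
puts substitute_tokens(command, client)
# prints: /usr/lib/nagios/plugins/check_disk -w 10% -c 5%
```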

In addition to interval, Sensu also supports other options for managing checks. For example, you might want to configure the system to send a notice to the server only after several failed checks (occurrences) in a row. Sensu also has a feature for handling rapid state changes (flapping).
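The idea behind occurrences can be sketched as a small filter. The method below is hypothetical and not Sensu's own code; it also assumes that the event carries a counter of consecutive failures under `:occurrences` and the check result under `:check`/`:status`.

```ruby
# Hypothetical filter: notify only after `required` consecutive failures.
# The event layout (:occurrences, :check => :status) is an assumption.
def notify?(event, required)
  failing = event[:check][:status] != 0
  failing && event[:occurrences] >= required
end

event = { check: { name: 'disk_free', status: 2 }, occurrences: 2 }
puts notify?(event, 3)                        # false: only two failures so far
puts notify?(event.merge(occurrences: 3), 3)  # true: third failure in a row
```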

The standalone check is used if the client actively needs to initiate a check (i.e., independent of the server). Listing 4 shows an example of a locally controlled MySQL check that the client executes every 30 seconds. Active checks are simpler than passive checks because they do not require configuration and management on the server. A JSON file created manually on the client is all it takes to enable an active check.

Listing 4

Active Check

```json
{
  "checks": {
    "mysql_server": {
      "standalone": true,
      "interval": 30,
      "handlers": [
        "default"
      ],
      "command": "/usr/lib/nagios/plugins/check_mysql -u 'monitoring' -p 'db1ch3ck'"
    }
  }
}
```

Active checks are useful for monitoring short-lived servers that do not justify the initial centralized configuration overhead. You can use your configuration management tool to set up active checks (see the "Cooking with Chef" box). Active checks are also useful if you need them to run at specific times, because the publish/subscribe process used with passive checks cannot guarantee a specific time.

Cooking with Chef

The Sensu cookbook for setting up active checks defines a simple Chef resource named sensu_check. Listing 5 contains a recipe fragment that sets up the check through Chef.

Listing 5

Chef Resource for Active Checks

~~~ruby
sensu_check 'mysql_server' do
  # The password comes from a node attribute instead of being hard-coded
  command "/usr/lib/nagios/plugins/check_mysql " + \
          "-u 'monitoring' " + \
          "-p '#{node['mysql']['server_mon_password']}'"
  handlers ['default']
  standalone true
  interval 30
end
~~~

You do not need to develop special checks if you want Sensu both to process status information from the system and to monitor events from an external application. The application can transmit its data to the local Sensu client directly via port 3030. Listing 6 shows how easy this is with a sample shell script. The Sensu shell helper [5] has proven itself in practice because Sensu expects external events in JSON format, which can be difficult to create with shell commands.

Besides status information, the Sensu client can also collect run-time metrics. Listing 7 shows the definition of a check that runs a Ruby script to measure the system load. As with status checks, the output format for run-time metrics is kept deliberately simple. As you can see from Listing 8, Sensu expects one measuring point per line, consisting of a hierarchical metric ID, the measured value, and a time stamp.

Listing 6

Transferring External Events

~~~bash
echo '{ "name": "my_check", "output": "{ ... }", "status": 0 }' > /dev/tcp/localhost/3030
~~~
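If the application is written in Ruby, it can build the JSON document with the standard library instead of assembling it by hand. The following sketch uses the same three fields as Listing 6; the method names and the example output text are hypothetical, and the actual send is commented out because it needs a Sensu client listening on port 3030.

```ruby
require 'json'
require 'socket'

# Build the JSON document for an external event (name, output, status).
def build_event(name, output, status)
  JSON.generate('name' => name, 'output' => output, 'status' => status)
end

# Send the event to the local Sensu client socket (host and port as in the text).
def send_event(payload, host = 'localhost', port = 3030)
  TCPSocket.open(host, port) { |socket| socket.write(payload) }
end

payload = build_event('my_check', 'backup finished', 0)
puts payload
# send_event(payload)  # uncomment on a host where a Sensu client is listening
```

Letting the JSON library handle quoting avoids the escaping problems that make shell-built events error-prone.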

Listing 7

Check for Run-Time Metrics

~~~json
{
  "checks": {
    "load_metrics": {
      "type": "metric",
      "command": "load-metrics.rb",
      "subscribers": [
        "production"
      ],
      "interval": 10
    }
  }
}
~~~

Listing 8

Metric Check

~~~
$ ruby load-metrics.rb
srv3.local.load_avg.one     0.89  1365270842
srv3.local.load_avg.five    1.01  1365270842
srv3.local.load_avg.fifteen 1.06  1365270842
$ echo $?
0
~~~
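A script that produces output of this kind needs very little code. The sketch below is not the load-metrics.rb plugin itself; it is an illustration that assumes the load averages are available as text in the format of Linux's /proc/loadavg, and the method names are made up for the example.

```ruby
# Format one measuring point as "<metric id> <value> <timestamp>".
def metric_line(path, value, timestamp)
  "#{path} #{value} #{timestamp}"
end

# Turn /proc/loadavg-style text into three metric lines for the given host.
def load_metrics(loadavg_text, host, timestamp)
  one, five, fifteen = loadavg_text.split.first(3).map(&:to_f)
  { 'one' => one, 'five' => five, 'fifteen' => fifteen }.map do |name, value|
    metric_line("#{host}.load_avg.#{name}", value, timestamp)
  end
end

sample = '0.89 1.01 1.06 1/234 5678' # hypothetical contents of /proc/loadavg
puts load_metrics(sample, 'srv3.local', 1_365_270_842)
```

On a live system, the sample string would be replaced by `File.read('/proc/loadavg')` and the timestamp by `Time.now.to_i`.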

The event handlers on the server evaluate the event once the Sensu client has run the check and returned the results on the message bus. As soon as a new event arrives on the bus, Sensu passes it on (as usual in JSON format) to the relevant event handler.

Sensu distinguishes the following types of event handlers:

  • Pipe: A system command executes this type of routine and passes the event data to it via stdin.
  • TCP, UDP: These two types write event data to a TCP or UDP socket.
  • Transport: This type internally publishes event data on a transport channel in Sensu, typically RabbitMQ.
  • Group: An event handler group sends the event data to a group of event handlers. Adding a single event handler to a group thereby effectively defines an alias name.

Sensu can associate a wide range of actions with an event. Possible actions include:

  • Notification via email or text message
  • Messages on chat channels
  • Alerting via PagerDuty
  • Forwarding of run-time metrics to Graphite
  • Generating log entries for evaluation in Logstash

Listing 9 shows how easy it is to process a monitoring event in an event handler. This simple Ruby script is stored in /etc/sensu/handlers/file.rb and receives events in JSON format, which it writes to files that are formatted to be readable by humans. The new event handler is configured in /etc/sensu/conf.d/handlers/default.json as a Pipe plugin (Listing 10). It might be easy to build your own event handler, but you can save yourself the trouble in most cases. The Sensu community has collected an extensive repository of ready-to-use plugins on GitHub [6]. The repository contains more than 600 checks, event handlers, and other Sensu extensions.

Listing 9

Event Handler

~~~ruby
#!/usr/bin/env ruby

require 'rubygems'
require 'json'

# Read event data
event = JSON.parse(STDIN.read, :symbolize_names => true)
# Write the event data to a file
file_name = "/tmp/sensu_#{event[:client][:name]}_" + \
            "#{event[:check][:name]}"
File.open(file_name, 'w') do |file|
  file.write(JSON.pretty_generate(event))
end
~~~

Listing 10

Integrating the Event Handler

~~~json
{
  "handlers": {
    "file": {
      "type": "pipe",
      "command": "/etc/sensu/handlers/file.rb"
    }
  }
}
~~~

Automatic Remedies

Would it not be cool if your monitoring system could fix errors as well as detect and report them? Writing an event handler that initiates appropriate measures is not too difficult. However, because the event handler runs on the Sensu server and the error occurs on a client, you need a mechanism to bridge this gap.

At freistil IT, we experimented with the remote execution tool Serf for freistilbox.com. However, smart Sensu users realized that it was not necessary to use two different applications that both ultimately use their own messaging systems to transport actions and events. This realization led to the Sensu Remediator plugin.

Using this plugin, I could assign a three-stage repair strategy to the check. A suitable command was executed on the client at each stage; for this purpose, the plugin smartly "misappropriated" the Sensu checks. In the example (Listing 11), the plugin first triggers a reload when the check enters a WARNING status. If the status remains unchanged, the plugin tries a restart instead. If a CRITICAL status occurs, the system responds by rebooting.

Listing 11

Self-Healing Infrastructure

```json
{
  "checks": {
    "check_foo": {
      "command": "check-procs.rb ...",
      "interval": 30,
      "subscribers": ["application_server"],
      "handlers": ["debug", "slack", "remediator"],
      "remediation": {
        "light_remediation": {
          "occurrences": [1, 2],
          "severities": [1]
        },
        "medium_remediation": {
          "occurrences": ["3-5"],
          "severities": [1]
        },
        "heavy_remediation": {
          "occurrences": ["1+"],
          "severities": [2]
        }
      }
    },
    "light_remediation": {
      "command": "service foo reload",
      "subscribers": [],
      "handlers": ["debug"],
      "publish": false
    },
    "medium_remediation": {
      "command": "service foo restart",
      "subscribers": [],
      "handlers": ["debug", "slack"],
      "publish": false
    },
    "heavy_remediation": {
      "command": "sudo reboot",
      "subscribers": [],
      "handlers": ["debug", "slack"],
      "publish": false
    }
  }
}
```

The three repair "checks" are deliberately defined without subscribers; the plugin always prompts the affected client to run them. For this approach to work, the client must have a subscription that matches its own host name (Listing 12).
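How the occurrences and severities in Listing 11 select a repair stage can be sketched in Ruby. This is an interpretation of the configuration shown there, not the Remediator plugin's source code: the method names are hypothetical, and the reading of "3-5" as an inclusive range and "1+" as "one or more" is an assumption based on the notation.

```ruby
# Does the occurrence count match one spec: an Integer, "3-5", or "1+"?
def occurrence_match?(spec, count)
  case spec
  when Integer then count == spec
  when /\A(\d+)-(\d+)\z/ then count.between?($1.to_i, $2.to_i)
  when /\A(\d+)\+\z/ then count >= $1.to_i
  else false
  end
end

# Return the names of all remediation checks whose rules match the event.
def remediations_for(remediation, occurrences, severity)
  remediation.select do |_name, rule|
    rule['severities'].include?(severity) &&
      rule['occurrences'].any? { |spec| occurrence_match?(spec, occurrences) }
  end.keys
end

remediation = {
  'light_remediation'  => { 'occurrences' => [1, 2],  'severities' => [1] },
  'medium_remediation' => { 'occurrences' => ['3-5'], 'severities' => [1] },
  'heavy_remediation'  => { 'occurrences' => ['1+'],  'severities' => [2] }
}
puts remediations_for(remediation, 1, 1).inspect  # ["light_remediation"]
puts remediations_for(remediation, 4, 1).inspect  # ["medium_remediation"]
puts remediations_for(remediation, 7, 2).inspect  # ["heavy_remediation"]
```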

Listing 12

Self-Subscription

```json
{
  "client": {
    "name": "client1.example.com",
    "address": "10.0.10.1",
    "subscriptions": [
      "all",
      "test",
      "client1.example.com"
    ]
  }
}
```

Sensu is very unobtrusive in day-to-day operations, at least as long as no errors occur. As a sys admin, you hardly ever interact with Sensu directly, especially if you use external services such as PagerDuty for alerting. If you do need to interact with the monitoring system, doing so via the web dashboard is simple and efficient. You can acknowledge alerts or even shut them off for a while using the silence function.

Anyone who prefers to use Sensu without a mouse should try sensu-cli [7]. This command-line application can acknowledge alerts:

sensu-cli resolve server3 apache_http

or temporarily silence them:

sensu-cli silence server3 --reason "Shut up already" --expire 3600

Because every new Sensu client registers with the server automatically, you need to delete this registration when the client no longer exists:

sensu-cli client delete server3

This step avoids unnecessary alerts and is easy to do.

ChatOps

Many companies, especially if their employees are geographically dispersed, use chat for team communication. The chat system becomes the central source of information if you enrich team messages with system messages. In this way, everyone finds out about new Git commits or changes in the wiki without delay, and team members can exchange information on the spot. Sensu comes with event handlers for several common chat systems (IRC, Slack, Campfire, etc.).

The quantum leap from a central source of information to ChatOps is achieved by implementing a back channel in the form of a chatbot. This bot is tasked with receiving instructions from the chat and interacting with various ops services.

GitHub's Hubot [8] is the classic chatbot; on freistilbox, the team had great fun with Lita [9]. Besides simply acknowledging an alert with a simple pagerbot ack 1234 or quickly taking over standby duties for a colleague with pagerbot put me on firstlevel for 1 hour, members were also able to communicate these actions instantly to the rest of the team.
