Cool and Collected with collectd

Tutorials – Collectd

Article from Issue 194/2017

The collectd tool harvests your system stats and stores them for plotting into colorful graphs.

Why do this?

  • Find out all there is to know about your Linux box
  • Optimize your setup for maximum performance
  • Generate pretty graphs to impress your technically inept boss

Linux has a host of command-line tools for probing what's going on under the hood. To list just a few examples, there's top, which shows what processes are using which resources, df, which shows how much disk space is free, and iftop to show how much data is flowing through the network.

If you really want to get fine-grained details, you can poke about in the /proc filesystem that the kernel automatically populates with precise details of everything that happens, but it can be a little too low-level for mere mortals to understand. Although all these tools are useful, they all have one fatal flaw: They only tell you what's happening right now. In this article, we're going to look at collectd [1], which hoovers up your system stats and stores them, ready to be interrogated and plotted into colorful graphs.

You'll probably find collectd and kcollectd (the graphical front end we'll be using) in your package manager. On Debian, Ubuntu, and derivatives, you can install them with:

sudo apt install collectd kcollectd

On non-Debian-based distros, you might also have to start the collectd service to ensure it's collecting details. You can see if it's running with:

sudo service collectd status

If it's not running, you can start it with:

sudo service collectd start

You can ensure it starts every time you boot your machine with:

sudo chkconfig collectd on

If your distro doesn't have the service or chkconfig commands, you'll need to consult the documentation on managing services. At the most basic level, that really is all there is to it. Once you've installed collectd, it'll run in the background, and when you launch kcollectd, you can build graphs showing how system resources have been used over time.
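Many current distros manage services with systemd rather than the service and chkconfig commands. As a rough sketch, the helper below prints (but does not run) the matching commands for whichever tool it finds on your PATH. The function name is made up for this example, and it assumes the service is called collectd, which you should check against your distro's package.

```shell
#!/bin/sh
# Hypothetical helper: print the commands that start collectd now and at
# every boot, based on which service manager is installed.
# Assumes the service/unit is named "collectd".
collectd_boot_commands() {
    if command -v systemctl >/dev/null 2>&1; then
        # systemd: enable at boot and start immediately in one step
        echo "sudo systemctl enable --now collectd"
    elif command -v chkconfig >/dev/null 2>&1; then
        # SysV-style tools, as used in the article
        echo "sudo service collectd start"
        echo "sudo chkconfig collectd on"
    else
        echo "No known service manager found; check your distro's documentation" >&2
        return 1
    fi
}

collectd_boot_commands || true
```

On a systemd machine, this prints a single systemctl line; copy and run it once you're happy that it matches your setup.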

On the left-hand side of the kcollectd window, a Sensordata widget should show a list of all the hosts for which you have data. Initially this will just be the host you're currently running on, but if you set up networking (see the "Networking" box), the other hosts will appear in the list on the server. If you expand the host, you'll see a list of all the plugin instances for which collectd has data. So, for example, the CPU monitoring plugin (Figure 1) monitors each CPU core separately, so you'll see cpu-0 to cpu-(n - 1) (on an n-core machine).

Networking

In this tutorial, we've looked at monitoring one machine with collectd. This works really well, and it's a great addition to your Linux setup. However, collectd is also useful for monitoring multiple machines at a time. Using the network plugin, each collectd instance can act as either a client sending data or a server collecting data. The server collates data from all the clients, and you can use any of the graphing tools to compare different machines. The basic setup for a client is:

LoadPlugin network
<Plugin network>
        Server <server>
</Plugin>

Where <server> is replaced with the IP address or hostname of the server. The basic setup for a server is:

LoadPlugin network
<Plugin network>
        Listen "0.0.0.0"
</Plugin>

Typically, this will all happen within a local network, so it's often fine to run without encryption. If you do need to send your data over a public network, then consult the collectd man page for details on how to ensure it remains private.

Figure 1: Graphing CPU usage. It turns out you could push this machine quite a bit harder.

Each of these plugin instances has one or more pieces of data associated with it, so sticking with the cpu plugin, if you open cpu-0, you'll see data for the various states of the CPU such as cpu-idle (the time the CPU's not doing anything), cpu-wait (the time the CPU's waiting for disk access), and cpu-user (the amount of time executing userspace code). If you drag all of these from the left side of the screen onto the right, you'll get a graph showing what your CPU's been up to. The buttons at the bottom allow you to select the time period to view. Because this graph has only just been set up, the last hour button will be most useful, because the others will be mostly empty.

You can plot more than one graph on a kcollectd window by right-clicking and selecting add new subgraph. When you have a view you like, you can save the set of graphs – this only saves the layout, not the data itself, which is stored in collectd. Kcollectd is a good starting point for exploring collectd, but you might want to monitor servers or other machines without graphical displays. A number of other front ends to collectd might be more suitable in this instance. See the "Collectd-web" box for one that's easy to use.

Collectd-web

If you want to use collectd on a server (or other headless box), kcollectd is a little awkward. You could launch it via SSH with X forwarding enabled (by adding the -X flag to your ssh command) or use collectd's networking options to send the data to a machine with a display, but neither of these options is really ideal. Because collectd just handles collection and storage of data, plenty of alternative graphical front ends are better suited to network access. Perhaps the easiest to use of these is collectd-web.

There's no installation for this software, but you will need a few dependencies, which you can download with:

sudo apt-get install librrds-perl libjson-perl libhtml-parser-perl

Once this is done, just download the code from GitHub with:

git clone https://github.com/httpdss/collectd-web.git

Then cd into the new collectd-web folder and run:

python runserver.py

By default, this will run a server on port 8888 that you can view by pointing your web browser to http://localhost:8888.

Collectd-web works slightly differently from kcollectd in that it automatically graphs all the data for a given plugin instance (Figure 2).

Although collectd-web does a good job of making it easy to view your data from other machines on the network (Figure 3), power users might want a little more functionality. Some really powerful visualization tools can work on top of collectd, but they can require a bit more setup and are beyond the scope of this article. Here are a couple of our favorites:

  • Elasticsearch, Logstash, and Kibana (aka ELK). Collectd can forward data to a Logstash server, which in turn stores it in Elasticsearch. This data can then be visualized with the Kibana web visualization tool. If you use Logstash for analyzing your logfiles as well, this can provide a really powerful way of monitoring everything that's happening on your system. The Timelion plugin to Kibana adds powerful additional graphing tools for really seeing what's going on.
  • Grafana can take data from collectd either through Elasticsearch (as above) or via the Graphite database. This is perhaps a little more flexible than ELK if you're not also using Logstash for managing your log setup.

Just as plugin-based monitoring is flexible enough to work with almost any system, the selection of front ends means that you can make collectd work with the system you have rather than having to orient your monitoring around collectd. The best advice we can offer is to start with a simple setup and then refine it as you learn which data and graphing tools are most useful to you.

Figure 2: The collectd-web interface makes it easy to access data across the network.
Figure 3: You can export graphs from collectd-web in a variety of formats, including PDF, which makes them easy to share with other people.

Every system has different limitations and problems, so there's no one set of graphs or data that are best for all collectd instances. Instead, collectd is incredibly flexible and lets you set it up for the data you need.

Although collectd will run on most distros with the default settings, you can tweak a lot of things to make it more useful. The collectd config file is usually located at /etc/collectd/collectd.conf (although some distros have it at /etc/collectd.conf). Open that file in your favorite text editor (you'll need root privileges), and you'll find an XML-like format for the configuration with some data (but not all) enclosed in tags.

By default, collectd collects data every 10 seconds, as set by the Interval option in collectd.conf. This is a good starting point, but it isn't always best. If you want more detailed data and have plenty of storage space, you might want to decrease this number. On the other hand, if you're pulling data together from lots of different machines, you might want to increase it so you avoid swamping your main collectd instance.
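To get a feel for the trade-off, it helps to count how many readings a single value produces per day at a given interval. The snippet below is only back-of-the-envelope arithmetic; the function name is ours and is not part of collectd.

```shell
#!/bin/sh
# Readings taken per day for one value at a given collection interval.
# 86400 is the number of seconds in a day; $1 is the interval in whole seconds.
samples_per_day() {
    echo $(( 86400 / $1 ))
}

for interval in 10 30 60; do
    echo "Interval $interval: $(samples_per_day "$interval") readings per value per day"
done
```

At the default of 10 seconds, that works out to 8,640 readings per value per day; at 60 seconds, it drops to 1,440. Multiply by the number of values and hosts to estimate the load on a central collectd instance.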

Let's look at some things that could be particularly interesting to the home PC user. Many modern machines can scale the CPU frequency depending on system load. This means that as you run more things, the system raises the clock rate of your processor so it can perform more work, but as the load decreases, the clock rate drops back so that it's more power efficient. This is a useful feature, but it does mean that it's not always clear how much of your CPU's capacity you're using.

Connected to the CPU frequency is the temperature. Faster CPUs run hotter, and if they get too hot, your system might crash or, potentially, damage itself. Hot CPUs can be a sign of insufficient cooling (possibly caused by dust-clogged vents) or simply of the system running faster than it can handle.

The functionality of collectd comes from plugins [2], and the two you need are cpufreq and thermal (Figure 4). To load these, just add the following lines to collectd.conf:

LoadPlugin cpufreq
LoadPlugin thermal

Now restart collectd with:

sudo service collectd restart

Figure 4: A lot of collectd plugins are available; head to the project's wiki to find out how to configure them.

Re-open kcollectd and you should see new options for cpufreq and thermal in the sensor data list (Figure 5). You can drag these over to the graph to see what's going on. If they aren't there, you might need to install the plugins separately on your distro, so check the package manager for available software.
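If the new entries still don't appear, it's worth checking that the kernel exposes the underlying data at all. On Linux, the cpufreq plugin reads per-core files under /sys/devices/system/cpu, and the thermal plugin reads /sys/class/thermal. The sketch below simply prints whatever it finds there; the function name is made up, and these paths may be missing on virtual machines or some hardware.

```shell
#!/bin/sh
# Hypothetical check: print the raw values behind the cpufreq and thermal plugins.
# Frequencies are in kHz and temperatures in thousandths of a degree Celsius.
show_sensor_sources() {
    found=0
    for f in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq \
             /sys/class/thermal/thermal_zone*/temp; do
        # An unmatched glob stays as literal text, so skip anything unreadable
        [ -r "$f" ] || continue
        echo "$f: $(cat "$f" 2>/dev/null)"
        found=1
    done
    [ "$found" -eq 1 ] || echo "No cpufreq or thermal data exposed in sysfs"
}

show_sensor_sources
```

If the files are there but kcollectd shows nothing, the plugins themselves are the more likely culprit.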

Figure 5: You can put any data you want together on a graph in kcollectd, even if the units are completely different. It's up to you to ensure it makes sense.

Neither of these plugins required any configuration to use, but some others require a little bit of tweaking to make sure they are doing what you want. For example, by default, the processes plugin will gather information about the number of processes running on your machine, but not details about individual processes. With a little tweaking, you can change that.

You don't want to grab all the available info, because you'll quickly fill up your hard drive with information you'll never need, but it can be useful to pull out particular software to see what's going on. When using collectd on a server, this could be information about the various services that are running, but on your desktop, you might find that most of your performance issues come from the Firefox web browser. Let's change the configuration to see how many resources it's using.

Back in the collectd.conf file, you need to add a configuration. There are two ways to find particular processes to match: by name or by regular expression. Because Firefox is always launched with the firefox command, you can use this. However, if you need to monitor something that's launched in different ways (or you want to monitor a range of different processes), look at the collectd man page for more details.

The configuration can go anywhere in the configuration file, provided it's below the LoadPlugin processes line:

<Plugin "processes">
        Process "firefox"
</Plugin>

Once you've made this change, you'll need to restart collectd for it to take effect.
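Before restarting, you might want to confirm that a process with exactly that name is running. The sketch below uses pgrep -x, which requires the whole process name to match. Treat it as an approximation of how the plugin matches names rather than a guarantee; the function name is ours.

```shell
#!/bin/sh
# Hypothetical helper: count running processes whose name is exactly $1.
# pgrep -x matches the whole process name, not just part of it.
count_by_name() {
    # pgrep prints one PID per line (nothing if there is no match);
    # awk reports how many lines it saw.
    pgrep -x "$1" | awk 'END { print NR }'
}

echo "firefox processes running: $(count_by_name firefox)"
```

A count of zero while the browser is open suggests the binary runs under a different name on your distro, in which case matching by regular expression is the better route.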

Collectd sucks up vast quantities of data about what's going on in your computer, but it's not always easy to make sense of it. Let's look at some of the key information coming in from the processes plugin. If you open kcollectd, you should now see a processes-firefox entry containing a range of options. The most important of these are the ps_cputime values for system and user, which give the average amount of time (in microseconds) the process uses every second. The system entry gives the amount of time spent in system calls, whereas the user entry gives the amount of time spent in userspace. The second most important is the ps_rss value; here, rss stands for resident set size, which is the amount of physical memory (not including swap) that the process is currently using.
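Because there are a million microseconds in a second, a ps_cputime reading converts to a percentage of one core by dividing by 10,000. The helper below is just that arithmetic; the function name is made up, and it takes the unit described above (microseconds per second) at face value.

```shell
#!/bin/sh
# Convert CPU time in microseconds per second into a percentage of one core.
# 1,000,000 us/s would be 100%, so percent = us / 10000.
cputime_to_percent() {
    awk -v us="$1" 'BEGIN { printf "%.1f\n", us / 10000 }'
}

# Example: a quarter of a second of CPU time used every second
cputime_to_percent 250000
```

The example reading of 250,000 comes out as 25.0, meaning the process keeps one core busy a quarter of the time. Add the system and user figures together for the total.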

What we've looked at so far allows us to tune the behavior of collectd using prebuilt features. However, we all have different setups and different requirements, and it's unrealistic to expect a monitoring system to have all the capabilities we need for all our computers. Perhaps the best feature of collectd is that it's easy to write your own add-ons.

Although these can be full plugins that collectd loads, there's a much easier text protocol that you can take advantage of using the exec plugin. This plugin will run some command and take values from the output. You can use any programming language that can print text output to add data to collectd. We're going to use a Bash script, but as you'll see, it's trivial to convert this to another language.

One thing we're particularly prone to is filling our Downloads directories. Trying out different distros can quickly overwhelm even large hard drives, so we've made a plugin to monitor the amount of space taken up by just this directory. To find this out, enter the command:

du -sb /home/ben/Downloads | cut -f1

This uses du (disk usage) with the s (summarize) and b (bytes) flags to get the amount of data in Downloads. The cut tool then removes everything from the command's output except the first field, which contains the data you're looking for. You need to translate this into collectd's plain-text protocol, which means your script should output lines in the form:

PUTVAL ben-All-Series/downloads/counter-used interval=10 N:5638101902

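In this format, PUTVAL is the command that sends data to collectd, ben-All-Series is the name of the computer this tutorial was written on, downloads is the plugin name, counter is the data type, and used is the identifier for this particular value. The N before the colon tells collectd to timestamp the value with the current time. Bear in mind that collectd records a counter as a rate of change; if you'd rather plot the absolute size of the directory, the gauge type is the one to use instead. To create this output, you just need to wrap the du command in a loop with an echo statement that writes the line to stdout at the frequency collectd passes in the COLLECTD_INTERVAL environment variable. See Listing 1 for details.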

Listing 1

Monitoring the Downloads Directory

#!/bin/bash

# collectd's exec plugin sets these two variables; the defaults allow manual testing
HOSTNAME="${COLLECTD_HOSTNAME:-$(hostname -f)}"
INTERVAL="${COLLECTD_INTERVAL:-10}"

# Report the size of the Downloads directory once per interval
while sleep "$INTERVAL"
do
  USED=$(du -sb /home/ben/Downloads | cut -f1)
  echo "PUTVAL $HOSTNAME/downloads/counter-used interval=$INTERVAL N:$USED"
done
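For collectd to run the script, it has to be registered with the exec plugin. As a minimal sketch, the function below prints a configuration fragment for a given user and script path. The function name and the script location /home/ben/bin/downloads.sh are assumptions for illustration. The exec plugin refuses to run programs as root, so you need to name an ordinary user.

```shell
#!/bin/sh
# Hypothetical helper: print an exec plugin fragment for collectd.conf.
# $1 = non-root user to run the script as, $2 = absolute path to the script.
print_exec_config() {
    cat <<EOF
LoadPlugin exec
<Plugin exec>
        Exec "$1" "$2"
</Plugin>
EOF
}

# Example with an assumed script location
print_exec_config ben /home/ben/bin/downloads.sh
```

Add the printed lines to collectd.conf, make the script executable, and restart collectd; a downloads entry should then appear in kcollectd alongside the built-in plugins.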

That's our collectd system up and running. Using these techniques, you should be able to get almost any monitoring information you need, and this should help you use your hardware more efficiently, whether that's working out the best settings for gaming or performance-tuning the enterprise software in your data center.
