Analyzing network flow records

Go with the Flow

© Lead Image © Artur Szydlowski,

© Lead Image © Artur Szydlowski,

Article from Issue 185/2016
Author(s): , Author(s):

Detect operating systems, installed software, and more from easily collected metadata.

What operating systems are installed on your network, and what software is running on them? Questions like these are often posed in IT departments – especially if users are operating their own shadow IT [1] or when documentation, automation, and software distribution need some care and attention. However, you have good reasons to ask these questions: Attackers are also interested in your systems.

Many methods of discovering the current status quo have been developed in recent years; they rely on either actively measuring [2] (e.g., with Nmap) or passively sniffing network traffic [3]. The passive method analyzes all or parts of your network traffic, from which you can draw conclusions. For example, a device that regularly visits the IP address for the domain name would lead to the conclusion that the operating system comes from Microsoft.

In this article, we present a new approach based on network traffic analysis that exclusively considers the widespread and often easily available network communication metadata in the form of flow records. Metadata analysis of network connections can offer many benefits: It requires far less memory and computational power than the analysis of complete packets, it is compliant with data protection, and it does not need port mirroring on the router; moreover, it is comparatively fast.

The example discussed here relies on records from network flows – that is, short snippets of information that state the source and target of an IP packet, among other things, including the protocol used (e.g., TCP, UDP, ICMP) and the transmission volume in bytes. Typical examples of flow records are NetFlow, CFlow, or IPFIX, which each originate with different products and contain different details.

The basic idea is to use the existing information, to the extent possible, to draw conclusions about the data transmitted. Calls to simple websites or any email messages sent will only differ marginally in terms of their footprints in the metadata. For this reason, you cannot say anything about the content of the transmitted messages or, for example, which subpage of a website has been viewed.

However, this is not true of downloads, which allow you to detect clearly distinguishing features based on size alone. Once you have established what the hosts under investigation are downloading, you can easily draw conclusions about the operating system software used – especially when it comes to updates of previously installed programs.

Recording Flow Records

As you can see from Figure 1, flow records can basically be recorded wherever you have access to network traffic. This can mean recording direct communication on your own subnet (1), but also communication within the enterprise on one of the intermediate switches (2). Where network packets are sent to external communication partners, you can also collect flow records on your own enterprise switches (3), the edge router (4), or on the transport route on the Internet (5).

Figure 1: Locations for collecting flow records.

To record metadata actively in the form of flow records, Linux offers you a number of useful tools. The test setup described in the following section uses a combination of Softflowd [4] and Flow-tools [5] to grab network flows from the network traffic and record them as flow records in files.

Recording the Flow

Softflowd can generate Cisco net flow data from the network traffic that it sniffs from a selected network interface or parse from a packet capture file recorded previously; it then sends these data to a flow collector. It always sends complete flows. Only the metadata from the connection is sent to the flow collector (e.g., the source and target address and ports, as well as the minimum, maximum, average, total bytes, and number of packets registered).

Because many devices on the network, such as switches and routers, also generate and transmit Cisco net flows in the same way, Softflowd is particularly well-suited for simulating these devices for test purposes in a virtual environment, without deploying any physical hardware.

As the flow collector that receives and processes the flows sent by Softflowd, we used flow-capture from the Flow-tools collection of programs. Flow-capture saves the received flows in files that can then be analyzed downstream. The files rotate automatically so that a file always stores the flows from a specific time window. All files can be deleted either by date or by volume of hard disk space used.

Both softflowd and the flow-tools are available in the package sources of Debian and other Linux distributions and can be installed from there. To record net flows, you only need to run softflowd specifying the option for the interface to use (-i) and the IP address and port of the target system (-n).

By default, Softflowd immediately runs as a daemon in the background. If you do want to run Softflowd in the foreground, you additionally need to set the -d option.

The example here generates net flows from the network traffic monitored on the eth0 interface and sends the net flows to localhost:

softflowd -i eth0 -n

To make sure Softflowd really is recording flows, the softflowctl program, which is part of the distribution, is a useful option.

The softflowctl statistics command (Listing 1) delivers up-to-date statistics on the analyzed packets. It tells you how many packets Softflowd has processed and how many flows it has detected as expired and exported. You will also want to run Softflowctl with the shutdown option to close down Softflowd gracefully. Before terminating, it sends to the collector any flows that have not yet been sent.

Listing 1

Softflow Statistics


To process the net flows collected by Softflowd, you need to launch flow-capture, which has many settings that determine how it creates the flow record files and specifies the criteria used to rotate or delete the files. The following example shows a simple configuration:

flow-capture -w /tmp/flows -n 287 0/

The -w option specifies the directory in which flow-capture will store the flow records. The -n option lets you specify how many new flow record files are created every day – 287 rotations per day gives you a new file every five minutes. A five-minute interval is a useful choice for test purposes, because you will see a number of flows during, but without the need to wait too long for a file to become available.

The final option specifies the IP and port on which to receive net flows from which host. A zero instead of an IP address means use all addresses; however, it does make sense to state explicitly the host sending the net flows to avoid confusing the results with flows from other systems.

The flow-print tool from the Flow-tools toolkit looks at what the flow records contain. You can see from the connections metadata in Listing 2 that the first flow belongs to a mail client and the others to an HTTPS connection.

Listing 2

flow-print Example


Analyzing Flow Record Metadata

The metadata collected from connections monitored in this way can be put to various uses. For example, you could use it to compute the bandwidth usage per IP or per subnet for billing purposes or to detect deviations from normal communication patterns, such as a massive increase in outgoing connections. The metadata is also suitable as raw material for inventorying the devices on your home network. However, if such data does get into the wrong hands, it can give attackers valuable hints. Also note that the collection of metadata falls under the data retention laws of some countries [6]. As you can see, they are definitely useful for drawing conclusions on content.

All of these reasons make it interesting to determine how much metadata can tell you about the content transferred and with what degree of precision conclusions can be drawn on the transferred data. To investigate this, we set up a web server in a lab environment that hosted 50 files of random size between 1 and 50MB.

The test team ran softflowd and flow-capture on the server, as described above, to collect flow records while other hosts were downloading the files. One system used, in addition to the Debian and SLES Linux distributions, was Windows Server 2008, so we could investigate any differences between operating systems in the lab environment.

The flow records were then stored in a tab-separated file for analysis with flow-print in the Pandas [7] Python data exploration framework. The data was filtered on the basis of source IP address and source port, so that only test file downloads were left. Because the files were always downloaded in the same order, we were able to correlate downloads with files. To classify the records, we deployed the sklearn [8] Python framework, which implements various classification methods.

To teach the classification method, the test team generated 1,250 flow records, from 25 repeats of 50 file downloads, on a Linux host located one hop from the web server. The most efficient classification method turned out to be the decision tree classifier, which only considers the number of bytes transmitted. Additionally taking into consideration the number of packets transmitted did not lead to any improvements; in fact, the results were between 1 and 10 percent worse. With this classification, we achieved an accuracy of 98% in the correlation of flow records to downloaded files.

For a Linux system six hops away, 88% of the downloads were correctly assigned to the corresponding files. However, if packets from the same system detoured via a VPN server, the detection rate dropped to 61%, illustrating that as distance increases, the total number of bytes transmitted deviates by too great an extent to achieve a high degree of accuracy given files of a similar size – probably because of packet loss and retransmission.

The tests on Linux used the wget download tool; on Windows, we downloaded the files with Windows Internet Explorer. What we discovered was that Windows does not open a new port for each download but immediately reuses the port after a download has completed; therefore, the downloads in the flow records cannot be clearly distinguished and the analysis fails. We would need to perform some more tests to determine whether this behavior is browser dependent or possibly also occurs on other systems.

Overall, the test told us that, given sufficient proximity to the network, very good accuracy is possible in terms of the ability to map monitored flow records to files previously analyzed on a test system. This will give the network operator an easy approach to detecting what updates have been installed, but without complex – and typically expensive – deep packet inspection.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Packet Telemetry with Host-INT

    Inband Network Telemetry and Host-INT can provide valuable insights on network performance – including information on latency and packet drops.

  • OpenFlow

    The OpenFlow protocol and its surrounding technologies are bringing the promise of SDN to real networks – and it might not be long before you see them on your real network.

  • Skydive

    If you don't speak fluent Ethernet, it sometimes helps to get a graphical view of what your network is doing. Skydive offers visual insights that could reveal complex error patterns.

  • Argus

    Argus helps you monitor the flow of data on your network, detect trends, discover worms and viruses, and analyze bandwidth usage.

  • Web Videos

    We’ll show you how to convert your videos to FLV format and play them from your website with FlowPlayer.

comments powered by Disqus
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

Learn More