Fighting malware with static and dynamic code analysis

Code Patrol

Article from Issue 155/2013
Author(s):

Linux offers some sophisticated tools for understanding how malware can slip through the gaps in an unsuspecting application.

The potential danger of malware is a concern for many computer users. Your desktop PC, your smartphone, and even the networked smart gadgets around your home or office are potentially vulnerable to thousands of rootkits, spyware, and Trojans. As anti-malware software developers create new methods to discover and block malware, malware authors create new methods of circumventing these safeguards.

Linux users have long enjoyed the minimal attention they receive from malware authors. Malware does exist for the Linux platform, but it has a difficult time infecting most Linux systems. A major strength of open source software is the speed with which exploit-causing bugs are noticed, diagnosed, and fixed. This rapid response limits the time interval in which malicious software can exploit these vulnerabilities. Unless malware can somehow trick a user with a superuser account into installing it, or the system is not properly configured or updated, it often has difficulty infecting Linux.

Windows users have not been so lucky. More than a billion PCs running various flavors of Windows exist worldwide, and such a large installed base of similar software makes it a very tempting target for malware authors. The large size and complexity of the Windows codebase, as well as delays in the availability of patches for issues, provides ample opportunity for vulnerabilities to be discovered and exploited by malware authors.

Surprisingly, Windows has a pair of unusual allies in the fight against malware: Linux and open source software. A number of Linux tools are being utilized to perform sophisticated analyses of Windows malware and to improve software quality. Developers use these tools to create better code that is more secure, and researchers use them to better understand how malware exploits insecure code.

These software analysis tools fall into two general categories: static and dynamic. Static analysis tools examine the binary or source code of a program to better understand how that program is structured and how it reacts to particular inputs. Dynamic analysis tools watch software as it executes to observe its run-time behavior directly. Each of these analysis methods has its strengths and weaknesses, but both are very useful when investigating unknown software. In this article, I will introduce you to some of the concepts used in these analyses and show you a variety of open source tools that security researchers use today to analyze and understand malware.

Static Analysis

Static analysis of software has been around for a very long time, and you've most likely performed many of these analyses yourself. If you have compiled source code, you have performed a static analysis! Every compiler analyzes as it compiles to discover any duplicate symbol names, type-mismatched assignments, uninitialized variables, or unreachable lines of code.

The Clang [1] front end for the LLVM compiler infrastructure has a powerful static analyzer, which allows it to give very detailed compiler error messages and warnings. GCC's standard analysis pass (which is quite a powerful analyzer in its own right) can be augmented via the use of analysis plugins [2] for the compiler. Support for such plugins is available in versions 4.5 and later of GCC.

Every program contains groups of instructions that execute sequentially if no exceptions or context switches occur. Execution of the program flows through these instructions as a single, logical block. In compiler terminology, this logical group of instructions is known as a basic block (BB). When the execution of one BB ends, control flows to the start of another BB and execution continues. The end of a BB is typically where a jump or conditional branch takes place.

Static analysis tries to discover all of these BBs and determine where it is possible for control to flow from one BB to another. This information is collected and represented as a control flow graph (CFG) that describes the "big picture" of how a program executes. To better illustrate this, Listing 1 shows a simple C function, and Figure 1 shows the CFG that describes the BBs and control flow of the function.

Figure 1: The control flow graph (CFG) generated for the simple C function in Listing 1.

Listing 1

A simple C function

01 int simple(int num)
02 {
03     /* Start block A */
04     int i;
05     printf("Block A\n");
06     /* End block A */
07     for (i = 1; i <= num; i++)
08     {
09         /* Start block B */
10         printf("Block B\n");
11         /* End block B */
12     }
13     /* Start block C */
14     printf("Block C\n");
15     return(0);
16     /* End block C */
17 }

Static analysis of this function breaks the code up into three BBs: A, B, and C. Control is guaranteed to always flow through A and C because those BBs are the entrance and exit points for the function. B may never be executed, or it may execute multiple times. These behaviors can be observed by looking at the CFG without ever actually executing the function. Although the control flow of this sample function is very simple to visualize, extremely complex functions are much more difficult to analyze for security issues without generating a corresponding control flow graph.

All data that enters a system must come from some source, such as a file, network communications, or a peripheral like the keyboard or mouse. Data that originates from a source that is considered untrusted (e.g., the network connection) is also considered to be untrusted. Some BBs also contain instructions that implement a sensitive event, such as a system call (memory allocation, reading or writing files, socket communication, etc.).

If untrusted data is allowed to influence these events, there is a possibility of exploiting the software and compromising the system. CFGs simplify the task of figuring out which BBs generate untrusted data, which BBs contain sensitive events, and which control flows connect the two. A primary purpose of malware static analysis is to discover how untrusted data can potentially propagate to sensitive code.

When source code is available, performing a static analysis is straightforward. Complete and unambiguous descriptions of data types are available, and it is trivial to identify function calls, stack-frame boundaries, and control-flow structures. But, the source code for malware is often not available, meaning that the binaries must be statically analyzed to determine their behavior. Although this can be done by examining the binary instruction-by-instruction (which many malware studies have done), it is often more useful to analyze a higher-level source code representation of the binary by decompiling it.

Generating a high-level language representation of a binary's functionality is not trivial. Decompilers must determine which segments of the binary contain code and data, and they must reconstruct complex data types. No decompiler generates source code that is as clear as the original code that the binary was compiled from, and expert manual assistance is often needed to complete and clarify the translation. Several open source decompilers exist for Linux, but the Boomerang decompiler [3] is probably the best one available today. Boomerang is under active development, and it does a relatively good job of generating a C code representation of a binary.

Bytecode-based Java class files are much simpler to decompile because they contain a lot of contextual information (e.g., inheritance and data typing) within their bytecode instructions. This contextual information makes decompiling bytecode simpler and less ambiguous than decompiling native binaries. Many security researchers use the open source Soot suite of tools [4] to decompile Java class files, although a variety of other tools are available under Linux for performing Java decompilation. The Dare (Dalvik Retargeting) tool [5] can be used to convert Android's Dalvik VM dex files into Java class files, which can then be decompiled by tools made for Java bytecode.

Dynamic Analysis

Static analysis is very powerful, but it cannot do a very good job of catching concurrency-related problems in multithreaded software. If it did, compilers would already exist that could detect and warn developers about race conditions and other concurrency-related problems present in code. It is also common for malware to use compressed or encrypted binaries, which adds a layer of obfuscation that frustrates decompilation and static analysis.

Because of these limitations, sometimes the only way to correctly analyze malware is to observe its run-time behavior. This observation is called dynamic analysis. Dynamic analysis is performed at many different levels of detail, ranging from watching the occasional event from an individual processes to watching the instruction-by-instruction execution of the kernel and every single process in the whole system.

Dynamic analysis is extremely powerful, but it also has some big disadvantages. Analysis is only performed on the portions of a program that execute, so the malicious behavior of malware must be executed while the analysis is running. Otherwise, the resulting analysis is pointless. There is also a significant negative impact on system performance because executing monitoring instrumentation every time a CPU instruction is executed can cause a program or system's execution to slow down by an order of magnitude or more!

In 2005, a group of researchers proposed a new method of dynamic analysis called "dynamic taint analysis" that addresses the problem of tracking untrusted data. Instead of trying to perform a static analysis on each and every zero-day malware as soon as it shows up in the wild (which is quite a task, with thousands of new malwares appearing daily) and then creating software to protect against it, why not try to protect systems automatically by tracking untrusted data dynamically as it propagates throughout the system? If execution is stopped before the untrusted data is used to perform a dangerous action, the malware is neutralized without knowing the exact details of the malware's implementation and design.

Dynamic taint analysis considers an untrusted data source to be a "taint source." Any data created by a taint source is "tainted" by marking the data with a "taint tag" identifier. Taint tags indicate which taint source the data originated from. Locations where tainted data can do something potentially dangerous are called "taint sinks." Tainted data is tracked as it is created, copied, transformed, moved, and destroyed by copying and destroying the taint tags associated with the data. If tainted data reaches a taint sink, execution of the system halts and forensic analysis can begin.

It is dangerous to store a taint tag within the same memory space as ordinary data because other data writes can potentially overwrite the taint tag and corrupt or destroy it. It is also difficult to store taint tag information alongside tainted data if the data is moved into a CPU register, because there is no extra space in the register to hold the taint tag. To avoid these problems, all taint tags are stored in "shadow memories." For every potential location where tainted data can reside (system RAM, CPU registers, peripheral device buffers, etc.), there are analogous, separate shadow memories that hold the associated taint tags.

A single taint tag represents the taint associated with a single piece of data. The granularity of the single piece of data depends upon the needs of the analysis. In many cases, representing a single byte of data with each taint tag is usually good enough. For a very detailed analysis, it might be necessary to create a taint tag for each bit of data. The number of different types of taint tags can also vary. For some analyses, having a single taint tag type for every piece of tainted data to indicate "this is untrusted data" is enough. For others, having a different type of taint tag for each taint source is necessary. The most extreme case would be a unique taint tag for every bit of untrusted data that enters the system, although this setup would require a great deal of shadow memory to implement.

Figure 2 shows an example of how taint propagates using taint tags. In this example, each piece of data is a single byte, and the taint tag associated with that data is also a single byte. An operation is about to be performed that will add the data located at address 0x1000 to the data located at address 0x1002. The result will be stored at address 0x1000. The data at address 0x1002 is tainted because there is a taint tag in the shadow memory location associated with address 0x1002. Once the addition operation completes and the result is stored at 0x1000, the taint tag in shadow memory propagates from address 0x1002 to address 0x1000 because the tainted data at 0x1002 has influenced the data at 0x1000. The data held at 0x1000 is now tainted, and any operation using this data will result in the taint tag propagating further.

Figure 2: The states of data memory and its corresponding shadow memory before and after an addition operation.

The propagation of tainted data from taint source to a taint sink reveals a lot of information about how malware compromises systems. Not only can the source of malicious data be determined, but any copies of that data are also known by looking for the appropriate taint tags. In fact, this form of taint analysis has become so useful for security that the Perl, PHP, Lua, Python, and Ruby languages all have built-in support for tainting. Security-conscious developers that use these languages have been using tainting to double-check the safety of their software for years.

A good example of a process-level dynamic analysis tool is Intel's Pin [6]. Pin is a framework for the binary instrumentation of processes on x86 platforms. It attaches to a running process, much like the GDB debugger does, and then it steps forward through the execution of the program one assembly instruction at a time. Custom plugins for Pin, called "Pintools," are developed in C/C++ by using libraries and build scripts that are distributed with Pin. Pintools implement instrumentation callbacks that trigger before the execution of each instruction as the binary runs. This approach allows for the creation of analysis tools that are used for tasks like instruction logging, profiling, taint analysis, and verifying that system call arguments are valid.

When creating tools that perform system-level dynamic analysis, adding instrumentation logic to VM hypervisors (Xen, VMware) or whole-system emulators (QEMU, Bochs) offers several advantages. It is dangerous to execute analysis software inside of the same environment where malware executes because the malware can potentially detect the analysis software and then attack it or hide from it. By using the virtualization capabilities of VMs to isolate the analysis logic from the malware, the analysis becomes tamperproof. Using a VM also gives the benefit of being able to observe the state of CPU registers, stack and heap memories, and peripherals at any time during the analysis.

Brewing with DECAF

A full-featured, full-system dynamic analysis platform is the Dynamic Executable Code Analysis Framework (DECAF) [7]. DECAF is open source software that is under active development by both researchers and hobbyists. Using the QEMU full-system emulator [8] at its core, DECAF boasts a rich set of analysis features including a plugin API for creating custom analysis tools, lower execution overhead than most other full-system analysis platforms, and the ability to execute and analyze newer OS environments like Windows 7 and 8.

Figure 3 shows the overall architecture of DECAF. DECAF executes the guest OS being analyzed within its guest VM. An emulated CPU and memory management unit (MMU) are used by the guest VM. The hardware devices that the guest OS uses, (e.g., hard drive, NIC, keyboard, and mouse) are emulated in software by DECAF and presented to the guest VM as hardware devices. From the guest OS's point of view, it is running on real hardware and is completely unaware that it is being monitored.

Figure 3: The overall architecture of the DECAF system.

The DECAF system architecture also shows some of the shadow memories that are used to store taint tags. All of these shadow memories are located outside the guest VM to avoid taint tag tampering by the guest OS. The MMU of the guest VM uses a shadow RAM to hold taint tags for any tainted data located in the VM's physical RAM. Some peripheral devices, such as the NIC, have a shadow buffer that shadows peripheral internal transmit/receive buffers.

Secondary storage has its own shadow buffer to mark which data on disk is tainted. Even if the reading and writing of tainted data from/to files is not of interest in a particular full-system analysis, maintaining this shadow buffer is very important because of OS memory paging. If a page of physical RAM is swapped out to disk by the kernel, the taint tags associated with that data must be copied into the secondary storage shadow buffer. When swapping data from disk back to physical RAM, the associated taint tags must be copied back into shadow RAM. Otherwise, taint information will be lost when physical RAM is swapped out to disk.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

comments powered by Disqus

Direct Download

Read full article as PDF:

Price $2.95

News