How compilers work

Meticulous Transformer

Article from Issue 207/2018
Author(s):

Compilers translate source code into executable programs and libraries. Inside modern compiler suites, a multistage process analyzes the source code, points out errors, generates intermediate code and tables, rearranges a large amount of data, and adapts the code to the target processor.

Below the surface, a black box compiler handles complex processes that require good knowledge of machine theory and formal languages. Given the importance of compilers, it is not surprising that compiler construction is standard curriculum for computer science students. If you have never been to a college-level lecture on compiler theory – or if you went to the lecture but need a refresher course – this article summarizes the basics. (See the box titled "Teaching Material for Compiler Construction.")

Teaching Material for Compiler Construction

When students of computer science need to learn what holds the programming language world together, the best bet is to attend a lecture on compiler construction. Mathematics professor Peter Köhler held a lecture on compilers in the 2016 summer term at the University of Giessen [1], Germany. With his permission, Linux Magazine worked through his notes and summarized the most important points for this article.

In simple terms, a compiler goes through three steps: It parses the source code, analyzes it, and synthesizes the finished program (Figure 1).

Figure 1: Rough structure of a compiler: parse code, analyze it, and create an executable program.

In the first step, the compiler parses the source code character by character and tries to identify keywords, variable names, numbers, strings, and all other components – including comments. The scanner takes care of this lexical analysis.

In the second phase, another component checks the syntax and semantics of the program. For example, if a programmer uses a variable that has not yet been declared, or if a semicolon at the end of an expression is missing, the compiler will discover the problem in this phase.

Modern compilers distinguish between syntactic and semantic analysis. Syntax is all about structure, whereas semantics is focused on meaning. The component of the compiler called the parser tries to detect the syntax elements. If this phase is successful, the parser calls the semantic routine.

This delineation makes it possible to optimize tasks and analyze the source code more efficiently.

However, syntactic and semantic analysis turns out to be a non-trivial task, as a simple example shows: the compiler for a dumb calculator needs to parse a text file with a calculation expression. The calculator can only add and multiply numbers, where multiplication/division takes priority over addition/subtraction. If in doubt, brackets specify the sequence. For example, the following is allowed,

(1+3)*(4+5)

but this is not:

(1+)*4

The compiler must understand which of the two expressions is valid and which is not.

For the compiler to identify that the second expression as faulty, the compiler builder first tells the compiler how the correct source code is structured. This information is conveyed through rule sets. The compiler checks whether the source code follows the rules defined in the rule sets. In the example, a single number is obviously a valid expression. If two x and y expressions exist, then x + y, x * y and (x) are valid expressions.

The rules can also be written in shorthand. The Backus-Naur Form (BNF) is a very popular form (see Listing 1): The program (program) consists of an expression (expr). This expression, in turn, exists if it is either a term or if it is a term with the character + and then another expression after. The last rule is necessary because numbers can consist of several digits (digits).

Listing 1

Backus-Naur Form

 

These rules, known as production or derivation rules, form the grammar of the source language. Like English grammar, this grammar provides rules that build the programming language. Abbreviations, such as term or factor, are known as symbols. All symbols on the left-hand side are referred to as non-terminals, all others are terminals. Terminals would include digit, +, *, ), and (.

One programming language may include several grammars. The choice would depend on the language and requirements. The grammar used in the preceding example can lead to infinitely long expressions and programs because some rules have a recursive structure. The grammar used in the example is considered a context-free grammar (or type 2). A grammar is considered regular if all rules follow one of the two forms:

U ::= t
U ::= Vt

In this case, t is a terminal character; U and V are non-terminal characters.

Top-Down Parser

To the delight of developers, a grammar can be used to create a compiler quite elegantly: For all non-terminals, the developer creates a function that takes care of the corresponding evaluation. If necessary, the function calls on other functions to help. For example, the function expr(), which belongs to expr, has a structure like the one briefly outlined in Listing 2.

According to the second rule in Listing 1, a correct expression such as 1+2 always starts with a term. In Listing 2, the expr() function includes term(), which still needs to be implemented for an evaluation in the first step. In this example, term() returns 1.

Listing 2

expr()

§§nonumberint expr() {
        int value = term();
        if (File.ReadNextChar() == '+') value += expr();
        return value;
}

In the second step, expr() then looks at the next character in the source code. If the next character is a +, the second rule of Listing 1 requires an expression to follow, which is why expr() calls itself. The result is that expr() returns 2, which the compiler also adds to the previously computed intermediate result. The compiler would then have arrived at the end of a valid expression, which is why expr() returns the calculated value in the example.

At this point, a mature compiler could generate a suitable instruction in machine code. The developer generates functions according to the same principle for all other non-terminals. If you then apply program() to the source code file, you automatically get the translated program – in the example, this would be the calculated number. Because processing starts with program, computer scientists also call this symbol the start symbol.

The calls to the individual functions can be displayed as a tree in this scenario: the syntax tree.

Bottom-Up Parsers

In the process described so far, the parser simply proceeds from the start symbol (in the example program) and then tries to find appropriate grammar rules for each character that has been read. Such parsers are known as top-down parsers.

Other parsers are bottom-up parsers and proceed according to an opposite scheme: They try to replace the source code with grammar rules (reduction) until only the start symbol remains. In practice, such bottom-up parsers usually use an empty stack at the beginning, into which they dump all previously detected, non-terminal characters and all the characters they read. After each new character is read, they check whether the topmost elements on the stack match the right side of a grammar rule. If this is the case, parsers take the elements from the stack and place the non-terminals from the left side of the identified grammar rule on top. The whole process runs until the start symbol is left and the end of the source code is reached. If that doesn't work, there's probably a mistake. A bottom-up parser that works according to this pattern is also known as a shift-reduction parser.

The programming of the scanner and parser is a straightforward consequence of the grammar of the language. This process can also be automated: Tools such as Flex [2] or Bison [3] automatically generate the source code for matching scanners and parsers from the grammar (Figure 2).

Figure 2: Bison is a well-known parser generator for the Linux environment (from the Bison Tutorial by Lan Gao (http://alumni.cs.ucr.edu/~lgao/teaching/bison.html).

Whenever the parser needs more information, it requests it from the scanner. In the example in Listing 2, the expr() function would thus not read the next character itself; instead, File.ReadNextChar() simply consults the scanner.

Scanner Internals: Machines

The source code of most programming languages consists of numbers, variable names, keywords, operators, and separators (delimiters). Names and keywords are composed of several letters and characters.

A finite-state machine helps the scanner detect int, &&&, and all other components of the language. Imagine it like a ticket vending machine: The customer throws in 20 cents, and the machine changes its status to "20 cents paid." If the customer inserts a dollar, the machine then changes the status to "120 cents paid". This continues until the machine reaches the "price paid in full" status. Similarly, you can make a machine that accepts letters from the source code. The machine switches each character read to a different status: If the scanner has read an i, it changes to the state "i read"; if an n is added, it changes to the state "n read." If the t follows, the machine has finally read the whole term int and changes its status to "int identified."

This machine has an initial status and at least one final status. The machine for detecting int has detected the initial "no character read yet" status and the two final "int recognized" and "not the keyword int" statuses. When the final status occurs, computer scientists say the machine has accepted the word int.

You can also envision a scenario in which boxes (nodes) represent the individual states (Figure 3). The start and end states are highlighted accordingly. Arrows (edges) indicate from which status the machine transitions to another state. The transition will only take place if the machine has read one of the characters written on the arrow.

Figure 3: The machine changes status as soon as it encounters a character, digit, or symbol.

A regular grammar can be used to construct not only a parser, but also a finite-state machine that accepts the language of the grammar: First, create a status for all non-terminals, plus the start and end statuses. Then, when it reads a character t, it makes a transition from the s state to the n state if there is a grammar rule in the n ::= st form. Incidentally, this process creates a bottom-up parser for the grammar.

In a simple scenario, you could read the next character and check, with if queries or a switch construction, which character it is and finally note, in the form of a variable, the new state to which the machine has changed.

Once the scanner has identified the next element in the source code (such as the keyword int ), it replaces the element with a symbol and returns it to the parser. The aforementioned Flex tool also uses the strategy with finite-state machines.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Perl: Parser

    Lexers and parsers aren’t only for the long-bearded gurus. We’ll show you how you can build a parser for your own custom applications.

  • Parrot

    Parrot is an all-in-one tool for developing and executing new programming languages. Perl 6 runs on Parrot; chances are your language can run on it, too.

  • Kconfig Deep Dive

    The Kconfig configuration system makes it easy to configure and customize the Linux kernel. But how does it work? We'll take a deep dive inside Kconfig.

  • Fuzz Testing

    Fuzzing is an important method for finding bugs and security vulnerabilities in software. Read on to find out what fuzzing is and which methods are commonly used today.

  • Oil Shell

    With its innovative scripting language, Oil, the Bash-compatible Oil shell aims to make life easier for script developers.

comments powered by Disqus