How compilers work
Meticulous Transformer
Compilers translate source code into executable programs and libraries. Inside modern compiler suites, a multistage process analyzes the source code, points out errors, generates intermediate code and tables, rearranges a large amount of data, and adapts the code to the target processor.
Below the surface, a black box compiler handles complex processes that require good knowledge of machine theory and formal languages. Given the importance of compilers, it is not surprising that compiler construction is standard curriculum for computer science students. If you have never been to a college-level lecture on compiler theory – or if you went to the lecture but need a refresher course – this article summarizes the basics. (See the box titled "Teaching Material for Compiler Construction.")
Teaching Material for Compiler Construction
When students of computer science need to learn what holds the programming language world together, the best bet is to attend a lecture on compiler construction. Mathematics professor Peter Köhler held a lecture on compilers in the 2016 summer term at the University of Giessen [1], Germany. With his permission, Linux Magazine worked through his notes and summarized the most important points for this article.
In simple terms, a compiler goes through three steps: It parses the source code, analyzes it, and synthesizes the finished program (Figure 1).
In the first step, the compiler parses the source code character by character and tries to identify keywords, variable names, numbers, strings, and all other components – including comments. The scanner takes care of this lexical analysis.
In the second phase, another component checks the syntax and semantics of the program. For example, if a programmer uses a variable that has not yet been declared, or if a semicolon at the end of an expression is missing, the compiler will discover the problem in this phase.
Modern compilers distinguish between syntactic and semantic analysis. Syntax is all about structure, whereas semantics is focused on meaning. The component of the compiler called the parser tries to detect the syntax elements. If this phase is successful, the parser calls the semantic routine.
This delineation makes it possible to optimize tasks and analyze the source code more efficiently.
However, syntactic and semantic analysis turns out to be a non-trivial task, as a simple example shows: the compiler for a dumb calculator needs to parse a text file with a calculation expression. The calculator can only add and multiply numbers, where multiplication/division takes priority over addition/subtraction. If in doubt, brackets specify the sequence. For example, the following is allowed,
(1+3)*(4+5)
but this is not:
(1+)*4
The compiler must understand which of the two expressions is valid and which is not.
For the compiler to identify that the second expression as faulty, the compiler builder first tells the compiler how the correct source code is structured. This information is conveyed through rule sets. The compiler checks whether the source code follows the rules defined in the rule sets. In the example, a single number is obviously a valid expression. If two x
and y
expressions exist, then x + y
, x * y
and (x)
are valid expressions.
The rules can also be written in shorthand. The Backus-Naur Form (BNF) is a very popular form (see Listing 1): The program (program
) consists of an expression (expr
). This expression, in turn, exists if it is either a term
or if it is a term
with the character +
and then another expression after. The last rule is necessary because numbers can consist of several digits (digits
).
Listing 1
Backus-Naur Form
These rules, known as production or derivation rules, form the grammar of the source language. Like English grammar, this grammar provides rules that build the programming language. Abbreviations, such as term
or factor
, are known as symbols. All symbols on the left-hand side are referred to as non-terminals, all others are terminals. Terminals would include digit
, +
, *
, )
, and (
.
One programming language may include several grammars. The choice would depend on the language and requirements. The grammar used in the preceding example can lead to infinitely long expressions and programs because some rules have a recursive structure. The grammar used in the example is considered a context-free grammar (or type 2). A grammar is considered regular if all rules follow one of the two forms:
U ::= t U ::= Vt
In this case, t
is a terminal character; U
and V
are non-terminal characters.
Top-Down Parser
To the delight of developers, a grammar can be used to create a compiler quite elegantly: For all non-terminals, the developer creates a function that takes care of the corresponding evaluation. If necessary, the function calls on other functions to help. For example, the function expr()
, which belongs to expr
, has a structure like the one briefly outlined in Listing 2.
According to the second rule in Listing 1, a correct expression such as 1+2
always starts with a term
. In Listing 2, the expr()
function includes term()
, which still needs to be implemented for an evaluation in the first step. In this example, term()
returns 1
.
Listing 2
expr()
§§nonumberint expr() { int value = term(); if (File.ReadNextChar() == '+') value += expr(); return value; }
In the second step, expr()
then looks at the next character in the source code. If the next character is a +
, the second rule of Listing 1 requires an expression to follow, which is why expr()
calls itself. The result is that expr()
returns 2
, which the compiler also adds to the previously computed intermediate result. The compiler would then have arrived at the end of a valid expression, which is why expr()
returns the calculated value in the example.
At this point, a mature compiler could generate a suitable instruction in machine code. The developer generates functions according to the same principle for all other non-terminals. If you then apply program()
to the source code file, you automatically get the translated program – in the example, this would be the calculated number. Because processing starts with program
, computer scientists also call this symbol the start symbol.
The calls to the individual functions can be displayed as a tree in this scenario: the syntax tree.
Bottom-Up Parsers
In the process described so far, the parser simply proceeds from the start symbol (in the example program
) and then tries to find appropriate grammar rules for each character that has been read. Such parsers are known as top-down parsers.
Other parsers are bottom-up parsers and proceed according to an opposite scheme: They try to replace the source code with grammar rules (reduction) until only the start symbol remains. In practice, such bottom-up parsers usually use an empty stack at the beginning, into which they dump all previously detected, non-terminal characters and all the characters they read. After each new character is read, they check whether the topmost elements on the stack match the right side of a grammar rule. If this is the case, parsers take the elements from the stack and place the non-terminals from the left side of the identified grammar rule on top. The whole process runs until the start symbol is left and the end of the source code is reached. If that doesn't work, there's probably a mistake. A bottom-up parser that works according to this pattern is also known as a shift-reduction parser.
The programming of the scanner and parser is a straightforward consequence of the grammar of the language. This process can also be automated: Tools such as Flex [2] or Bison [3] automatically generate the source code for matching scanners and parsers from the grammar (Figure 2).
Whenever the parser needs more information, it requests it from the scanner. In the example in Listing 2, the expr()
function would thus not read the next character itself; instead, File.ReadNextChar()
simply consults the scanner.
Scanner Internals: Machines
The source code of most programming languages consists of numbers, variable names, keywords, operators, and separators (delimiters). Names and keywords are composed of several letters and characters.
A finite-state machine helps the scanner detect int
, &&&
, and all other components of the language. Imagine it like a ticket vending machine: The customer throws in 20 cents, and the machine changes its status to "20 cents paid." If the customer inserts a dollar, the machine then changes the status to "120 cents paid". This continues until the machine reaches the "price paid in full" status. Similarly, you can make a machine that accepts letters from the source code. The machine switches each character read to a different status: If the scanner has read an i
, it changes to the state "i read"; if an n
is added, it changes to the state "n read." If the t
follows, the machine has finally read the whole term int
and changes its status to "int identified."
This machine has an initial status and at least one final status. The machine for detecting int
has detected the initial "no character read yet" status and the two final "int recognized" and "not the keyword int" statuses. When the final status occurs, computer scientists say the machine has accepted the word int
.
You can also envision a scenario in which boxes (nodes) represent the individual states (Figure 3). The start and end states are highlighted accordingly. Arrows (edges) indicate from which status the machine transitions to another state. The transition will only take place if the machine has read one of the characters written on the arrow.
A regular grammar can be used to construct not only a parser, but also a finite-state machine that accepts the language of the grammar: First, create a status for all non-terminals, plus the start and end statuses. Then, when it reads a character t
, it makes a transition from the s
state to the n
state if there is a grammar rule in the n ::= st
form. Incidentally, this process creates a bottom-up parser for the grammar.
In a simple scenario, you could read the next character and check, with if
queries or a switch
construction, which character it is and finally note, in the form of a variable, the new state to which the machine has changed.
Once the scanner has identified the next element in the source code (such as the keyword int
), it replaces the element with a symbol and returns it to the parser. The aforementioned Flex tool also uses the strategy with finite-state machines.
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Rhino Linux Announces Latest "Quick Update"
If you prefer your Linux distribution to be of the rolling type, Rhino Linux delivers a beautiful and reliable experience.
-
Plasma Desktop Will Soon Ask for Donations
The next iteration of Plasma has reached the soft feature freeze for the 6.2 version and includes a feature that could be divisive.
-
Linux Market Share Hits New High
For the first time, the Linux market share has reached a new high for desktops, and the trend looks like it will continue.
-
LibreOffice 24.8 Delivers New Features
LibreOffice is often considered the de facto standard office suite for the Linux operating system.
-
Deepin 23 Offers Wayland Support and New AI Tool
Deepin has been considered one of the most beautiful desktop operating systems for a long time and the arrival of version 23 has bolstered that reputation.
-
CachyOS Adds Support for System76's COSMIC Desktop
The August 2024 release of CachyOS includes support for the COSMIC desktop as well as some important bits for video.
-
Linux Foundation Adopts OMI to Foster Ethical LLMs
The Open Model Initiative hopes to create community LLMs that rival proprietary models but avoid restrictive licensing that limits usage.
-
Ubuntu 24.10 to Include the Latest Linux Kernel
Ubuntu users have grown accustomed to their favorite distribution shipping with a kernel that's not quite as up-to-date as other distros but that changes with 24.10.
-
Plasma Desktop 6.1.4 Release Includes Improvements and Bug Fixes
The latest release from the KDE team improves the KWin window and composite managers and plenty of fixes.
-
Manjaro Team Tests Immutable Version of its Arch-Based Distribution
If you're a fan of immutable operating systems, you'll be thrilled to know that the Manjaro team is working on an immutable spin that is now available for testing.