How compilers work

Symbol Table

To store the source code compactly in memory, the scanner replaces each recognized keyword, each variable name, and every other element with a symbol. A rather long variable name, for example, can be replaced with two numbers: The first number marks the symbol as a variable name. The second number specifies the row in a table, the symbol table, which contains the variable name used by the developer. During syntax and semantic analysis, additional information, such as the type of the variable, is added to the symbol table.
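The replacement of names by number pairs can be sketched in a few lines of Python. This is only an illustration: the token kinds (KEYWORD, IDENTIFIER, and so on), the keyword list, and the function name scan are assumptions made for this example, and the sketch treats the first number of each pair as the kind of symbol and the second as the row in the symbol table.

```python
import re

# Hypothetical token kinds; a real compiler uses its own numbering.
KEYWORD, IDENTIFIER, NUMBER, OPERATOR = 1, 2, 3, 4
KEYWORDS = ["if", "while", "for"]

# One token per match: a number, a name, or any other single character.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(\S))")

def scan(source, symbol_table):
    """Turn source text into (kind, value) pairs and fill symbol_table."""
    symbols = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:  # only trailing whitespace is left
            break
        pos = match.end()
        number, name, other = match.groups()
        if number is not None:
            symbols.append((NUMBER, int(number)))
        elif name is not None and name in KEYWORDS:
            symbols.append((KEYWORD, KEYWORDS.index(name)))
        elif name is not None:
            if name not in symbol_table:
                symbol_table.append(name)  # new row for a new name
            symbols.append((IDENTIFIER, symbol_table.index(name)))
        else:
            symbols.append((OPERATOR, other))
    return symbols

table = []
print(scan("total_price = total_price + tax_rate", table))
print(table)
```

However long the name total_price is, it shows up in the symbol stream only as the pair (2, 0); the name itself is stored once, in row 0 of the table.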

To find an entry in the symbol table as quickly as possible, many compilers run the variable name through a hash function that can be calculated quickly. This step spits out a number, which the compiler then uses as an index in the symbol table. If the compiler encounters the variable name again later on, it simply calculates the hash value and immediately gets the location in the symbol table with all important information about the variable – such as its type. The compiler does not have to search through the entire table.
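A minimal sketch of such a hashed symbol table might look like the following. The class name SymbolTable, the table size, and the simple polynomial hash are assumptions chosen for illustration. Because two different names can produce the same hash value, the sketch keeps a short list of entries per index and compares names within it.

```python
class SymbolTable:
    """Hash table with chaining; size and hash function are illustrative."""

    def __init__(self, size=64):
        self.buckets = [[] for _ in range(size)]

    def _index(self, name):
        # Simple polynomial hash over the characters of the name.
        value = 0
        for char in name:
            value = (value * 31 + ord(char)) % len(self.buckets)
        return value

    def declare(self, name, **info):
        """Add a name, or attach more information to an existing entry."""
        bucket = self.buckets[self._index(name)]
        for entry in bucket:
            if entry["name"] == name:
                entry.update(info)
                return entry
        entry = {"name": name, **info}
        bucket.append(entry)
        return entry

    def lookup(self, name):
        """Return the entry for name, or None if it was never declared."""
        for entry in self.buckets[self._index(name)]:
            if entry["name"] == name:
                return entry
        return None

symbols = SymbolTable()
symbols.declare("counter")              # scanner: only the name is known
symbols.declare("counter", type="int")  # semantic analysis adds the type
print(symbols.lookup("counter"))
```

A lookup only inspects the few entries that share one index, so its cost does not grow with the total number of names as long as the hash function spreads them evenly.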

Both the scanner and the semantic routine generate a number of tables in addition to the symbol table. Among other things, these tables contain the nesting structure for the loops and the loop variables. The compiler repeatedly accesses the information in the tables, even later on.

Interpreter, Assembler, and Translator

Unlike the compiler, an interpreter reads the source code and executes it directly; thus, no object code is generated. The classic interpreters analyze each command in the source code one after another.

Modern interpreters convert the complete source code into a special optimized internal representation. The interpreter then executes this intermediate or byte code much faster. Sometimes a just-in-time compiler translates the internal representation into machine language, which further increases the execution speed. Java uses this procedure.
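Python's reference implementation works along these lines and can serve as an illustration. The following sketch uses the built-in compile function and the standard dis module to show the byte code for a one-line program before running it; the file name "<example>" and the variable values are arbitrary choices for this example.

```python
import dis

source = "a = a + b"
code = compile(source, "<example>", "exec")  # source text -> byte code object

dis.dis(code)  # list the byte code instructions (they vary by Python version)

namespace = {"a": 2, "b": 3}
exec(code, namespace)  # the interpreter executes the byte code
print(namespace["a"])
```

The instruction listing differs between Python versions, which is a reminder that byte code is an internal format rather than a stable interface.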

An assembler is a special form of compiler that translates a program written in assembly language into machine language. Because assembly language is usually a symbolic representation of the machine instructions, the two languages are similar. The generic term translator usually refers to all three (i.e., compilers, assemblers, and interpreters).

Internal Representation of the Source Code

The scanner passes the symbols that it has determined to the parser. The syntax and semantic analysis ends with an internal representation of the source code. What this representation looks like depends on the compiler and the language. The program could be present as a syntax tree or in Polish notation. Many compilers also use quadruples, each consisting of an operator, two operands, and a result. For example, the statement A = B + A would yield:

+,      B,      A,      T1
=,      T1,     ,       A

T1 is a temporary variable created by the compiler.
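How a compiler might arrive at such quadruples can be sketched as follows. As a shortcut, the sketch borrows Python's own ast module as scanner and parser, so it only handles statements that are also valid Python; the function name to_quadruples and the use of None for the unused operand are assumptions for this example.

```python
import ast

def to_quadruples(statement):
    """Translate a simple assignment into (operator, arg1, arg2, result)."""
    quads = []
    temp_count = 0
    operators = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

    def emit(node):
        """Return the name holding the value of node, emitting quadruples."""
        nonlocal temp_count
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return str(node.value)
        if isinstance(node, ast.BinOp):
            left = emit(node.left)
            right = emit(node.right)
            temp_count += 1
            temp = f"T{temp_count}"  # new temporary for this result
            quads.append((operators[type(node.op)], left, right, temp))
            return temp
        raise ValueError("unsupported expression")

    assign = ast.parse(statement).body[0]
    if not isinstance(assign, ast.Assign):
        raise ValueError("expected an assignment")
    value = emit(assign.value)
    quads.append(("=", value, None, assign.targets[0].id))
    return quads

for quad in to_quadruples("A = B + A"):
    print(quad)
```

Each nested operation gets its own temporary, so a longer statement such as X = A + B * C produces one quadruple per operator plus the final assignment.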

So far, the compiler has only analyzed the source program. For this reason, experts also refer to this first phase as the analysis phase.

Generating Code

In the next step, another component optimizes the internal representation. As a rule, the compiler optimizes for run time and assigns memory locations to the variables. In the preceding example, the compiler would try to eliminate the temporary variable T1.
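One simple way to eliminate such a temporary is to merge an operation with the assignment that directly follows it. The sketch below works on quadruples written as (operator, arg1, arg2, result) tuples. It assumes that temporaries are named T followed by digits and that each temporary is read exactly once, immediately after it is written; real optimizers have to verify this with a data-flow analysis.

```python
def is_temporary(name):
    """Assumption: compiler temporaries are named T1, T2, ..."""
    return isinstance(name, str) and name[:1] == "T" and name[1:].isdigit()

def eliminate_temporaries(quads):
    """Fold 'op a b -> Tn' followed by '= Tn -> x' into 'op a b -> x'."""
    optimized = []
    for quad in quads:
        operator, arg1, arg2, result = quad
        previous = optimized[-1] if optimized else None
        if (operator == "=" and previous is not None
                and is_temporary(arg1) and previous[3] == arg1):
            # Let the previous operation write straight into the target.
            optimized[-1] = (previous[0], previous[1], previous[2], result)
        else:
            optimized.append(quad)
    return optimized

before = [("+", "B", "A", "T1"), ("=", "T1", None, "A")]
print(eliminate_temporaries(before))
```

For the example from the text, the two quadruples collapse into a single addition that stores its result directly in A.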

In the last phase, the compiler finally generates the executable machine code. Generally, programmers call this object code or simply code. Under Linux, it is usually either a (dynamic) library or the executable program.

Some compilers also produce assembler code, which is then converted into machine language by a downstream assembler. The compiler could generate the following code from A = A + B:

lda a   ; load a in the accumulator
add b   ; add b to the accumulator
sto a   ; store accumulator to a
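A code generator for this kind of accumulator machine can be sketched in a few lines. The sketch takes quadruples in the form (operator, arg1, arg2, result) and only knows the three instructions shown above (lda, add, sto); the function name generate and the lowercase operand names are assumptions for this example.

```python
def generate(quads):
    """Emit accumulator-style assembler for (op, arg1, arg2, result) tuples."""
    mnemonics = {"+": "add"}  # only the instruction used in the article
    lines = []
    for operator, arg1, arg2, result in quads:
        lines.append(f"lda {arg1.lower()}")          # load first operand
        if operator != "=":
            lines.append(f"{mnemonics[operator]} {arg2.lower()}")
        lines.append(f"sto {result.lower()}")        # store the result
    return lines

print("\n".join(generate([("+", "A", "B", "A")])))
```

A plain assignment needs no arithmetic instruction, so the sketch emits just a load and a store for it.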

Control structures such as if, while, and for can usually be mapped to the jump instructions of the processor. More complex loops, such as for, can optionally be replaced by a (longer) while loop.
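The mapping of loops to jumps can be sketched as follows. The pseudo-instructions jump and jump_if_false, the label names, and the function names are hypothetical and do not belong to any real instruction set; conditions and statements are passed through as plain text. The second function shows a for loop expressed as a while loop with an initialization in front and a step at the end of the body.

```python
def lower_while(condition, body, n=1):
    """Return pseudo-assembler lines for 'while condition: body'."""
    start, end = f"loop{n}", f"end{n}"
    return [
        f"{start}:",
        f"    jump_if_false {condition}, {end}",  # leave the loop
        *[f"    {line}" for line in body],
        f"    jump {start}",                      # back to the test
        f"{end}:",
    ]

def lower_for(init, condition, step, body, n=1):
    """Express a for loop as init + while loop with the step appended."""
    return [f"    {init}"] + lower_while(condition, body + [step], n)

for line in lower_for("i = 0", "i < 10", "i = i + 1", ["sum = sum + i"]):
    print(line)
```

Both loop forms end up as one conditional jump out of the loop and one unconditional jump back to the test.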

Because it knows the processor's instruction set and the information from the tables, the compiler can also make the code more compact. The stack is used for function calls: Before a function starts, the code generated by the compiler pushes the function's arguments and the return address onto the stack. The processor then executes the function. Finally, the stack has to be cleaned up; current processors provide special instructions to support the compiler in this task.
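The call sequence can be imitated with a Python list standing in for the stack. This is a simplified model and not how any particular processor or calling convention works: the function names call and add, the order in which values are pushed, and the return address 100 are assumptions for this example.

```python
def call(stack, function, arguments, return_address):
    """Mimic a function call on a list that is used as the stack."""
    for argument in arguments:       # push the arguments
        stack.append(argument)
    stack.append(return_address)     # push the return address
    result = function(stack)         # the function reads the stack
    resume_at = stack.pop()          # pop the return address
    del stack[len(stack) - len(arguments):]  # clean up the arguments
    return result, resume_at

def add(stack):
    # stack[-1] is the return address; the two arguments sit below it.
    return stack[-3] + stack[-2]

stack = []
print(call(stack, add, [2, 3], return_address=100))
print(stack)
```

After the call, the stack is as it was before, which is the cleanup step described above.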
