Skip to main content

Section 9.1 A Brief Overview of Syntactic Analysis

The purpose of syntax analysis is to check if the stream of input characters that constitute the input program adheres to the concrete syntax 26  of the programming language and, if that is the case, to construct the abstract syntax tree. The abstract syntax tree is the first data structure of a compiler that represents the program to be compiled. It is a so-called program representation. Because this text focuses on two compiler phases that operate on the abstract syntax tree, we do not cover syntax analysis in detail here but only give a very high-level overview.

In principle, syntax analysis consists of two steps: lexing and parsing. Lexing forms “words” out of the input characters and parsing checks if the stream of “words” is syntactically correct.

Lexing.

The first step in syntax analysis is lexing. The name lexer comes from the greek word lexis which means “word”. The job of the lexer is to form “words” from the sequence of input characters. These words are called tokens. Hence, the lexer translates the stream of input characters that constitute the program text into a stream of tokens which represent the “words” of the program. Each token consists of

  1. a lexical category (“identifier”, “addition sign”, “open parentheses”, etc.).

  2. the original text (if not apparent from the category)

  3. the source code coordinates (file, line, column) for error reports.

All other things (mostly whitespace and comments) are discarded by the lexer because they are not relevant for the syntactic correctness of the program. Figure 9.1.1 shows an example program with its corresponding token stream.

q = 0;
r = x;
while (y <= r) {
    r = r - y;
    q = q + 1;
}
ID("q") ASSIGN INT_CONST("0") SEMI
ID("r") ASSIGN ID("x") SEMI
WHILE LPAREN VAR("y") LE VAR("r") RPAREN LBRACE
ID("r") ASSIGN ID("r") MINUS ID("y") SEMI
ID("q") ASSIGN ID("q") PLUS INT_CONST("1") SEMI
RBRACE
Figure 9.1.1. An example program and its corresponding token stream. Each token consists of a name and, if important, the original text in the program that corresponds to the token.

The lexer can already detect some simple errors: If a sequence of characters cannot constitute a valid token of the programming language, the lexer can report this error and abort the compilation process. For example 123abc is not a valid token in C or Java because numbers cannot contain letters and identifiers must start with a non-numeral character. Lexers can however not detect structural errors such as 123 abc + - *. Each part of that text is a valid C token, however the way they are combined is not correct. It is the job of the parser to identify such errors.

Parsing.

The parser analyzes the token stream based on the definition of the concrete syntax and creates, if the program is syntactically correct, the abstract syntax tree. If the token stream does not adhere to the concrete syntax of the programming language, the parser reports (possibly multiple) error messages that point to the syntax error(s). In this text, we don't discuss concrete syntax analysis techniques and refer to standard compiler text books such as [16].

Implementing the AST.

The abstract syntax of C0 has two main categories: statements and expressions. As already discussed in Section 8.10, we create an interface per syntactic category (statement and expression) and subclasses for each syntactical element (while loop, if-then-else, add operator, occurrence of a variable, etc.). Our small compiler will consist of a type checker (Section 9.2) and a code generator (Section 9.3) which we will discuss in the following sections. Both operate on the abstract syntax tree and will therefore be implemented as methods to the respective AST classes.

public interface Statement {
  void checkType(...);
  void genCode(...);
}

public interface Expression {
  Type checkType(...);
  void genCode(...);
}
See also Section 6.1 for a somewhat more detailed discussion of the syntax of programming languages.