Skip to main content

Section 6.1 The Syntax of Programming Languages

First of all, we need to define what a C0 program actually is. When formalizing the syntax of a programming language, one distinguishes between two different kinds of syntaxes: The abstract syntax defines the structure of the language's programs. By structure we mean the way how one syntactical construct can be composed from other components (i.e. a while loop consists of a body and an expression). In the abstract syntax, each program corresponds to a tree, the so-called abstract syntax tree (AST). The nodes of the tree correspond to syntactical constructs and the edges denote their composition.

The concrete syntax defines how a program in a language can be encoded as a sequence of symbols (characters, digits, etc.). The concrete syntax is relevant for the programmers because they have to type in their programs somehow. When reasoning about programs formally, we focus on the abstract syntax because it captures the structure of the program and is not “contaminated” with syntactical artifacts like parentheses and other separators which are only meaningful in a linear (textual) representation of the program.

Remark 6.1.1. Syntax Analysis.

The process of obtaining the abstract syntax tree from a textual representation of the program is called syntax analysis. Syntax analysis is a core component of each interpreter or compiler and is usually is performed in two separate phases: The first phase is called lexing and combines subsequent characters into a stream of tokens. Each token stands for a lexical category such as “integer number”, “identifier”, “plus symbol”, and so on. Lexing accounts for the fact that the concrete characters such as the names of variables are not important for syntax analysis because they do not influence the structure of the program. Of course, lexing can already identify some ill-formed programs. For example, it is illegal to put the character sequence 123abc outside of a comment in a C program. Such errors are identified during lexing.

The second phase is called parsing. Parsing constructs the AST from the input token sequence or reports an error if the input token sequence does not lead to a program that adheres to the rules of the concrete syntax.

Remark 6.1.2. Parsing.

At this point you may wonder if every program text that is syntactically correct with respect to the concrete syntax actually has a unique abstract syntax tree that it corresponds to. The answer is that this depends on the definition of the concrete syntax. For example, if the concrete syntax is defined by the grammar \(e\to e+e \mathrel{|} e\ast e \mathrel{|} c\text{,}\) then 1 * 2 + 3 actually has two ASTs:

In this case we say that the concrete syntax is ambiguous. To make the concrete syntax in this case unambiguous, we need to add parentheses and encode operator precedence in the grammar.

The abstract syntax of a language is given by a syntax definition using a table like the following:

\begin{equation*} \begin{array}{rclll} \text{Category} \amp \amp \text{Abstract Syntax} \amp \text{Concrete Syntax} \amp \text{Description} \\ \hline \mathit{Expr}\ni e \amp \syndef \amp \mathtt{Add}[e_1,e_2] \amp e_1+e_2 \amp \text{Addition} \\ \amp | \amp \mathtt{Mul}[e_1,e_2] \amp e_1*e_2 \amp \text{Multiplication} \\ \amp | \amp \mathtt{Const}_c \amp c \amp \text{Constant} \\ \amp | \amp \mathtt{Var}_x \amp x \amp \text{Variable} \\ \end{array} \end{equation*}

The column “Category” defines a specific syntactic category (expression, statement, and so on). For each category there may be multiple rows. Each row defines one syntactic construct of the respective category. The column “Abstract Syntax” defines the structure for that specific construct: A piece of abstract syntax consists of

  • a label (such as \(\mathtt{Add}\)) that denotes the construct and is also used as the label of a node in an abstract syntax tree that represents the instance of such a construct in a concrete program.

  • a list of children given in brackets. These are given by variable names of specific categories, for example \(\mathtt{Add}[e_1, e_2]\) means that an \(\mathtt{Add}\) construct consists of two sub-expressions.

  • possibly some attributes given as subscripts to the label. These attributes constitute additional information for a piece of abstract syntax such as the concrete name of an identifier or the value of a constant. These are not important for the structure of the program but may be important later on to analyze, execute, or transform the program.

The column “Concrete Syntax” shows how the construct looks like in written program text. In both columns one can use variables which stand for other syntactic constructs with the same name (e.g. \(e\) for expressions). Each variable must be defined in the category column to indicate from which category a construct comes. The example above for instance defines a construct \(\mathtt{Add}\) which belongs to the category \(\Expr\) and stands for an addition expression and consists of two other constructs that are from the category \(\Expr\) as well.

Remark 6.1.4.

When writing down the abstract syntax of a program on paper, one often resorts to the concrete syntax again instead of drawing an AST because the textual rendition of the AST saves space and looks concise for humans. We will do the same for the rest of this chapter and write down the programs we discuss in concrete syntax although we are always concerned with their abstract syntax. Because we require the concrete syntax to be unambiguous, it is always clear which abstract syntax tree we are referring to.