The aim of this project is to provide an example of a static code analyser and to remind people of the basics of creating languages.
The analyser has been implemented in Scala for several reasons:
- Scala natively supports Java libraries, including ANTLR, which is one of the best-known tools for implementing programming languages.
- Scala supports algebraic data types and has powerful pattern matching capabilities, which makes it very easy to define a language and perform analyses on it.
- It provides a very good developer experience because there are built-in Gradle plugins for both Scala and ANTLR.
- Scala can treat errors as values using monads (here is a great video on the subject), which makes error management much clearer.
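As a quick illustration of the last two points, here is a minimal sketch of treating errors as values with `Either` and an ADT; the names (`AnalyserError`, `safeDiv`, ...) are hypothetical and not taken from the project:

```scala
// An error type as a plain ADT, so errors can be pattern matched like any value.
sealed trait AnalyserError
case class DivisionByZero(line: Int) extends AnalyserError

object SafeOps {
  // Instead of throwing, return the error as a Left value.
  def checkNonZero(value: Int, line: Int): Either[AnalyserError, Int] =
    if (value == 0) Left(DivisionByZero(line)) else Right(value)

  // Errors compose with for-comprehensions instead of try/catch.
  def safeDiv(a: Int, b: Int, line: Int): Either[AnalyserError, Int] =
    for (d <- checkNonZero(b, line)) yield a / d
}
```

The caller can then match on `Right(v)` or `Left(err)` and is forced by the type to consider the failure case.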
The application requires:
- Java >= 17
You can download the application on the downloads page.
You can execute a program using the application by running the following command:
java -jar ExampleAnalyser-X.X.X-all.jar run <path-to-your-program>
You can run a zero analysis on a program using the application by running the following command:
java -jar ExampleAnalyser-X.X.X-all.jar zero-analysis <path-to-your-program>
The application requires:
- JDK >= 17
You can build the application using Gradle with the following commands:
gradlew.bat build (on Windows)
./gradlew build (on Linux/macOS)
The built application can be found at build/libs/ExampleAnalyser-X.X.X-all.jar.
You can run the project's tests using Gradle with the following commands:
gradlew.bat test (on Windows)
./gradlew test (on Linux/macOS)
The syntax of a programming language determines what constitutes a valid program in terms of text. In almost all cases, this syntax is defined using a context-free grammar. In the course and in this example, the grammars are formally defined using EBNF (extended Backus-Naur form) notation.
In this example, we will use the following grammar to define the syntax of our language:
By using this grammar, it is possible to check whether a text follows a certain format, and therefore to verify the syntax of the programming language we want. However, this grammar cannot be directly converted into code that automatically checks whether text follows the syntax of our language. This is why we usually use tools such as ANTLR to do it for us. Even though this involves rewriting the grammar in a format that the tool supports, using ANTLR saves a lot of time by generating code that can recognise a language automatically.

You can find the grammar described above in the format supported by ANTLR if you go there. If you open it, you will see that it looks very similar to our grammar in EBNF notation. The main difference is that the file is divided into two parts: the lexer part and the parser part. The lexer is responsible for converting text into tokens and removing unnecessary characters. The order of the rules written here is important because the lexer will use the first rule that matches the text to create the tokens. This is why the keywords are above the identifiers; without this, ANTLR would not be able to recognise the keywords properly. Then there is the parser, which is responsible for converting the sequence of tokens created by the lexer into a CST (Concrete Syntax Tree). This CST supports the Visitor design pattern, which makes it easy to extract the information we want.
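To make the lexer/parser split concrete, here is a small hypothetical ANTLR-style fragment; it is illustrative only and is not the project's actual grammar file:

```antlr
grammar Example;            // illustrative only, not the project's .g4 file

// Parser rules (lowercase) build the CST from the token stream.
statement  : type IDENTIFIER ';' ;
type       : INT | BOOL ;

// Lexer rules (uppercase) turn text into tokens. Keywords must come before
// IDENTIFIER, otherwise 'int' would be matched as an identifier.
INT        : 'int' ;
BOOL       : 'bool' ;
IDENTIFIER : [a-zA-Z_] [a-zA-Z0-9_]* ;
WS         : [ \t\r\n]+ -> skip ;   // unnecessary characters are dropped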
Here is an example of how some code can be converted into a CST using ANTLR and our grammar:
int a;
a = {
int b;
b = -a + b - 7;
3 * 4 / b
};
As you can see, just 6 lines of code already produce a fairly large CST (if you are interested, this CST was generated using the ANTLR extension for IntelliJ). In addition, the tree has a strange shape because of the rules on expressions. Even if this seems odd, it is actually intentional: it enforces the order of operations directly at the syntax level, something that can be found in almost all programming languages (see the Python grammar for an example).
Once the syntactic analysis has been performed using ANTLR, it is then possible to perform a semantic analysis. The semantics of a programming language define what has a meaning, i.e. which operations make sense and which don't, what happens when an instruction is executed, and so on. For example, does it make sense to assign a boolean expression to a variable of type int, given that the syntax allows it? In the case of our language, we would rather display an error to the user to indicate that it doesn't make sense, because there is a mismatch between the type of the variable and the type of the expression. That's why we need to formally define the semantics of our language.
The semantics of our language are defined by the rules of inference below:
with:
- The notation $< Var >$ corresponds to the set of variables in the program.
- The symbol $\oplus$ corresponds to the operators $+$, $-$, $*$, $/$, $<$, $>$, $<=$, $>=$, $!=$, $==$, $and$ and $or$, with their mathematical semantics.
- The symbol $\mathcal{E}$ corresponds to the set of all possible environments.
- The symbol $\sigma$, with $\sigma \in \mathcal{E}$ and $\sigma : < Var > \mapsto \mathbb{Z} \cup \{True, False\}$, corresponds to the environment of the function currently being executed.
- The symbol $\Sigma$, with $\Sigma = \langle \sigma_0, ..., \sigma_n \rangle$, corresponds to the execution stack, which is a sequence of environments.
- The notation $\Sigma \bullet \sigma$ splits the execution stack into the environment of the function currently being executed, $\sigma$, and the rest of the execution stack, $\Sigma$.
- The notation $(e, \Sigma \bullet \sigma) \leadsto v$ corresponds to the evaluation of an expression $e$ with respect to an environment $\sigma$ and the rest of the execution stack $\Sigma$, producing $v$ as a value.
- The notation $\sigma[x \mapsto v]$ corresponds to the update of the environment $\sigma$ with the fact that $v$ is associated to $x$.
- The notation $(I, \Sigma \bullet \sigma) \leadsto \Sigma \bullet \sigma'$ corresponds to executing an instruction $I$ with respect to a state $\Sigma \bullet \sigma$ and yielding a state $\Sigma \bullet \sigma'$.
- The notation $stdout(v)$ corresponds to the fact that the value $v$ has been printed to $stdout$.
With these semantic rules, it is now possible to perform our semantic analysis. In statically typed languages like the one in our example, this can be done statically using the Visitor design pattern and the CST defined earlier. If the programmer tries to perform an operation that doesn't make sense, an error is produced in the visitor and then displayed to the programmer. To help us in this step, it is usually a good idea to use a Symbol table to store the variables that have been defined and their type, potential functions (even if there aren't any in our case), and so on. The symbol table used in this project can be found here. In general, it is also during this step that we try to simplify our CST into an AST (Abstract Syntax Tree) that allows us to keep only the information we need and to create a set of data structures that are easier to use in the rest of our application. The code that performs the semantic analysis and creates the AST can be found here.
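The symbol table idea can be sketched as follows; this is a simplified, hypothetical version (a stack of scopes over mutable maps), not the project's actual implementation:

```scala
import scala.collection.mutable

// Illustrative types for what the table stores; the real project may track more.
sealed trait VarType
case object IntType extends VarType
case object BoolType extends VarType

class SymbolTable {
  // A stack of scopes, innermost first; each maps a variable name to its type.
  private var scopes: List[mutable.Map[String, VarType]] = List(mutable.Map())

  def enterScope(): Unit = scopes = mutable.Map[String, VarType]() :: scopes
  def exitScope(): Unit = scopes = scopes.tail

  // Returns false on a duplicate declaration in the current scope.
  def declare(name: String, t: VarType): Boolean =
    if (scopes.head.contains(name)) false
    else { scopes.head(name) = t; true }

  // Looks the name up from the innermost scope outwards.
  def lookup(name: String): Option[VarType] =
    scopes.collectFirst { case s if s.contains(name) => s(name) }
}
```

A type-mismatch check then becomes a comparison between `lookup(x)` and the type computed for the assigned expression.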
In our example, the AST is defined using Scala classes that allow easy pattern matching and the use of visitors. These include sequences, instructions, types and expressions. Regarding the visitors, there is one for each type of AST class, as this allows us to return different types depending on what we need and not necessarily have to visit everything every time. In addition, a data parameter has been added because we often need to pass data from one node to another, and this parameter makes that easier.
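Such an AST and visitor shape could look like the following sketch; all class and trait names here are illustrative, not the project's actual ones:

```scala
// A tiny expression ADT standing in for the project's AST classes.
sealed trait Expr
case class Const(value: Int) extends Expr
case class Var(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// Each visitor chooses its own result type R and threads a data parameter D
// from node to node, as described above.
trait ExprVisitor[D, R] {
  def visit(e: Expr, data: D): R
}

// Example visitor: collect variable names, the data parameter carrying the
// names accumulated so far.
object FreeVars extends ExprVisitor[Set[String], Set[String]] {
  def visit(e: Expr, data: Set[String]): Set[String] = e match {
    case Const(_)  => data
    case Var(name) => data + name
    case Add(l, r) => visit(r, visit(l, data))
  }
}
```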
Here is what the AST looks like after being built from the CST mentioned above:
With our AST and our visitors, it's very easy to create an interpreter for our language that will follow the semantics defined earlier. All we need to do is create a class that will represent the environment and then implement the different rules using our visitors.
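As a sketch of that idea, here is a minimal evaluator following the judgement $(e, \Sigma \bullet \sigma) \leadsto v$, simplified to a single environment $\sigma$ and integer values; the AST is a hypothetical stand-in for the project's:

```scala
sealed trait Exp
case class Num(n: Int) extends Exp
case class Id(x: String) extends Exp
case class Bin(op: String, l: Exp, r: Exp) extends Exp

object Interp {
  type Env = Map[String, Int] // one sigma; the full stack Sigma is omitted here

  // Each case mirrors one inference rule of the semantics.
  def eval(e: Exp, env: Env): Int = e match {
    case Num(n)         => n
    case Id(x)          => env(x) // an undefined variable is a semantic error
    case Bin("+", l, r) => eval(l, env) + eval(r, env)
    case Bin("-", l, r) => eval(l, env) - eval(r, env)
    case Bin("*", l, r) => eval(l, env) * eval(r, env)
    case Bin("/", l, r) => eval(l, env) / eval(r, env)
    case Bin(op, _, _)  => sys.error(s"unknown operator $op")
  }
}
```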
In this section, we will define a Zero Analysis, similar to the one used in the course, by using the Worklist algorithm implemented here. The file containing the entire analysis code is available here.
First, we need to define the set of abstract values:
with:
- $Bottom$ representing the fact that no value has been assigned yet.
- $Z$ representing zero.
- $NZ$ representing any value different from zero.
- $U$ representing the fact that it is not known whether the value is zero or different from zero.
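Assuming these four abstract values, the lattice and its least upper bound (join) could be sketched in Scala as follows; this is an illustration, not the analysis file linked above:

```scala
sealed trait AbsVal
case object Bottom extends AbsVal // no value assigned yet
case object Z extends AbsVal      // definitely zero
case object NZ extends AbsVal     // definitely non-zero
case object U extends AbsVal      // unknown: zero or non-zero

object AbsVal {
  // Least upper bound: Bottom is the identity, and any disagreement goes to U.
  def join(a: AbsVal, b: AbsVal): AbsVal = (a, b) match {
    case (Bottom, x)      => x
    case (x, Bottom)      => x
    case (x, y) if x == y => x
    case _                => U // e.g. Z join NZ, or anything join U
  }
}
```

The Worklist algorithm merges the abstract values flowing into a program point with exactly this kind of join.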
Next, we need to convert our AST into a CFG (Control-flow graph). The code that achieves this can be found here.
A CFG is composed of the following elements:
- $PRED(p)$: a function which returns the predecessors of the program point $p$.
- $SUCC(p)$: a function which returns the successors of the program point $p$.
- $COND(p, p')$: a function which returns a boolean expression that must be $True$ to go from the program point $p$ to the program point $p'$.
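A minimal encoding of these three components might look like this sketch, with integer program points and conditions kept as plain strings (both simplifications over the project's actual CFG code):

```scala
// Edges are (p, p') pairs; cond holds the optional guard on each edge.
case class Cfg(
  edges: Set[(Int, Int)],
  cond: Map[(Int, Int), String]
) {
  // PRED(p) and SUCC(p) fall out of the edge set directly.
  def pred(p: Int): Set[Int] = edges.collect { case (q, `p`) => q }
  def succ(p: Int): Set[Int] = edges.collect { case (`p`, q) => q }
}
```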
After that, we can define our abstract environment:
with:
- The notation $< Var >$ corresponds to the set of variables in the program.
Next, we can define our control-flow function by specifying its instances:
with:
These control-flow function instances are almost the same as the ones in the course; they were deliberately made less precise.
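To make the flavour of such instances concrete, here is a deliberately imprecise sketch of two transfer functions for the zero analysis; names are hypothetical (Zero, NonZero and Unknown stand for $Z$, $NZ$ and $U$), and the real instances cover more cases:

```scala
sealed trait V
case object Bot extends V
case object Zero extends V    // Z
case object NonZero extends V // NZ
case object Unknown extends V // U

object Transfer {
  type AbsEnv = Map[String, V]

  // x = c : a constant is either exactly zero or exactly non-zero.
  def assignConst(env: AbsEnv, x: String, c: Int): AbsEnv =
    env + (x -> (if (c == 0) Zero else NonZero))

  // x = y op z : imprecise on purpose, any arithmetic result may be anything.
  def assignOp(env: AbsEnv, x: String): AbsEnv = env + (x -> Unknown)
}
```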
Moreover, the Worklist algorithm also requires a function to update the abstract environments according to the boolean expressions in the CFG.
Here are the function instances that we will use in our analysis:
In addition, the
Finally, we can define processing rules in order to interpret the results of the analysis:
| Line (PP) | Instruction (I) | Condition (C) | Type (T) | Message (M) |
|---|---|---|---|---|
| p | x = y / z | | Error | "Division by zero detected !" |
| p | x = y / z | | Warning | "Possible division by zero !" |
These rules are the same as the ones in the course.