Skip to content

1. Specification File

Paul edited this page Oct 1, 2019 · 5 revisions

Syntax

The syntax of specification file is similar to yacc. You should first declare several <headers> and then define <rules> which present the context-free grammar (CFG) of your language. The EBNF is shown as follows:

<spec>    ::= <headers> %% <rules>

<headers> ::= <header>*

<header>  ::= <package> | <imports> | <class> | <tokens> | <start> | <sem>

<package> ::= %package <pkg>

<imports> ::= %import <pkg>*

<class>   ::= %class (public | abstract | final)* class <ident> <ident>*

<tokens>  ::= %tokens <token>*

<token>   ::= '<char>' | <ident>

<start>   ::= %start <ident>

<sem>     ::= %sem <ident>

<rules>   ::= <rule>*

<rule>    ::= <ident> : <right>*| [;]

<right>   ::= <term>* [{ <code> }]

<term>    ::= '<char>' | <ident>

<pkg>     ::= [a-zA-Z0-9_.*]+ 

<ident>   ::= [_A-Za-z][_A-Za-z0-9]*

<char>    ::= .

Note that for terminals <pkg> presenting a valid Java package name, <ident> presenting a valid identifier, and <char> presenting a signal character, are defined with regular expressions. For non-terminals, special notations are

  • s1 | s2 means either s1 or s2,
  • s* means repeat s any times,
  • [s] means s is optional,
  • s*, means repeat s any times with separator ,,
  • and similarly s*| means repeat s any times with separator |.

Any other symbols are literals that should be seen in the specification file.

Headers

You have to declare these six headers:

<header>  ::= <package> | <imports> | <class> | <tokens> | <start> | <sem>
  • <package> declares which package the parser will be located in.
  • <imports> declares all dependent packages that should be imported into the parser.
  • <class> declares a Java class which will be the generated parser.
  • <tokens> declares all terminals of the CFG.
  • <start> declares the starting symbols of the CFG.
  • <sem> declares the name of the semantic value class, will be discussed later.

Consider the following header declaration from demo arith:

%package arith
%import
arith.Expr
arith.Expr.*
arith.error.CompileError
java.util.*

%class public class Parser extends BaseParser
%sem SemValue
%start E

%tokens NUM '+' '*' '(' ')'

Our tool generates the parser class:

package arith; // corresponding to `<package>`

// corresponding to `<imports>`
import arith.Expr;
import arith.Expr.*;
import arith.error.CompileError;
import java.util.*;

public class Parser extends BaseParser { // corresponding to `<class>`
    ...
    // corresponding to `<tokens>`
    public static final int NUM = 257;
    ...
}

The correlations between the specification declarations and the generated Java code are shown in comments. For <tokens>, since we present all signal-character tokens with its ASCII code (0~256 are reserved for these tokens), we only need to decode the identifier tokens into integers (start from 257). On this circumstance, NUM is the only identifier token. In this manner, we can map each token into an integer.

Rules

To describe the syntax of your target language, you have to design a CFG. The definition of rules are similar to the productions in CFG. Consider the CFG of arith:

E  -> T E1
E1 -> + T E1 | <empty>
T  -> F T1
T1 -> * F T1 | <empty>
F  -> (E) | a

To define the rule for E1 -> + T E1 | <empty>, simply write

E1  :   '+' T E1
    |   /* empty */
    ;

Here, : is used instead of -> in CFG. The right-hand side is a sequence of terms. Terms that are already declared in <tokens> are terminals, while other undeclared terms are non-terminals. For each non-terminal, it must have a definition in the specification file. You shouldn't use any undefined term like F1 which has no definition in the CFG above. Consider '+' T E1, '+' is a terminal declared in <tokens>, T is defined with T -> F T1 and E1 is recursively defined. Thus, this rule is valid.

Similar to yacc, a crucial thing is that when some tokens are consumed by successfully applying some rule, user-defined actions should be executed to yield some result. Under an usual condition, the user wants to store the parsing result in an abstract syntax tree (AST). For instance,

E1  :   '+' T E1
        {
            $$.terms.add(new Term(Expr.ADD, $2.expr));
            $$.terms.addAll($3.terms);
        }
    |   /* empty */
    ;

when rule E1 -> '+' T E1 is applied, actions

$$.terms.add(new Term(Expr.ADD, $2.expr));
$$.terms.addAll($3.terms);

are executed. For each term in the rule E1 -> '+' T E1, we label it with a semantic value object: $$ for the left-hand side E1, and $k for the k-th right-hand side terms, i.e., $1 for '+', $2 for T, and $3 for the right-hand side E1. A common situation is that you read some data from right-hand side semantic values ($2.expr, $3.terms) and you store some data into the left-hand side semantic value ($$.terms.add, $$.terms.addAll). Later, the parsed result for the left-hand side non-terminal are available to access.

The action block surrounded by a pair of {} are just Java code excluding the special notation for semantic value object $$, $1, .... Thereby, it is up to you to inject any black magic. In the generation phase, these action codes are directly embedded into the parser functions. Our tool never compiles these action codes, we leave it to the java complier after the potentially faulty parser is generated.

To have a better understanding of how actions are exactly performed, take a closer look at

in project arith.

Clone this wiki locally