-
Notifications
You must be signed in to change notification settings - Fork 66
Extended PEG Syntax
As we saw in PEG Basics and Declaring a Grammar, Pegged implements the entire PEG syntax, exactly as it was defined by its author.
Now, I felt the need to extend this a little bit. At that time, semantic actions were not implemented in Pegged so now that they are, some of these extensions are not strictly necessary, but they are useful shortcuts.
Given a parsing expression e
, :e
will discard e
's parse result. The mother expression calling e
will forget e
result (that's deliberate). No node nor match will be in the final parse tree. It allows one to write:
mixin(grammar("
Rule1 <- Rule2 :Spaces Rule3
# Rest of grammar ...
"));
On the first rule, see the colon before Spaces
. That means that the parse tree will contain only a Rule2
and a Rule3
node, and no Spaces
node. It's a way to cut the parse tree down, so as to keep only the interesting parts, not the syntactic signs necessary to structure a grammar.
Note that Pegged automatically drops the rules that are outside a grammar (See Four Levels of Encapsulation for more thoughts on this). In particular, char and string literals terminals are always dropped unless Pegged is told otherwise (see next extension). That allows one to write:
JSON:
JSONObject < '{' (Pair (',' Pair)*)? '}'
Pair < String ':' Value
Array < '[' (Value (',' Value)* )? ']'
# Rest of JSON grammar...
And given an input such as {"Hello":42, "World":0}
, the parse tree will look like this:
JSON(
JSONObject(
Pair(
String (...),
Number (...)),
Pair(
String (...),
Number (...))))
And no '{' nor ':' will be in the final output.
Completly discarding a node and its matches is sometimes a bit extreme. Since I regularly needed to cut a node, but keep its matches in the parent node, I added the ;
operator (cousin to :
) to do just that. It's an intermediate between the default behavior and :
's behavior.
If you want to keep literals (or any other parsing expression), the standard way is to make it a rule, as it's done in the JSON grammar:
Value < String
/ Number
/ JSONObject
/ Array
/ True
/ False
/ Null
True <- "true"
False <- "false"
Null <- "null"
Here, "true"
and co are kept through the True
, False
and Null
rules.
Another way is the ^
operator, that tells Pegged to always keep a node, even when it should be dropped. It's a way to keep calls to external rules, like the predefined ones (identifier
for example) or the literals. For example, given the two grammars Gram1
and Gram2
:
Gram1:
A <- B*
B <- 'b'
Gram2:
A <- B*
B <- ^'b'
When given "bbb"
as input, Gram1
will return a parse tree containing only A
and B
nodes:
A ["b", "b", "b"]
B ["b"]
B ["b"]
B ["b"]
Whereas Gram2
will keep the literal child nodes:
A ["b", "b", "b"]
B ["b"]
literal!(b) ["b"]
B ["b"]
literal!(b) ["b"]
B ["b"]
literal!(b) ["b"]
It's useful when you want the parse tree structure to keep the syntactic structure, with all markers reproduced as nodes. Also, in a programming language, it allows one to keep keywords. For example:
ReturnStatement <- ^"return" Expression? ';'
Given "return i;"
, the previous rule will return a parse tree with "return" as a node, but the semicolon dropped.
ReturnStatement ["return", "i"]
literal!(return) ["return"]
Expression ["i"]
(... lots of intermediary nodes ...)
Note the subtile difference between
AddExpr < ^('+'/'-') Factor
and
AddExpr < (^'+'/^'-') Factor
In the former case the entire choice node (pegged.peg.or
behind the scenes) is kept in the parse tree. In the latter case, the or
node will be discarded and only the literal will be kept.
The ~
(tilde) operator concatenates an expression's matches in one string. It was chosen for its proximity with the equivalent D operator. It's useful when an expression would otherwise return a long list of individual matches, whereas you're interested only in the global result:
mixin(grammar("
Strings:
String <- doublequote ~(Char*) doublequote
Char <- !doublequote . # Anything but a double quote
"));
Without the tilde operator, using String
on a string would return a list of Char
results. With tilde, you get the string content, which is most probably what you want:
auto p = Strings(q{"Hello World!"});
assert(p.matches == ["Hello World!"];
// without tilde: p.matches == ["H", "e", "l", "l", "o", " ", "W", ...]
The same goes for number-recognizers:
Numbers:
Number <- ~(Digit+)
Digit <- [0-9]
auto n = Numbers("1234");
assert(n.matches == ["1234"]);
// without tilde: n.matches == ["1", "2", "3", "4"]
Internally, it's used by identifier
and qualifiedIdentifier
.
Semantic actions are enclosed in curly braces and put behind the expression they act upon:
XMLNode <- OpeningTag {OpeningAction} (Text / Node)* ClosingTag {ClosingAction}
You can use any delegate from ParseTree
to ParseTree
as a semantic action. See Semantic Actions.
The characters -
(dash), [
(opening square brackets) and ]
(closing square bracket) have special meaning in char ranges (the [a-z]
syntax). In Pegged they can be escaped with \
to represent themselves. As usual, \
is thus \\
. Use them like this:
-
[-+]
-
in first position is OK. Already possible in PEG. -
[+\-]
Matches+
and-
-
[-\\\]]
Matches-
,\
and]
-
[\r\n\t]
Standard escapes are there, too.
Also, the char semantics were extended to deal with Unicode characters (and not only UTF-8). Basic PEG allows a char to be represented as an ASCII octal sequence: \040
. Pegged also recognizes:
-
\x41
: hexadecimal UTF-8 chars. -
\u0041
: hexadecimal UTF-16 chars. -
\U00000041
: hexadecimal UTF-32 chars.
I added them to deal with the W3C XML spec.
I plan to use other symbols, such as =
, @
or $
but these are in flux right now and I'll wait for the design to stabilize before documenting them.
All the previously-described extensions act upon expressions. When you want an operator to act upon an entire rule, it's possible to enclose it between parenthesis:
Rule <- ~(complicated expression I want to fuse)
This need is common enough for Pegged to provide a shortcut: put the operator in the arrow:
<~
(squiggly arrow) concatenates the captures on the right-hand side of the arrow.
<:
(colon arrow) drops the entire rule result (useful to ignore comments, for example)
<^
(keep arrow) that calls the 'keep' operator to all subelements in a rule.
For example:
Number <~ Digit+
Digit <- [0-9]
# Nested comments
Comment <: "/*" (Comment / Text)* "*/"
Text <~ (!("/*"/"*/") .)* # Anything but begin/end markers
That makes Number
expression a bit more readable.
There is another kind of space-level rule, it's the 'space arrow': just using <
(lower-than followed by a space) as an arrow. Instead of then treating the parsing expression as a standard non-space-consuming PEG sequence, it will consume spaces before, between and after elements.
So, given:
Rule1 <- A B C
Rule2 < A B C
A <- 'a'
B <- 'b'
C <- 'c'
Rule1
will parse "abc"
but not "a b c"
, where Rule2
will parse the two inputs (and will output a parse tree holding only an A
node, a B
and a C
node: no space node. Rule2
will also parse " ab c "
up to d
.
You can select what pattern is considered as a blank: define a rule called Spacing
in your grammar. This is the rule that will be called by Pegged to consume spaces in this grammar, when the <
arrow is used. If no user-defined Spacing
is provided, Pegged uses the predefined spacing
rule that parses blank chars (whitespace, tabulation, etc).
Use Spacing
to represent what you want as a space: most of the time, that should be zero or more unit of space, like (' ' / '\t')*
. Thus, a standard Spacing rule is:
Spacing <- :(' ' / '\t' / '\r' / '\n' / '\r\n')*
The <
arrow will automatically discard the Spacing
parse nodes, considering they do not interest you.
Warning! Do not use the space arrow in the Spacing
rule definition itself.
By using a different Spacing
rule, you can represent different syntaxes:
Number1 <- digit digit digit
Number2 < digit digit digit
Spacing <- ('_' / ' ')*
The previous Number2
will match "123"
, but also "1_23"
or "12 __3__"
.
<
does not apply the spacing inside e*
or e+
. This is to avoid nasty bugs for a user caught unaware by space-consuming. If you need space-consumption inside a zero-or-more or one-or-more construct, put Spacing
explicitly inside the block:
Number1 <- digit+
Number2 < (digit Spacing)+
Spacing <- ('_' / blank)*
Number1
will match "123"
but not "1 23"
or "10_000"
. Number2
is nearer to what a D lexer would accept: "10_000"
is OK.
If you're using Pegged to parse a programming language with comments, a nice trick is to put a call to both whitechars and comments rules in Spacing
(see Pegged own grammar for an example). That way, you can put comments anywhere a space can be and they will be recognized and discarded while parsing.
It is possible to provide an extra /
at the beginning of alternation expressions:
Rule1 <- / 'a'
Rule2 <- / 'a' / 'b'
Rule3 <- (/ 'a' / 'b')
The lack of terminals or symbols at the beginning of the alternation will never match anything. In other words: Rule1
in the above example is actually equivalent to this:
Rule1 <- 'a'
This is provided for the syntactic convenience of being able to place an entire alternation expression in its own indentation level:
MultiLineAlternation <-
/ 'a'
/ 'b'
/ 'c'
The traditional PEG alternation /
implements prioritised choice, meaning that rules are evaluated left to right until input is matched. Any remaining rules, which also could produce a match, are ignored. Sometimes this is a problem. Consider the following constructed example:
Rule1 <- ( 'aaa' / 'aaab' ) 'cc'
When Rule1
is tried on the input aaabcc
it fails, because after aaa
matches, only cc
is tried on bcc
, which fails. The obvious solution here is to swap the literals in the alternation, so the longest literal is tried first. There are languages however where you don't know which option will produce the longest match. Then you would want to evaluate all rules in an alternation, compare the length of their matches and proceed with the option that consumes the most input. You can achieve this in Pegged, since v0.4.0, by using |
instead of /
:
Rule1 <- ( 'aaa' | 'aaab' ) 'cc'
This will successfully parse aaabcc
.
Some languages, such as HTML and Pascal, have case insensitive keywords. Appending an i
to the literal causes the input and the literal to be compared case-insensitively, using std.uni.icmp
. Thus, the rule
Keyword <- "writeln"i
will match writeln
, WriteLn
, WRITELN
, etc.
Parameterized rules are another PEG extension, described in Parameterized Rules.
Next lesson: Parameterized Rules