This repository contains explores verified PEG parsing in PVS. It contains two main parts: static analysis of grammars (to ensure termination) and several variants of parser generators (memoization, semantic interpretation, tail-recursive), all shown equivalent to each other. A third part dedicated to applications and examples is still in-progress. This repository is linked to this CPP2020 paper
Peg grammars are unambiguous by default. However, they might not be complete, i.e. terminating on all inputs. A simple subset of complete grammars is wellformed grammars. The wf_peg
theory (in delta.pvs
) presents an implementation of a wellformedness check with the associated properties. The static_analysis
theory (in static_analysis.pvs
) refines the wellformedness check to allow a reordering of grammar nodes based on a topological ordering.
Several parser generators are defined, all shown equivalent:
parsing
is a straightforward, reference implementation (inparser.pvs
)packrat_parser
is a memoized version of the reference parser (inpackrat_parser.pvs
)sempp
is a memoized parser with semantic actions (insemantic_parser.pvs
)sempp_tl
is a memoized tail-recursive parser with semantic actions (insemantic_tlparser.pvs
)
- A small arithmetic expressions parser is defined in
ascii.pvs
. The semantic actions compute the result of the expression. - A small json parser is defined in
json.pvs
.
Both applications have a correct_grammar
theorem to show that they are wellformed
Technical definitions used for predicate manipulations
Here are defined a boolean triple nTinst
(short for non-terminal properties instance) to represent the grammar properties (can fail, can succeed without consuming something, can succeed consuming something) of a given grammar node. To store the properties of all the non-terminal interpretations, we use the nTprop
object. We define an order on both, and the usual properties : transitivity and reflexivity, plus a distributivity of the order on nTprop
over instances of nTinst
.
Technical definitions used for counting the number of times a predicate is satisfied in a nTprop
object
The aux
function is just a tail-recursive function that iterates over the array. Two main results are shown : the aux function is growing (for the nonTerminal ordering) and injective.
Main file for peg grammar definition, properties computation and wellformedness
This theory is parameterized by the type of the terminals, their ordering, and the number of nonterminals
The peg
datatype is made of constructors that directly represent the possible patterns of a peg grammar.
The pegMeasure
is a simple reduce_nat
. Three results are shown : the measure is growing and injective regarding the subterm relationship, and the subterm relationship is transitive.
This theory is parameterized by the same parameters as peg
.
This theory aims at expressing what a wellformed grammar is, based on properties calculation.
The grammar_props
function recursively computes the properties of a grammar node based on the already-known properties of the non-terminals (given as an argument). Then we can compute all accessible new properties, adding them as we go over all the non terminals with the recompute_nonTerminals_properties
function. The fix point of accessible properties is obtained with the compute_properties
function, that calls the previous one over and over again until no new properties are found. The obtained set of properties, of type fix_point
is coherent (not contradicted by iteself) and maximal (recomputing properties based on it cannot lead to new ones).
Two types of wellformedness are defined :
- complete : a completely-wellformed grammar node is a node where no subterm is of the form
star(e)
orplus(e)
withe
being able to succeed without consuming any input (thus looping forever). To say it in other words : a completely-wellformed grammar does not have structural loops - strong : a strongly-wellformed grammar node (relative to a non terminal A) is a node that only uses non terminals that are strictly less than A, unless a sequence is found, with the first argument not being able to succeed without consuming any input. Then the check switches to the complete wellformedness. To say it in other words : a strongly wellformed grammar is allowed to use higher non terminals only when we are sure that it consumed at least one character.
The grammar_wellformedness
function works for both kinds by switching the last argument. A wellformed set of non terminals (WF_nT
) is defined.
Definition of an abstract syntax tree (ast) structure
The datatype pre_ast
aims at capturing all parsing paths possible. A few structural conditions are already included in the constructors (for example, the leftmost branch correspond to a path that the parser has to explore, and thus cannot contain any skip node).
Main file for ast properties : definition of a meaningful and a wellformed ast
The astType
captures the fact that a tree is either a coherent proof of success or failure, or is incoherent (undefined). The check is done by the recursive astType?
function. A astMeaningful
tree is a tree that is either of type success
or failure
.
The astWellformed?
function recursively checks that the tree correspond to a valid proof of parse. Two results are shown : a wellformed tree is also meaningful, and a wellformed tree starting at s
and ending at e
as to verify s <= e
.
Technical definition of triple lexical ordering
Technical definition of quadruple lexical ordering
Main file for the reference parser generator definition
The argument of the parser are the following :
-
P_exp
: the set of non terminal interpretations, of typeWF_nT
(strongly wellformed) -
A
: the current non terminal -
G
: the current grammar node (as to be a subterm ofP_exp(A)
) -
inp
: the input array -
b
: the bound of the input -
s
: the current index -
s_T
: the index when the parse of the current nonTerminal started
The output type carries a lot of information. A returned tree T verifies :
s(T) = s
: it starts at the current indexe(T) <= b
: it ends before or at the bound- if G was a star node, then T is too
- if G was a plus node, then T is too
- T is not a skip (a skip does not represent a parsing result)
- if T is a success, and did not consume anything, it implies that G had the property
P_0
- if T is a success, and did consume something, it implies that G had the property
P_>0
- if T is a failure, it implies that G had the property
P_f
The parsing termination is proved using a quadruple lexicographic order :
- the parser goes down the grammar (
pegMeasure(G)
), unless it finds a star, a plus or a nonTerminal operator - if a star or a plus is found, then the grammar node stays constant, but at least one character must be consumed and thus,
s
increases. - if a nonTerminal node is found,
s
does not change, and the new grammar node might be anything. But, asP_exp
is strongly wellformed, we either have a non terminal strictly lower than the current one, or at least one character was consumed, sos
is greater thans_T
(so the recursive call withs_T
set ass
is legit)
Most of the proof work is done in the tccs.
Definition of a parser generator that uses memoization
The packrat_parser theory has a similar parser with a result that is a table mapping each start index and nonterminal to an AST that matches the one returned by the reference parser parsing, and an updated table.
Definition of the semantic interpretation of result tree
Definition of a packrat parser with semantic interpretation built-in
Tail-recursive version, written with continuations The proof trick is to have the continuation applied to the result of the parser being equal to the result of the reference parser
Wellformedness up-to reordering of non-terminal nodes