DEPRECATED: writing parsers by hand wins for all of my use cases, so I don't plan to maintain this generic parser generator library.
Collection of various utilities related to parsing, ranging from a statically typed wrapper on top of scanf to a full parser generator, automatic tree-sitter wrappers etc.
The pure nim parser generator supports EBNF notation, tree actions, template rules, compile/runtime parsing and automatic parse tree generation.
The tree-sitter wrapper generator is currently WIP, but it can already generate user-friendly interfaces for existing grammars.
nimble install hparse
The hts-wrapgen command-line utility comes installed with hparse, but it needs additional dependencies in order to actually generate grammars. For official installation instructions see tree-sitter's getting started page.
On Arch Linux tree-sitter-cli is available and can be installed using sudo pacman -S tree-sitter. I couldn't find any conclusive instructions for installing tree-sitter-cli on other distributions, so the following is the most general approach that 'should work'.
If you are already familiar with npm, have it installed etc., you can skip this part. For people like me who have no idea how to set it up:
sudo apt-get -qq install -y npm
export NPM_PACKAGES=$HOME/.npm-packages
npm config set prefix $NPM_PACKAGES
export PATH="$PATH:$NPM_PACKAGES/bin"
npm install tree-sitter-cli -g
cd /tmp
wget https://github.com/tree-sitter/tree-sitter/archive/0.17.3.tar.gz
tar -xvf 0.17.3.tar.gz && cd tree-sitter-0.17.3
sudo make install
export LD_LIBRARY_PATH=/usr/local/lib
NOTE: the last line might not be necessary, but when testing the installation in a debian container I ran into issues where the tree-sitter library would not be detected without it.
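To verify that the runtime library itself is visible to the linker, a tiny nim program can be used (a sketch of my own, assuming default library paths; ts_parser_new is part of tree-sitter's public C API):

{.passl: "-ltree-sitter".}
# Declare one symbol from the tree-sitter C API; if this compiles,
# links and prints 'true', the runtime library is installed correctly.
proc ts_parser_new(): pointer {.importc, cdecl.}
echo ts_parser_new() != nil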
After all necessary dependencies have been installed, you can generate a wrapper for an arbitrary grammar file using hts-wrapgen grammar <arguments>. The most commonly used switches are --grammarJs=<grammar file.js> to set the input grammar file and --langPrefix=<language-name> to set the language name prefix. You can also use --parserUser=nimCode.nim to immediately compile a test nim file. It is not necessary to always use hts-wrapgen when compiling code that uses a generated parser, but it has some built-in heuristics that provide better error messages for common errors (mostly related to linking).
hts-wrapgen generates a wrapper file for your language, named <language-name>_wrapper.nim. The wrapper consists of several distinct wrapper types, an enum listing all possible node kinds, and several helper procs. The generated API is made to resemble that of std/macros as closely as possible - len, indexation via [], a kind() proc and so on.
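For illustration, walking a generated tree then looks just like walking a std/macros AST. This is a sketch of my own, written generically so it works with any node type exposing the len/[]/kind API described above:

import std/strutils

proc dumpTree[N](node: N, depth: int = 0) =
  echo repeat("  ", depth), node.kind() # value of the generated node-kind enum
  for i in 0 ..< node.len:              # same len/`[]` iteration as std/macros
    dumpTree(node[i], depth + 1)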
By default hts-wrapgen compiles the generated parser source into a static library, which will then be saved as <language-name>_parser.o alongside your grammar file.
To build the final application you need to link it with the tree-sitter library. To do this from the command line, use --passl:-ltree-sitter, or put the {.passl: "-ltree-sitter".} pragma in your code.
- Input grammar file, usually called grammar.js. Note that it is not mandatory to name it like this - it is perfectly fine to have cpp-grammar.js, for example. Grammar file example, tutorial on writing grammars.
To compile a tree-sitter grammar and generate wrappers, run
hts-wrapgen grammarFromFile \
  --grammar=your-grammar.js \
  --lang=language
Create a temporary directory for your project - during compilation a lot of auxiliary files are created, most of which are not needed later on.
For purposes of demonstration the cpp grammar will be used. You can get all the necessary files by either manually downloading grammar.js and scanner.cc from the tree-sitter-cpp repository on github, or by using wget:
wget https://raw.githubusercontent.com/tree-sitter/tree-sitter-cpp/master/src/scanner.cc
wget https://raw.githubusercontent.com/tree-sitter/tree-sitter-cpp/master/grammar.js
To compile the cpp grammar, tree-sitter needs to have the tree-sitter-c js module installed. It can be obtained using npm install tree-sitter-c.
- tree-sitter runtime - The parser generated by tree-sitter is not fully standalone, and your program needs to be linked against the tree-sitter library. If the linker fails with undefined reference to `ts_parser_new' (or a reference to a similar function), most likely you need to pass linker flags, via either {.passl: "-ltree-sitter".} in your code or --passl:-ltree-sitter on the command line.
- external scanners - External scanners allow you to write custom C code which runs during the lexing process in order to handle lexical rules that cannot be described by regular expressions. These functions are written in a separate scanner.c file (example for C++ parser) that has to be compiled and then linked with the final application. If you get errors like undefined reference to `tree_sitter_cpp_external_scanner_destroy' (note the word 'external'), this is an indication that you must also link the compiled scanner code, for example using {.passl: "cppscanner.o".} (note: the name of the linked object file will differ depending on your language name).
- c++ stdlib - Some scanner implementations might be written in C++ and therefore depend on the C++ runtime to operate. If you get compilation errors like undefined reference to `std::__cxx11::basic_ then you need to pass {.passl: "-lstdc++".}
- parser runtime - All other linking errors are most likely related to missing linking with the compiled parser and can be solved by adding {.passl: "cppparser.o".} (note: the name of the linked object file will differ depending on your language name); see the combined example below.
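Putting it all together: a file that uses a generated cpp wrapper would typically carry linking pragmas along these lines (a sketch - the object file names depend on your language name, and the last two lines are only needed when the grammar has an external scanner, respectively one written in C++):

{.passl: "-ltree-sitter".}  # tree-sitter runtime
{.passl: "cppparser.o".}    # compiled parser
{.passl: "cppscanner.o".}   # compiled external scanner
{.passl: "-lstdc++".}       # C++ stdlib, for scanners written in C++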
Some of the tree-sitter grammars can be tried out online in an interactive playground.
The wrapper generates a heterogeneous AST with a defined [] operator, len and kind procs, which means it can be used with pattern matching. Until I finish the implementation you can import it from hmisc/macros/matching and use it like
case tree[0]:
of Declaration[@dtype, .._]:
echo "first is declaration with type ", dtype.strVal()
It requires enabling {.experimental: "caseStmtMacros".}
- Generate tree-sitter grammar files using EBNF notation from the pure nim parser generator. The grammar is already available at runtime as a value, so it is a simple matter of converting it into javascript code.
- Implement custom scanners in nim (see the sketch below). You already compile nim code to C, so why spend 10x the effort writing scanners in C when you can just do the same in nim? It is already possible to do this, but the process could be streamlined even more.
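A rough sketch of the second idea, with hypothetical names: tree-sitter looks up external scanner entry points named tree_sitter_<language>_external_scanner_* with C linkage, and nim can export such symbols directly. A real scanner would also need the scan/serialize/deserialize entry points and actual state:

proc scannerCreate(): pointer {.exportc: "tree_sitter_mylang_external_scanner_create", cdecl.} =
  result = nil # would allocate scanner state here

proc scannerDestroy(payload: pointer) {.exportc: "tree_sitter_mylang_external_scanner_destroy", cdecl.} =
  discard # would free scanner state here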
Statically typed wrapper on top of scanf; supports all matcher syntax (e.g. $w, $i etc.) as well as custom matcher procedures. A tuple ml is injected in the scope, with the following types for each capture:
- i, o, b, h: int
- f: float
- *, +: string
- ${matcherProc}: if the matcher proc has signature (s: string, arg: var T, start: int): int then the tuple field will be of type T
import hparse/tscanf
func matcher1(s: string, arg: var seq[string], start: int): int =
# ^^^^^^^^^^^
# type of the captured variable
arg = @["##", "$$"]
return s.len - start
if tscanf("12x12---%1,1,1,1", "$ix$i$+%${matcher1}"):
echo ml[0] / 2, " = ", ml[2]," ", ml[3]
assert declared(ml)
assert type(ml[3]) is seq[string]
# ^^^^^^^^^^^
# Resulting field type in tuple
else:
assert not declared(ml)
echo "does not match"
6.0 = --- @["##", "$$"]
A simple addition to the parseutils library from the stdlib - splits a string into tokens based on character sets.
import hparse/tokenize
import hpprint
type
LispPart = enum
lpPunct
lpQuoted
lpIdent
lpIntLit
pprint "(hello '(world) 22)".tokenize({
{'(', ')'} : lpPunct,
{'0'..'9'} : lpIntLit,
{'\'', 'a'..'z', 'A'..'Z', ')', '('} : lpQuoted,
{'a'..'z', 'A'..'Z'} : lpIdent
})
- (lpPunct, "(")
- (lpQuoted, "hello")
- (lpQuoted, "'(world)")
- (lpIntLit, "22")
- (lpPunct, ")")
Lisp notation for regex description - a reimplementation of the emacs-lisp rx macro.
(print (rx (and "a" "E") (or "()" "{}")))
aE\(?:()\|{}\)
The parser generator focuses on simplicity and ease of use. Concrete implementation details of particular parser algorithms are hidden as much as possible - you write a grammar, provide input tokens, and get a tree. The whole API can be described as
let grammar = makeGrammar:
# grammar definition
let parser = new<Algorithm-name>Parser(grammar)
var stream = # create token stream
let tree = parser.parse(stream)
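For example, here is a concrete instantiation assembled from the calls used in the examples below (backtracking LL(*) parser over string tokens; imports omitted, as in the other examples):

const defaultCategory = catNoCategory

initGrammarConst[NoCategory, string](grammar):
  A ::= *("hello" & "world")

let parser = newLLStarParser[NoCategory, string, void](grammar)
var stream = @["hello", "world"].makeTokens().makeStream()
let tree = parser.parse(stream)
echo tree.treeRepr()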
The grammar is described using EBNF notation, the only exception being the use of prefix notation - e.g. for zero-or-more you write *E.
Example of a very simple grammar:
const defaultCategory = catNoCategory
let grammar = initGrammar[NoCategory, string]:
A ::= *("hello" & "world")
A more complex example (with the resulting tree):
exampleGrammarConst(grammar):
List ::= !"[" & Elements & !"]"
Elements ::= Element & @*(@(!"," & Element))
Element ::= "i" | List
let parser = exampleParser(grammar)
var stream = "[i,i,[i,i,i],i]".mapIt($it).makeTokens().makeStream()
let tree = parser.parse(stream)
echo tree.treeRepr()
+-> List
    +-> Elements
        +-> Element +-> 'i'
        +-> Element +-> 'i'
        +-> Element
        |   +-> List
        |       +-> Elements
        |           +-> Element +-> 'i'
        |           +-> Element +-> 'i'
        |           +-> Element +-> 'i'
        +-> Element +-> 'i'
Note: <string> means a string literal.
- * - zero-or-more
- + - one-or-more
- ? - optional
- & - concatenation
- | - alternative
- Nonterminal ::= ... - declare a new nonterminal. Identifier must be uppercased.
- <string> - token literal. Default category is used.
- <string>.prCat or <string>.cat - token literal with lexeme <string> and category prCat. The prefix is automatically inferred on grammar construction and can be omitted.
- [[ expr ]] - token with lexeme predicate.
- [ ... ] - option
- ! - drop
- @ - splice-discard
- ^ - promote
- ^@ - splice-promote
- !* - drop zero-or-more elements
- !+ - drop one-or-more
- @+ - splice one-or-more
- @* - splice zero-or-more
- !? - drop optional
- ^? - promote optional
Invalid combinations: *!, +!, *@, +@, *^@, +^@, +^, *^
The result of the parser generator is a parse tree - a direct representation of the original source code that contains all helper symbols (punctuation, brackets, precedence levels etc.). All of this cruft is necessary to correctly recognize the input sequence of tokens, but completely irrelevant afterwards - in the nested list grammar only Elements are actually necessary, everything else can be thrown away immediately. Tree actions are intended for this exact purpose - dropping unnecessary parts of the parse tree, flattening out nested parts etc. Right now there are five types of tree actions (four of them implemented).
Completely remove subtree element
echo ecompare(@["a", "b", "c"]) do:
A ::= "a" & "b" & "c"
do:
A ::= "a" & !"b" & "c"
+-> A            +-> A
    +-> 'a'          +-> 'a'
    +-> 'b'          +-> 'c'
    +-> 'c'
Add subnode elements into the parent tree; the subtree head is removed.
echo ecompare(@["-", "+", "+", "+", "-"]) do:
A ::= "-" & *"+" & "-"
do:
A ::= "-" & @*"+" & "-"
+-> A                  +-> A
    +-> '-'                +-> '-'
    +-> [ [ ... ] ]        +-> '+'
    |   +-> '+'            +-> '+'
    |   +-> '+'            +-> '+'
    |   +-> '+'            +-> '-'
    +-> '-'
Splice all node elements and replace the parent node. NOTE: this replaces only the parent node - in an expression like E ::= A & B the parent node for B is the concatenation, not the nonterminal head.
echo ecompare(@["-", "+", "+", "+"]) do:
A ::= "-" & B
B ::= *"+"
do:
A ::= "-" & ^@B
B ::= *"+"
+-> A            +-> A
    +-> '-'          +-> B
    +-> B                +-> '-'
        +-> '+'          +-> '+'
        +-> '+'          +-> '+'
        +-> '+'          +-> '+'
Move part of the tree into a separate list.
echo ecompare(@["-", "z", "e"]) do:
A ::= "-" & "z" & "e"
do:
A ::= "-" & { "z" & "e" }
+-> A            +-> A
    +-> '-'          +-> '-'
    +-> 'z'          +-> [ [ ... ] ]
    +-> 'e'              +-> 'z'
                         +-> 'e'
Some patterns occur in grammar construction over and over - lists with delimiters, kv pairs etc. Even though the grammar is pretty simple, writing something like Element & @*(@(!"," & Element)) again and again is not really fun. Parse templates are designed to solve this issue.
A parse template is a function that will be executed to produce part of the pattern. In this example we generate a template rule for a comma-separated list of strings.
proc csvList(str: string): Patt[NoCategory, string] =
andP(
makeExpNoCat(str).tok(),
zeroP(andP(
makeExpNoCat(",").tok().addAction(taDrop),
makeExpNoCat(str).tok()
).addAction(taSpliceDiscard)
).addAction(taSpliceDiscard))
echo csvList("@").exprRepr()
echo eparse(@["@", ",", "@"], A ::= %csvList("@"))
{'@' & @*(@{!',' & '@'})}
+-> A
    +-> '@'
    +-> '@'
DSL syntax is %functionName(..<list-of-arguments>..). For codegen-based parsers (recursive LL(1) and LL(*)) the function MUST be executable at compile time. In all other cases grammar construction happens at runtime. In the example above the LL(*) parser was used.
A token has three generic parameters, referred to as C, L and I throughout the codebase (a small sketch follows this list).
- The first one is the 'category' of the token. It is expected (but not mandatory) to be an enum. Categories are usually things like punctuation, identifier, string/int literal etc. If you don't need a token category, use the NoCategory enum.
- The second parameter is the 'lexeme'. It can be absolutely anything (void included). This field stores 'all other' information about the token - the integer/string value for literals, for example.
- The last parameter is the 'information'. Similar to the lexeme, but made for storing additional 'metainformation' about the token: position in source code, order in the original token stream etc. This information is NOT used in parsing.
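As an illustration (a sketch of my own - the enum and object are made-up names), a tokenizer for a lisp-like language might instantiate the three parameters as:

type
  LispCat = enum        # C - token category
    lcPunct, lcIdent, lcIntLit
  LispInfo = object     # I - metainformation, not used in parsing
    line, col: int

# L is the lexeme payload; plain strings are enough here, so tokens
# would have the type Token[LispCat, string, LispInfo].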
For an example of a custom token category/lexeme see manual.org.
A token is accepted if the lexeme predicate evaluates to 'true'. The predicate is placed in double square braces - [[ expr ]]. Depending on the syntax of the expression, different actions are performed:
- if it is an Infix, Call or DotExpr (e.g. it in ["a", "B"], startsWith(it, "..")), the whole expression is wrapped into a predicate function proc(it: L): bool {.noSideEffect.} = <your-expression>.
- otherwise it is passed to makeExpTokenPredUsr(cat: C, val: <your-expression-type>)
import strutils, strformat
const defaultCategory = catNoCategory
func makeExpTokenPredUsr(
cat: NoCategory, valset: bool): ExpectedToken[NoCategory, string] =
result = makeExpTokenPred[NoCategory, string](
catNoCategory, # Expected token category
&"[{valset}]", # string representation of expected token predicate
# (for pretty-printing)
proc(str: string): bool = valset # Construct predicate yourself
)
initGrammarConst[NoCategory, string](grammar):
A ::= *(B | C)
B ::= [[ it.startsWith("@") ]]
# ^^^^^^^^^^^^^^^^^^
# |
# Copied to predicate directly
C ::= [[ true ]] # Fallback nonterminal
# ^^^^
# |
# Passed to `makeExpTokenPredUsr`
let parser = newLLStarParser[NoCategory, string, void](grammar)
var stream = @["@ident", "#comment", "@ident"].makeTokens().makeStream()
let tree = parser.parse(stream)
echo tree.treeRepr()
A large part of the design is described in devnotes.org; all functions and types are documented in the source code. If you have any additional questions, feel free to join my discord server and ask them there.
I'm not an expert on parsing algorithms and related things, so I tried to design the library in a way that actually abstracts the details and makes the API easy to understand.
Not supporting syntactic predicates allows the use of multiple parsing algorithms for the same grammar, ranging from the restrictive but fast LL(1) to something like an earley parser.
The parser abstracts the notion of a token and is not tied to any lexer implementation - if you want, you can just split a string on spaces and call that a lexer. Or you can do some heuristics in the lexer and assign categories based on context. Or something else entirely.
The whole grammar is available as a value, which means it is possible to easily do all sorts of preprocessing and error detection (use of undeclared nonterminals, left recursion detection and so on).
Tree actions and template rules provide a small, but hopefully useful, subset of syntactic actions. The advantage: it is possible to know exactly how the tree will look. Generating a statically typed case object for a grammar is possible.
The parser generator was originally intended to work in conjunction with a term rewriting system: you write the grammar in EBNF notation, drop all cruft immediately (using splice-discard and drop rules) and then declaratively transform the tree into something else.
The parser generator is currently work-in-progress. All advertised features are implemented, but the number of supported algorithms is lacking - only the backtracking LL(*) parser is fully supported. Codegen-based and table-driven LL(1) are partially supported (they have some weird bugs). Some work has been done on adding SLR and Earley parsers.
The parser generator has a relatively clean and documented internal API, designed to make implementation of new algorithms as simple as possible (most of the details are abstracted).
All sorts of contributions are welcome - issues, unit tests, documentation updates etc.
In addition there are several things that I wasn't able to implement myself. If you are interested in solving one of these problems, that would be especially useful.
If you have any questions about implementation details, the API etc., you can join my discord server.
WARNING: another issue I ran into - when actually running the compiled & linked library in a container, it just dies with the following error, and I have no idea how to fix it:
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:440
440 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
Only recursive descent parsers can accept EBNF notation as-is. Every other one requires conversion from EBNF to BNF (implemented, tested). The problem is that this transformation changes the shape of the parse tree. For example A ::= *(E) is converted to A ::= E1 and E1 ::= Ɛ | E E1 - iteration is replaced with recursion.
const defaultCategory = catNoCategory
initGrammarConst[NoCategory, string](grammar):
A ::= "hello" & *(B) & "world"
B ::= "!!"
var toks = @[
"hello", "!!", "!!", "!!", "world"].makeTokens().makeStream()
let grammarVal =
block:
let tmp = grammar
tmp.toGrammar()
echo "Original grammar"
echo grammarVal.exprRepr()
echo "---\n"
echo "Grammar converter to BNF"
echo grammarVal.toBNF().exprRepr()
echo "---\n"
echo "Recursive descent tree"
let parser1 = newLLStarParser[NoCategory, string, void](grammar)
let tree1 = parser1.parse(toks)
echo tree1.treeRepr()
echo "---\n"
toks.revertTo(0)
echo "Table-driven parser tree without structure fixup"
let parser2 = newLL1TableParser(
grammarVal,
dofixup = false,
retainGenerated = true
)
let tree2 = parser2.parse(toks)
echo tree2.treeRepr()
echo "---\n"
toks.revertTo(0)
echo "Table-driven parser tree with fixup"
let parser3 = newLL1TableParser(grammarVal, dofixup = true)
let tree3 = parser3.parse(toks)
echo tree3.treeRepr()
echo "---\n"
Original grammar
A ::= {'hello' & *(<B>) & 'world'}
B ::= '!!'
---

Grammar converted to BNF
A ::= .0 | 'hello' & <A0_1> & 'world'
B ::= .0 | '!!'
A0_1 ::= .0 | Ɛ
         .1 | <B> & <@A0_1>
---

Recursive descent tree
+-> A
    +-> 'hello'
    +-> [ [ ... ] ]
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    +-> 'world'
---

Table-driven parser tree without structure fixup
+-> A
    +-> 'hello'
    +-> A0_1
    |   +-> B +-> '!!'
    |   +-> A0_1
    |       +-> B +-> '!!'
    |       +-> A0_1
    |           +-> B +-> '!!'
    +-> 'world'
---

Table-driven parser tree with fixup
+-> A
    +-> 'hello'
    +-> [ [ ... ] ]
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    +-> 'world'
---
Instead of *(B) a new rule A0_1 is introduced, with two possible alternatives: either the empty production (Ɛ), or B followed by A0_1 again. How this conversion affects the parse tree can be seen in the output: instead of a simple list of elements you get a deeply nested tree of A0_1. This is fixed automatically when converting an EBNF grammar to BNF, by adding a 'splice' rule on every use of a newly generated pattern.
It kind of works (not really tested though), but I have yet to figure out how to preserve the original tree actions. For example, when converting something like @*(@{!',' & <Element>}) to BNF it gets flattened out, and it is not clear how to first splice things in !',' & <Element>, and then splice the result again.
- [ ] support `<token-literal>` in grammar
- [ ] generate errors on unknown nonterminals used in productions
- [ ] unit tests for nimscript and js
- [ ] error reporting. Right now it is basically non-existent
Right now the parse tree is 'stringly typed' - nonterminal heads are described using string and all subnodes are placed in the same subnodes: seq[ParseTree[...]].
The grammar DSL contains all the information necessary to construct a case object with a selector enum, as well as to order all fields (the LL(*) parser uses the constant grammar to generate a set of mutually recursive functions). Tree actions could provide almost all necessary information for field types and ordering.
Possible mapping from grammar to constructed object (a rough sketch follows the list):
- Nterm ::= ... -> of ptrNterm: <fields>
- E1 & E2 & E3 -> tuple[e1: <type-of-E1>, ... ]
- *E1 and +E1 -> seq[<type-of-E1>]
- ?E1 -> Optional[<type-of-E1>]
- E1 | E2 -> case idx: [<number-of-alternatives>], and each alternative gets its own field. Case objects can be nested, so this is not a problem.
- <token> -> tok: Token[...]
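A hand-written sketch of what this mapping could produce for the nested list grammar shown earlier (nothing like this is generated yet - all names are made up):

type
  PtreeKind = enum ptrList, ptrElements, ptrElement
  Ptree = ref object
    case kind: PtreeKind
    of ptrList:               # List ::= !"[" & Elements & !"]"
      elements: Ptree
    of ptrElements:           # Elements ::= Element & @*(...)
      elems: seq[Ptree]
    of ptrElement:            # Element ::= "i" | List
      case idx: range[0 .. 1]
      of 0: tok: string       # "i" token
      of 1: list: Ptree       # nested List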
There are several open questions related to possible use cases, ease of use etc.
- [ ] Deterministic and intuitive names for fields.
- [ ] How should fields be named? It is not possible to have same-named fields in nim case objects.
Right now ParseTree[C, L, I] is hardcoded into all parsers - I don't think it will be enough for all use cases.
- Make it possible to define a separate parse tree type for each grammar.
- Integration with nimtrs - construct a term instead of a parse tree and maybe run rewriting actions immediately.
I'm like, 40% sure that I'm not sure about what it is, but it looked nice when I saw it last time. It is related to prolog, and nimtrs already implements large portions of it (no clauses and backtracking, but full support for unification and all auxiliary functions for working with terms and environments).
The DSL for this library uses hmisc/hexceptions to generate much better compilation errors in case of a malformed DSL.
let tree = "h".exampleParse:
A ::= !@*("h")
echo tree.treeRepr()
Unexpected prefix: '!@*'

 2   let tree = "h".exampleParse:
 5:8 A ::= !@*("h")
           ^^^
           |
           Incorrect prefix combination

Raised in grammar_dsl.nim:112 [CodeError:ObjectType]
NOTE: the output is not colored in the readme (because github has failed to support this basic feature since 2014), but it is colored by default in the terminal (can be controlled using the -d:plainStdout compilation flag).