You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In this article we will be exploring what parser combinators are, what runtime parser generation is - why they’re useful, and then walking through a Zig implementation of them.
Let’s say we want to parse the syntax which describes a regular expression: a[bc].*abc
We can define some parsers to help us parse this syntax (e.g. into tokens or AST nodes):
Suppose that for a[bc].*abc:
RegexLiteralParser can parse a, b, and c, but not abc (the string.)
RegexRangeOpenParser can parse [.
RegexRangeCloseParser can parse ]
RegexAnyParser can parse the . “any character” syntax.
RegexRepetitionParser can parse the * repetition operator.
Now that we have these parsers, we can define parser combinators to help us parse the full regular expression. First, we need something to parse a string abc which we can define as:
What is OneOrMore, though? That’s our first parser combinator!
It takes a single parser as input (in this case, RegexLiteralParser) and uses it to parse the input one or more times. If it succeeded once, the parser combinator succeeded. Otherwise, it failed to parse anything.
Now if we want to parse the [bc] part of our regex, let’s say it can only contain a literal like bc (of course, real regex allows far more than this) we can e.g. reuse our new RegexStringLiteralParser:
In this case, Sequence is a parser combinator which takes multiple parsers and tries to parse them one-after-the-other in order, requiring all to succeed or failing otherwise.
Building upon this basic idea, we can use parser combinators to build a full regex syntax parser:
Going deeper: runtime parser generation
From before, our parser combinatorRegexSyntaxParser is built out of multiple parsers (Regex...Parser) and ultimately produces an AST describing the syntax for a given regex.
We can use the same combinatorial principle here to introduce a new parser generator called RegexParser which uses RegexSyntaxParser to create a brand new parser at runtime that is capable of parsing the actual semantics the regex describes - forming a full regex engine:
A note about traditional regex engines
Revised Mar 10, 2021 to clarify a misunderstanding I had about about the difference between DFA and NFA regex engines. Thanks @burntsushi for helping me to learn!
Production grade regex engines are either finite automata based or backtracking based, and are described in great detail in Russ Cox’s article here and his second article here covering the virtual-machine approach commonly used in regex engines.
It’s worth noting that combinatorial parsing and generating parsers at runtime is very much an uncommon method of implementing a regular expression engine. This is somewhat close to what Comby does in practice, although we use a runtime parser generator instead of parser parser combinators.
One could argue this makes what we’re parsing not strictly regular expressions, although as Larry Wall (author of the Perl programming language) writes, neither are the modern “regexp” pattern matchers you are likely used to:
“Regular expressions” […] are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I’m not going to try to fight linguistic necessity here. I will, however, generally call them “regexes” (or “regexen”, when I’m in an Anglo-Saxon mood).
Implementing the Parser interface
Parser combinators tend to be written in higher-level languages with much fancier type-systems such as Haskell and OCaml, which lend themselves well to higher-order functions like parser combinators.
We’ll be implementing this in Zig, which is a new low-level language aiming to be a better C.
Compile-time vs. run-time
Zig has very cool compile-time code execution semantics which help provide its generics. We’ll be exploring these a bit, but since we want to ultimately build parser generators at runtime (in order to execute a regexp) what we’ll be looking at is mostly runtime parser interfaces rather than compile-time parser interfaces (which are very much possible!)
Since we’ll be dealing with heap allocations, our parser will not be able to run at comptime for now. Once Zig gets comptime heap allocations this should be possible and opens up interesting new opportunities.
The parser interface
We need an interface in Zig which describes a parser as we previously mentioned:
Here it is - there’s a lot to unpack here so we’ll walk through it step-by-step:
This is a Zig function which takes two arbitrary type arguments at comptime, named Value and Reader. Uppercase is used to denote the name of a type in Zig. Thes are:
Value will be the type of the actual value that the parser will produce (e.g. a string of matched text, or an AST note.)
Reader will be the type of the actual source of the raw text to parse (we’ll cover this more later.)
The function itself returns a new type.
What we’re seeing here is the key way in which Zig approaches generic data structures: you merely pass around types as parameters - as if they were values - and you write functions which take types as parameters and return types as values. Some examples of valid calls to this function are:
Parser(u8, []u8) where u8 is an unsigned 8-bit integer and []u8 is a slice of unsigned 8-bit integers.
Parser([]const u8, @TypeOf(reader)) where []const u8 is describing a slice of UTF-8 text (a string) and reader is some reader type, such as std.io.fixedBufferStream("foobar").
Zig runtime interfaces
Now, since we’re trying to define an interface whose actual implementation can be swapped out at runtime - what we need is pretty simple:
A struct type which has the methods we want every implementation to provide.
Those methods to call function pointers which are defined as fields of our struct.
Basically, if someone wants to implement our interface they just need to create a new instance of Parser and populate the fields (callbacks) so their implementation is called when the interface is used.
In our case here, the returned struct has a method that consumers of the interface would invoke called parse - and the function pointer field that implementors will set to get a callback is the _parse field:
Type parameters
Let’s look at some of the data types going around here:
A few other notes:
Error!?Value is just describing the function can return an Error OR no value OR a Value type. See Zig’s error union types and optional types.
callconv(.Inline) is just telling the compiler to inline the function call - since our function isn’t doing a ton.
Errors the Parser interface can produce
Our error type might start out looking something like this:
error{...} describes a set of potential errors and || std.mem.Allocator.Error merely says to merge the allocator type’s error set with ours - so our potential set of errors includes ours and theirs.
As we start performing different operations within parsers, it will become more complex to describe more potential sources of errors:
In the above, the type T in const parser: T is denoting the type of the constant named parser - in this case it’ll be the type returned by Parser([]u8, @TypeOf(reader)). And this:
something=.{._parse=myParse,}
Is the Zig syntax for populating a struct. We’re setting the _parse field to myParse. Zig can infer the type of the struct if you write a .{} instead of T{} - which avoids the need for us to repeat the call to the Parser() function which is verbose.
What actually is a “Reader”?
Up to this point, we’ve just talked about Reader as being any type.
However, in contrast to our Parser type which invokes function pointers at runtime, the std.io.Reader interface is a compile time type - meaning calls to the underlying implementation do not involve a pointer dereference.
This means that, for now, we cannot simply define our function as accepting only an std.io.Reader interface - instead we must declare that we accept any type which we’ll call Reader, write our code as if it is an std.io.Reader - and the compiler will just barf if anybody passes something in that isn’t an std.io.Reader. This can sometimes lead to confusing compiler error messages (“there’s an error in the standard library code? Ah, no, I just needed to pass a .reader()!”).
A Parser that parses a literal string
If we want a Parser interface implementation that parses a specific string literal, one way to do that is to also make that a generic function which accepts any reader type (so we’re not restricted to e.g. just file inputs):
This is pretty good - but we need some way to have the type we return implement the Parser interface we defined. The way to do this is by defining a field in our struct:
<span>const</span> <span>Self</span> <span>=</span> <span>@This</span><span>();</span>
<span>// The `want` string must stay alive for as long as the parser will be used.</span>
<span>pub</span> <span>fn</span> <span>init</span><span>(</span><span>want</span><span>:</span> <span>[]</span><span>const</span> <span>u8</span><span>)</span> <span>Self</span> <span>{</span>
<span>return</span> <span>Self</span><span>{</span>
<span>.</span><span>want</span> <span>=</span> <span>want</span>
<span>};</span>
<span>}</span>
<span>};</span>
}
In this case, want is the string literal we want to match - and []const u8 is Zig’s string type. It describes a slice of immutable (non-modifiable) encoded UTF-8 bytes.
Unlike C, []const u8 being a slice means it is a pointer to the string in memory and its length - so we don’t have to pass around the length parameter separately or use a null-terminated string. In Zig, there are two ways to represent a string:
But wait a minute! In order for the ._parse = parse, assignment to work the first argument to parse needs to be the self parameter for a Parser([]u8, Reader) - so how does ourparse implementation method get to access the want field of our struct?
This is where some Zig magic comes in: on obscure builtin function we can use inside of our parse method:
To understand this, first let’s get a look at what these parameters are referring to:
We can see from the Zig documentation that this function operates as follows:
Given a pointer to a field, returns the base pointer of a struct.
So in our case:
Self is the “parent struct” we’re trying to acquire a reference to (our type)
"parser" is the name of our struct’s field.
parser is the pointer to our parser struct field.
Hopefully you can start to see the link here: parser is a pointer to our struct field, so Zig has a little helper @fieldParentPtr which can rely on that fact to give us our struct given a pointer to our struct field.
Implementing the rest of parse
Our full parse method will look like this:
// If a value is returned, it is up to the caller to free it.fnparse(parser:*Parser([]u8,Reader),allocator:*Allocator,src:*Reader)callconv(.Inline)Error!?[]u8{constself=@fieldParentPtr(Self,"parser",parser);constbuf=tryallocator.alloc(u8,self.want.len);errdeferallocator.free(buf);constread=trysrc.reader().readAll(buf);if(read<self.want.lenor!std.mem.eql(u8,buf,self.want)){trysrc.seekableStream().seekBy(-@intCast(i64,read));allocator.free(buf);// parsing failedreturnnull;}returnbuf;}
There are a few notable things here:
We’re trying to return a string from our parse function, i.e. the value it emits is a string (instead of an AST node).
The want string we got inside of our init method is agreed to only be valid while parse will still be called. We’ve decided to create a contract that all of our Parser implementations will either not hold onto memory given by others - or if they do, only do so until parse returns. Hence, we need to allocate a new string in our method.
Normally we could rely solely on defer (“run at end of function”) or errdefer (“run if an error is returned”), but since we’ve chosen to reserve the none optionalnull as “we didn’t parse anything” we need to manually free if we return null;. A nulldefer and somedefer could be nice, maybe?
Putting it all together, you’ll get something like this: GitHub gist.
Our first parser combinator
To demonstrate how a parser combinator would be implemented, we’ll try implementing the OneOf operator. It will take any number of parsers as input and run them consecutively until one succeeds or none do.
Let’s first start by writing out the basic structure of our function:
You’ll notice here that in contrast to our Literalparser function from earlier, this function takes a second comptime Value: type parameter. This is because we want it to work with any existing Parser implementation, regardless of what type of value it produces.
We can start to fill in the type by adding our init method:
<span>const</span> <span>Self</span> <span>=</span> <span>@This</span><span>();</span>
<span>// `parsers` slice must stay alive for as long as the parser will be</span>
<span>// used.</span>
<span>pub</span> <span>fn</span> <span>init</span><span>(</span><span>parsers</span><span>:</span> <span>[]</span><span>*</span><span>Parser</span><span>(</span><span>Value</span><span>,</span> <span>Reader</span><span>))</span> <span>Self</span> <span>{</span>
<span>return</span> <span>Self</span><span>{</span>
<span>.</span><span>parsers</span> <span>=</span> <span>parsers</span><span>,</span>
<span>};</span>
<span>}</span>
<span>};</span>
}
As you can see here, we’re going to simply take in a list of pointers to parsers. They’ll all need to have the same return Value as was specified in the call to OneOf.
One reason for this is that Zig does not support return type inference. You can have a function which takes anytype as a parameter, but it cannot return an anytype. This just means we need to have a generic function (in this case, OneOf) which accepts a type parameter and then use that Value type later. In a language like Haskell or OCaml, this would not be true.
try one_of_parser.parse(allocator, src); indicates that if parsing using one_of_parser returns an error that our function should return immediately and not continue attempting to parse with other parsers.
if (result != null) { is how you check if an Optional type in Zig is “None”. I find this pretty interesting: it’s not null, it’s actually an optional “none” type - but it is called null. I’m not sure why, but can imagine this making the language friendlier to people unfamiliar with optional types.
Using our OneOf parser combinator
Now for the cool part: we get to put both our Literal parser and OneOf parser combinator to build a new parser!
The above will parse one of "dog", "sheep", or "cat" from the input reader.
We’re passing @TypeOf(reader) frequently above which makes the code a bit more cryptic than needed, and it would be possible to introduce a OneOfLiteral helper which makes the above instead read:
You might be wondering how we would go from the Literalparser and OneOfparser combinator to actually generating a parser at runtime that can parse the semantics defined in a regexp string.
Since our Parser interface is a runtime interface (you can swap out the implementation at runtime) and since our parser combinator OneOf operates using that interface (only the return value must be known at compile time, it could be a generic AST node) it means that we can easily dynamically create slices of []*Parser(...) at runtime based on the result of a parser combinator we have built - like our “dog, cat, sheep” parser from earlier.
The challenge left for you as a reader is to:
Write parsers like our Literal parser that can parse the components of our regexp a[bc].*abc:
RegexLiteralParser can parse a, b, and c, but not abc (the string.)
RegexRangeOpenParser can parse [.
RegexRangeCloseParser can parse ]
RegexAnyParser can parse the . “any character” syntax.
RegexRepetitionParser can parse the * repetition operator.
Write a parser combinators like our OneOf parser, except have it parse a Sequence of parsers.
Use our Sequence parser combinator and RegexLiteralParser to build a RegexStringLiteralParser - similar to how we built out “dog, cat, sheep” parser.
Write a new kind of function called a runtime parser generator named RegexParser which will be super familiar:
Take in a parser combinator called RegexSyntaxParser which can turn your regexp syntax into some intermediary like an AST.
Have your function use parser combinators like OneOf, Sequence, etc. to build a brand new parser at runtime based on that intermediary AST.
Return that new parser which parses the actual semantics described by the input regexp!
Closing thoughts
I am sorry for not giving you a full (or even partial) regex engine :) I am still exploring this and it is a large undertaking, this blog post would be far too long if it was included.
You can find a copy of the final code with parsers and parser combinatorshere. Just zig init-exe and plop them into your src/ directory.
You may also want to check out Mecha, a parser combinator library for Zig.
If anything was unclear or confusing, I’m happy to help: shoot me an email [email protected] or leave a comment on Hacker News / Reddit and I’ll follow up.
via Hexops' devlog
March 16, 2022 at 02:13PM
The text was updated successfully, but these errors were encountered:
Zig, Parser Combinators - and Why They're Awesome
https://ift.tt/JW16kYh
In this article we will be exploring what parser combinators are, what runtime parser generation is - why they’re useful, and then walking through a Zig implementation of them.
What are parser combinators?
A parser parses some text to produce a result:
A parser combinator is a higher-order function which takes parsers as input and produces a new parser as output:
Why are parser combinators useful?
Let’s say we want to parse the syntax which describes a regular expression:
a[bc].*abc
We can define some parsers to help us parse this syntax (e.g. into tokens or AST nodes):
Suppose that for
a[bc].*abc
:RegexLiteralParser
can parsea
,b
, andc
, but notabc
(the string.)RegexRangeOpenParser
can parse[
.RegexRangeCloseParser
can parse]
RegexAnyParser
can parse the.
“any character” syntax.RegexRepetitionParser
can parse the*
repetition operator.Now that we have these parsers, we can define parser combinators to help us parse the full regular expression. First, we need something to parse a string
abc
which we can define as:What is
OneOrMore
, though? That’s our first parser combinator!It takes a single parser as input (in this case,
RegexLiteralParser
) and uses it to parse the input one or more times. If it succeeded once, the parser combinator succeeded. Otherwise, it failed to parse anything.Now if we want to parse the
[bc]
part of our regex, let’s say it can only contain a literal likebc
(of course, real regex allows far more than this) we can e.g. reuse our newRegexStringLiteralParser
:In this case,
Sequence
is a parser combinator which takes multiple parsers and tries to parse them one-after-the-other in order, requiring all to succeed or failing otherwise.Building upon this basic idea, we can use parser combinators to build a full regex syntax parser:
Going deeper: runtime parser generation
From before, our parser combinator
RegexSyntaxParser
is built out of multiple parsers (Regex...Parser
) and ultimately produces an AST describing the syntax for a given regex.We can use the same combinatorial principle here to introduce a new parser generator called
RegexParser
which usesRegexSyntaxParser
to create a brand new parser at runtime that is capable of parsing the actual semantics the regex describes - forming a full regex engine:A note about traditional regex engines
Revised Mar 10, 2021 to clarify a misunderstanding I had about about the difference between DFA and NFA regex engines. Thanks @burntsushi for helping me to learn!
Production grade regex engines are either finite automata based or backtracking based, and are described in great detail in Russ Cox’s article here and his second article here covering the virtual-machine approach commonly used in regex engines.
It’s worth noting that combinatorial parsing and generating parsers at runtime is very much an uncommon method of implementing a regular expression engine. This is somewhat close to what Comby does in practice, although we use a runtime parser generator instead of parser parser combinators.
One could argue this makes what we’re parsing not strictly regular expressions, although as Larry Wall (author of the Perl programming language) writes, neither are the modern “regexp” pattern matchers you are likely used to:
Implementing the Parser interface
Parser combinators tend to be written in higher-level languages with much fancier type-systems such as Haskell and OCaml, which lend themselves well to higher-order functions like parser combinators.
We’ll be implementing this in Zig, which is a new low-level language aiming to be a better C.
Compile-time vs. run-time
Zig has very cool compile-time code execution semantics which help provide its generics. We’ll be exploring these a bit, but since we want to ultimately build parser generators at runtime (in order to execute a regexp) what we’ll be looking at is mostly runtime parser interfaces rather than compile-time parser interfaces (which are very much possible!)
Since we’ll be dealing with heap allocations, our parser will not be able to run at comptime for now. Once Zig gets comptime heap allocations this should be possible and opens up interesting new opportunities.
The parser interface
We need an interface in Zig which describes a parser as we previously mentioned:
Here it is - there’s a lot to unpack here so we’ll walk through it step-by-step:
Zig generics are provided via type parameters
This is a Zig function which takes two arbitrary
type
arguments atcomptime
, namedValue
andReader
. Uppercase is used to denote the name of a type in Zig. Thes are:Value
will be the type of the actual value that the parser will produce (e.g. a string of matched text, or an AST note.)Reader
will be the type of the actual source of the raw text to parse (we’ll cover this more later.)The function itself returns a new type.
What we’re seeing here is the key way in which Zig approaches generic data structures: you merely pass around types as parameters - as if they were values - and you write functions which take types as parameters and return types as values. Some examples of valid calls to this function are:
Parser(u8, []u8)
whereu8
is an unsigned 8-bit integer and[]u8
is a slice of unsigned 8-bit integers.Parser([]const u8, @TypeOf(reader))
where[]const u8
is describing a slice of UTF-8 text (a string) andreader
is some reader type, such asstd.io.fixedBufferStream("foobar")
.Zig runtime interfaces
Now, since we’re trying to define an interface whose actual implementation can be swapped out at runtime - what we need is pretty simple:
struct
type which has the methods we want every implementation to provide.Basically, if someone wants to implement our interface they just need to create a new instance of
Parser
and populate the fields (callbacks) so their implementation is called when the interface is used.This is the same pattern used by the Zig
std.mem.Allocator
interface.In our case here, the returned struct has a method that consumers of the interface would invoke called
parse
- and the function pointer field that implementors will set to get a callback is the_parse
field:Type parameters
Let’s look at some of the data types going around here:
A few other notes:
Error!?Value
is just describing the function can return anError
OR no value OR aValue
type. See Zig’s error union types and optional types.callconv(.Inline)
is just telling the compiler to inline the function call - since our function isn’t doing a ton.Errors the Parser interface can produce
Our error type might start out looking something like this:
error{...}
describes a set of potential errors and|| std.mem.Allocator.Error
merely says to merge the allocator type’s error set with ours - so our potential set of errors includes ours and theirs.As we start performing different operations within parsers, it will become more complex to describe more potential sources of errors:
Zig can often infer error sets but only in some contexts today.
Our first Parser
All we need to do in order to implement a
Parser
is provide the_parse
method, and define its returnValue
type andReader
input type:In the above, the type
T
inconst parser: T
is denoting the type of the constant namedparser
- in this case it’ll be the type returned byParser([]u8, @TypeOf(reader))
. And this:Is the Zig syntax for populating a struct. We’re setting the
_parse
field tomyParse
. Zig can infer the type of the struct if you write a.{}
instead ofT{}
- which avoids the need for us to repeat the call to theParser()
function which is verbose.What actually is a “Reader”?
Up to this point, we’ve just talked about
Reader
as being any type.Similar to our
Parser
interface, the Zig standard library provides astd.io.Reader
interface and there are many implementors of it including:std.fs.File
std.io.fixedBufferStream("foobar")
std.net.Stream
(network sockets)However, in contrast to our
Parser
type which invokes function pointers at runtime, thestd.io.Reader
interface is a compile time type - meaning calls to the underlying implementation do not involve a pointer dereference.Today, Zig is in early stages (version 0.7) and does not have anything like an interface or trait type (although it seems likely this will be improved in the future.)
This means that, for now, we cannot simply define our function as accepting only an
std.io.Reader
interface - instead we must declare that we accept any type which we’ll callReader
, write our code as if it is anstd.io.Reader
- and the compiler will just barf if anybody passes something in that isn’t anstd.io.Reader
. This can sometimes lead to confusing compiler error messages (“there’s an error in the standard library code? Ah, no, I just needed to pass a.reader()
!”).A Parser that parses a literal string
If we want a
Parser
interface implementation that parses a specific string literal, one way to do that is to also make that a generic function which accepts any reader type (so we’re not restricted to e.g. just file inputs):This is pretty good - but we need some way to have the type we return implement the
Parser
interface we defined. The way to do this is by defining a field in our struct:Now a consumer can write the following to get a literal string parser:
Passing parameters to a parser implementation
If we want our
Literal
parser to accept a parameter – the literal string to look for – we need to give it a parameter.In the case of merely passing it a string, we could adjust the signature so that this is possible:
However, we’ll define ours using an
init
method which is more common in Zig data structures:In this case,
want
is the string literal we want to match - and[]const u8
is Zig’s string type. It describes a slice of immutable (non-modifiable) encoded UTF-8 bytes.Unlike C,
[]const u8
being a slice means it is a pointer to the string in memory and its length - so we don’t have to pass around the length parameter separately or use a null-terminated string. In Zig, there are two ways to represent a string:[]const u8
(unmodifiable string, most common)[]u8
(modifiable string)Understanding Zig’s wild/confusing
@fieldParentPtr
We’re finally ready to actually have our
Literal
parser parse something! We just need to implement ourparse
method:But wait a minute! In order for the
._parse = parse,
assignment to work the first argument toparse
needs to be theself
parameter for aParser([]u8, Reader)
- so how does ourparse
implementation method get to access thewant
field of our struct?This is where some Zig magic comes in: on obscure builtin function we can use inside of our
parse
method:To understand this, first let’s get a look at what these parameters are referring to:
We can see from the Zig documentation that this function operates as follows:
So in our case:
Self
is the “parent struct” we’re trying to acquire a reference to (our type)"parser"
is the name of our struct’s field.parser
is the pointer to ourparser
struct field.Hopefully you can start to see the link here:
parser
is a pointer to our struct field, so Zig has a little helper@fieldParentPtr
which can rely on that fact to give us our struct given a pointer to our struct field.Implementing the rest of
parse
Our full
parse
method will look like this:There are a few notable things here:
parse
function, i.e. the value it emits is a string (instead of an AST node).want
string we got inside of ourinit
method is agreed to only be valid whileparse
will still be called. We’ve decided to create a contract that all of ourParser
implementations will either not hold onto memory given by others - or if they do, only do so untilparse
returns. Hence, we need to allocate a new string in our method.defer
(“run at end of function”) orerrdefer
(“run if an error is returned”), but since we’ve chosen to reserve the none optionalnull
as “we didn’t parse anything” we need to manually free if wereturn null;
. Anulldefer
andsomedefer
could be nice, maybe?Putting it all together, you’ll get something like this: GitHub gist.
Our first parser combinator
To demonstrate how a parser combinator would be implemented, we’ll try implementing the
OneOf
operator. It will take any number of parsers as input and run them consecutively until one succeeds or none do.Let’s first start by writing out the basic structure of our function:
You’ll notice here that in contrast to our
Literal
parser function from earlier, this function takes a secondcomptime Value: type
parameter. This is because we want it to work with any existingParser
implementation, regardless of what type of value it produces.We can start to fill in the type by adding our
init
method:As you can see here, we’re going to simply take in a list of pointers to parsers. They’ll all need to have the same return
Value
as was specified in the call toOneOf
.One reason for this is that Zig does not support return type inference. You can have a function which takes
anytype
as a parameter, but it cannot return ananytype
. This just means we need to have a generic function (in this case,OneOf
) which accepts a type parameter and then use thatValue
type later. In a language like Haskell or OCaml, this would not be true.Finally, we can implement our
parse
method:There are a few things to unpack here:
try one_of_parser.parse(allocator, src);
indicates that if parsing usingone_of_parser
returns an error that our function should return immediately and not continue attempting to parse with other parsers.if (result != null) {
is how you check if an Optional type in Zig is “None”. I find this pretty interesting: it’s notnull
, it’s actually an optional “none” type - but it is callednull
. I’m not sure why, but can imagine this making the language friendlier to people unfamiliar with optional types.Using our OneOf parser combinator
Now for the cool part: we get to put both our
Literal
parser andOneOf
parser combinator to build a new parser!The above will parse one of
"dog"
,"sheep"
, or"cat"
from the input reader.We’re passing
@TypeOf(reader)
frequently above which makes the code a bit more cryptic than needed, and it would be possible to introduce aOneOfLiteral
helper which makes the above instead read:One thing to unpack here is this syntax for passing an array to
init
:&.{...}
:parsers: []*Parser(Value, Reader)
.{...}
would give us a fixed size array[3]*Parser(Value, Reader)
&.{}
gives us a pointer to an array, i.e. a slice[]*Parser(Value, Reader)
.Since our list is known at compile time, we don’t have to allocate or free memory for the slice. If our list was dynamic, we would need to do so.
Finally, we can actually use our parser above:
Runtime parser generation
You might be wondering how we would go from the
Literal
parser andOneOf
parser combinator to actually generating a parser at runtime that can parse the semantics defined in a regexp string.Since our
Parser
interface is a runtime interface (you can swap out the implementation at runtime) and since our parser combinatorOneOf
operates using that interface (only the return value must be known at compile time, it could be a generic AST node) it means that we can easily dynamically create slices of[]*Parser(...)
at runtime based on the result of a parser combinator we have built - like our “dog, cat, sheep” parser from earlier.The challenge left for you as a reader is to:
Literal
parser that can parse the components of our regexpa[bc].*abc
:RegexLiteralParser
can parsea
,b
, andc
, but notabc
(the string.)RegexRangeOpenParser
can parse[
.RegexRangeCloseParser
can parse]
RegexAnyParser
can parse the.
“any character” syntax.RegexRepetitionParser
can parse the*
repetition operator.OneOf
parser, except have it parse aSequence
of parsers.Sequence
parser combinator andRegexLiteralParser
to build aRegexStringLiteralParser
- similar to how we built out “dog, cat, sheep” parser.RegexParser
which will be super familiar:RegexSyntaxParser
which can turn your regexp syntax into some intermediary like an AST.Closing thoughts
I am sorry for not giving you a full (or even partial) regex engine :) I am still exploring this and it is a large undertaking, this blog post would be far too long if it was included.
You can find a copy of the final code with parsers and parser combinators here. Just
zig init-exe
and plop them into yoursrc/
directory.You may also want to check out Mecha, a parser combinator library for Zig.
If anything was unclear or confusing, I’m happy to help: shoot me an email [email protected] or leave a comment on Hacker News / Reddit and I’ll follow up.
via Hexops' devlog
March 16, 2022 at 02:13PM
The text was updated successfully, but these errors were encountered: