This module turns regular expressions into bytecode for execution on the virtual machine invented by Ken Thompson and described by Russ Cox. It also produces a Graphviz representation of the abstract syntax tree (with some nodes removed).
Some of the VM instructions (NCHAR
, WILD
) are not documented by Cox.
NCHAR
means "match anything but this" and WILD
is the wildcard.
Usage: Compile a regular expression into NFA instructions
Options:
-h, --help show this help message and exit
-d DOT, --dot=DOT Write the AST as a Graphviz dot file
-f FORMAT, --format=FORMAT
Output format, either "pretty" or "json"
-j, --json Alias for --format=json
-p PDF, --pdf=PDF Write the AST as a PDF file. Implies
--dot=FILENAME.dot.
-v, --verbose Print debugging information
pip install -e git+https://github.com/cwru-compilers/regex_compiler.git#egg=rajax
python -m rajax ab[\D8]+
Tokens:
'ORD_CHAR' 'a'
'ORD_CHAR' 'b'
'LBRACK' '['
'ES_SPECIAL' 'D'
'ORD_CHAR' '8'
'RBRACK' ']'
'PLUS' '+'
-------
Graphviz written to AST.dot
PDF written to AST.pdf
-------
Instructions for VM:
0: char a ZERO
1: char b ZERO
2: split 3 5
3: split 4 7
4: jmp 9
5: char : INF
6: jmp 10
7: char ZERO /
8: jmp 10
9: char 8 ZERO
10: split 2 11
11: match
AST.dot
will look like this:
digraph AST {
node [shape=box];
{
rank=same;
1 [label="re_expr_concat: ", style=filled, fillcolor="#eeeeee"];
}
{
rank=same;
2 [label="re_expr_concat: ", style=filled, fillcolor="#eeeeee"];
5 [label="simple_re_dup: +", style=filled, fillcolor="#ccffcc"];
}
{
rank=same;
3 [label="nondup_re_char: a", style=filled, fillcolor="#ffffcc"];
4 [label="nondup_re_char: b", style=filled, fillcolor="#ffffcc"];
6 [label="brack_expr_matching_list: ", style=filled, fillcolor="#ffffff"];
}
{
rank=same;
7 [label="follow_list_multi: ", style=filled, fillcolor="#cccccc"];
}
{
rank=same;
8 [label="char_class: D", style=filled, fillcolor="#ffcccc"];
9 [label="end_range_char: 8", style=filled, fillcolor="#ccccff"];
}
1 -> 2;
1 -> 5;
2 -> 3;
2 -> 4;
5 -> 6;
6 -> 7;
7 -> 8;
7 -> 9;
}