A very simple regular expression engine written in rust.
reginald 0.1.0
A very simple regular expression engine written in rust.
USAGE:
reginald [OPTIONS] <COMMAND> <REGEX> [REPLACE_STR]
ARGS:
<COMMAND> [possible values: match, matches, is]
<REGEX>
<REPLACE_STR>
OPTIONS:
-h, --help Print help information
-i, --input <INPUT>
-V, --version Print version information
cargo install --git https://github.com/ellabellla/reginald
cargo uninstall reginald
You can view an online example of the engine compiled to wasm here.
-
A regular expression is inputted as a string into a Lexer
-
The Lexer generates a stream of tokens from the string
-
The parser validates and generates an abstract syntax tree (AST) from the token stream
-
The AST is compiled into a non-deterministic finite state machine
-
Input is given to test against the regular expression
-
The state machine is simulated with the input to determine if a string is apart of the language the regular expression defines
Node | Description |
---|---|
ZeroOrMore | Has only one child node, Represents "*" |
Optional | Has only one child node, Represents "?" |
OneOrMore | Has only one child node, Represent "+" |
Once | Can have multiple children. Represents Concatenation. Is also the root of the AST. |
Or | Can have multiple children. Represents "|" |
From(min) | Has only one child node, Represents "{min,}". |
To(max) | Has only one child node, Represents "{,max}". "max" cannot be zero |
Between(min, max) | Has only one child node, Represents "{min,max}". "max" cannot be zero. "min" cannot be greater than "max" |
Symbol(char) | Has no child nodes. Represents a character to match |
Set([SetSymbol]) | Has one child. Represent "[ac-b]". Ranges are converted to the corresponding ascii numbers. Only ranges using 0-9, a-z, and A-Z are accepted. |
NotSet([SetSymbol]) | Has one child. Represent "[^ac-b]". Ranges are converted to the corresponding ascii numbers. Only ranges using 0-9, a-z, and A-Z are accepted. |
Any | Has no children. Represent ".". |
flowchart LR
0(Symbol a)
1(Once)
1-->0
2(Symbol b)
3(Once)
3-->2
4(Or)
4-->1
4-->3
5(Once)
5-->4
flowchart LR
0(Any)
1(Once)
1-->0
flowchart LR
0(Symbol a)
1(Optional)
1-->0
2(Once)
2-->1
flowchart LR
0(Symbol a)
1(OneOrMore)
1-->0
2(Once)
2-->1
flowchart LR
0(Symbol a)
1(ZeroOrMore)
1-->0
2(Once)
2-->1
flowchart LR
0(Symbol a)
1(To 3)
1-->0
2(Once)
2-->1
flowchart LR
0(Symbol a)
1(From 2)
1-->0
2(Once)
2-->1
flowchart LR
0(Symbol a)
1(Between 2 and 4)
1-->0
2(Once)
2-->1
flowchart LR
0('a', c99-c100)
1(Once)
1-->0
flowchart LR
0(not 'a', c99-c100)
1(Once)
1-->0
State | Description |
---|---|
Symbol | Any character that is the same as the states character will be matched and the state machine will continue |
Any | Any character will be matched on this state and the state machine will continue |
Set | Any character in the set will be matched and the state machine will continue |
NotSet | Any character not in the set will be matched and the state machine will continue |
Accept | A ending state for the state machine |
None | Used as a junction between states. |
flowchart LR
0(None)
0-->1
0-->2
1('a')
1-->3
2('b')
2-->3
3(None)
3-->4
4(Accept)
flowchart LR
0(None)
0-->1
1(Any)
1-->2
2(Accept)
flowchart LR
0(None)
0-->1
0-->2
1('a')
1-->2
2(None)
2-->3
3(Accept)
flowchart LR
0(None)
0-->1
1('a')
1-->0
1-->2
2(Accept)
flowchart LR
0(None)
0-->1
0-->2
1('a')
1-->0
2(Accept)
flowchart LR
0(None)
0-->1
0-->2
1(None)
1-->5
2('a')
2-->1
2-->3
3('a')
3-->1
3-->4
4('a')
4-->1
5(Accept)
flowchart LR
0(None)
0-->1
1('a')
1-->2
2('a')
2-->3
3(None)
3-->4
3-->5
4('a')
4-->3
5(Accept)
flowchart LR
0(None)
0-->1
1('a')
1-->2
2('a')
2-->3
2-->4
3(None)
3-->6
4('a')
4-->3
4-->5
5('a')
5-->3
6(Accept)
flowchart LR
0(None)
0-->1
1('a', c99-c100)
1-->2
2(Accept)
flowchart LR
0(None)
0-->1
1(not 'a', c99-c100)
1-->2
2(Accept)
char = a_character;
digit = [0-9];
num = digit+;
set = '[' '^'? (char | char '-' char)+ ']';
between = '{' num ',' '}' | '{' ',' num '}' | '{' num ',' num '}';
value = ('.' | char | '(' regex ')' | set) ('?' | '*' | '+' | between );
regex = value+ ('|' value+)*;
Operators | Description | Example | |
---|---|---|---|
| | will match either what is before or after it | a|b | will match with "a" or "b" |
. | any | . | will match with any character |
? | zero or one, greedy | a?b | "ab" or "b" |
+ | one or more, greedy | a+ | one or more "a" |
* | zero or more, greedy | a* | zero or more "a" |
{x,y} | will match an expression at least x times and at most y times | a{,3} | at most three "a" |
a{2,} | at minimum two "a" | ||
a{1,3} | between one and three "a" | ||
() | allows grouping of regular expressions | (a|b)* | will match with "a" or "b" zero or more times |
[] | will match with any characters or ranges in the set | [ac-e] | will match with "a", "c'", "d", "e" |
[^] | will match with any characters not in the set | [^ab] | will match with any character that is not "a" or "b" |
a-z | a range, used in a set, ranges can only be defined with alphanumerical characters | [0-z] | will match will all numbers and upper and lower case letters |
This software is provided under the MIT license. Click here to view.