Rewrite lexer and parser #196

01mf02 · 2024-07-17T11:01:25Z

In the beginning, jaq used pest for its parser and lexer.
Later (3e26651), a new lexer/parser written using chumsky replaced the pest-based lexer/parser.
Over time, several shortcomings of the chumsky-based parser became apparent. First, it was so slow at runtime that it became necessary to cache parse results, in particular those of the standard library (std.jq, the equivalent of builtin.jq in jq), to achieve fast startup times. This meant additional dependencies on serde and bincode. Second, the parser was also slow to compile, accounting for the largest part of the build time, even though I invested quite some time in trying to remedy this (mostly by sprinkling .boxed() throughout the parser).

This PR adds a new, hand-written lexer/parser for jaq.
Its build time is 1.55 seconds, compared to 47.59 seconds for the old chumsky-based parser (measured with cargo build --release --no-default-features -p jaq-{parse,syn}). This solves the long-standing issue #2.
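
To give a rough idea of what "hand-written" means here, the following is a minimal sketch of a lexer that uses no parser library. It is not the lexer from this PR: the names Token, lex and prefix_len are hypothetical, and it only handles words, unsigned integers and a handful of punctuation characters.

```rust
// Illustrative sketch only: this is NOT the lexer from jaq-syn,
// and every name in it is hypothetical.

#[derive(Debug, PartialEq, Eq)]
enum Token<'a> {
    /// identifier or keyword, such as `def` or `a`
    Word(&'a str),
    /// run of ASCII digits, such as `0`
    Num(&'a str),
    /// single punctuation character, such as `:` or `;`
    Punct(char),
}

/// Length in bytes of the longest prefix of `s` whose characters satisfy `pred`.
fn prefix_len(s: &str, pred: impl Fn(char) -> bool) -> usize {
    s.find(|c: char| !pred(c)).unwrap_or(s.len())
}

/// Split `input` into tokens, or return the byte offset of the first
/// character that this lexer does not understand.
fn lex(input: &str) -> Result<Vec<Token<'_>>, usize> {
    let mut tokens = Vec::new();
    let mut rest = input.trim_start();
    while let Some(c) = rest.chars().next() {
        let (token, len) = if c.is_ascii_alphabetic() || c == '_' {
            let len = prefix_len(rest, |c| c.is_ascii_alphanumeric() || c == '_');
            (Token::Word(&rest[..len]), len)
        } else if c.is_ascii_digit() {
            let len = prefix_len(rest, |c| c.is_ascii_digit());
            (Token::Num(&rest[..len]), len)
        } else if ":;.|{}()[],".contains(c) {
            (Token::Punct(c), c.len_utf8())
        } else {
            // `rest` is a suffix of `input`, so this is a byte offset into `input`
            return Err(input.len() - rest.len());
        };
        tokens.push(token);
        // skip the token and any whitespace that follows it
        rest = rest[len..].trim_start();
    }
    Ok(tokens)
}

fn main() {
    println!("{:?}", lex("def a: 0;"));
    println!("{:?}", lex("def 💣"));
}
```

Tokens borrow from the input rather than allocating, which is one common way for such a lexer to stay cheap; whether the lexer in this PR does the same is not something this sketch claims.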

The runtime speed of the new parser is significantly higher than that of the old parser; consider the following benchmark:

$ (for i in `seq 1000000`; do echo "def a: 0;"; done; echo 0) > bla.jq
$ time jaq -n -f bla.jq 0

This writes a file containing 1M instances of def a: 0; (amounting to 9.6MB), then executes jaq on that file.
Using the new parser, jaq takes 1.6 seconds, whereas with the old parser, jaq takes 30.3 seconds.
(For the same file, gojq 0.12.15 takes 4 minutes and 49 seconds, and jq 1.7.1 fails immediately with "error: memory exhausted".)

The executable size of jaq also decreases a bit, going down from 4.8MB (old parser) to about 3.8MB (new parser).

The new parser can parse a few constructs that the old parser could not; for example, object keys can now be keywords (e.g. {if: 1}.if).
Furthermore, the new parser adds support for several new syntactical constructs such as label ... break, generalised nested definitions (#157) and module syntax (module, import, include). However, while the parser can identify these constructs, the jaq compiler currently either ignores them (module) or panics when encountering them (label, break, import, include, generalised nested definitions). Handling this situation correctly (without panicking) requires breaking the current API, so these syntactical constructs will be compiled only as of jaq 2.0.
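
One way to accept keywords as object keys, as in {if: 1}, is to let the parser decide from context whether a word is reserved. The sketch below illustrates that idea under several assumptions: it is not jaq's parser, all names (Term, RESERVED, is_word, term) are hypothetical, the tokens are assumed to be already split into strings, and objects are limited to a single key-value pair.

```rust
// Illustrative sketch only: this is NOT jaq's parser,
// and every name in it is hypothetical.

#[derive(Debug, PartialEq)]
enum Term {
    Num(u32),
    /// call of a named filter, such as `empty`
    Call(String),
    /// object with a single key-value pair, such as `{if: 1}`
    Object(String, Box<Term>),
}

/// A small, incomplete sample of words that are reserved in term position.
const RESERVED: &[&str] = &["if", "then", "else", "end", "def", "as"];

fn is_word(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_alphabetic() || c == '_')
}

/// Parse one term from the front of `tokens`; return it with the unparsed rest.
fn term<'a>(tokens: &'a [&'a str]) -> Result<(Term, &'a [&'a str]), String> {
    match tokens {
        // Key position: any word is accepted, reserved or not.
        ["{", key, ":", rest @ ..] if is_word(key) => {
            let (value, rest) = term(rest)?;
            match rest {
                ["}", rest @ ..] => Ok((Term::Object(key.to_string(), Box::new(value)), rest)),
                _ => Err("expected `}`".to_string()),
            }
        }
        [tok, rest @ ..] => {
            if let Ok(n) = tok.parse() {
                Ok((Term::Num(n), rest))
            } else if is_word(tok) && !RESERVED.contains(tok) {
                // Term position: only non-reserved words are filter calls.
                Ok((Term::Call(tok.to_string()), rest))
            } else {
                Err(format!("unexpected token `{tok}`"))
            }
        }
        [] => Err("unexpected end of input".to_string()),
    }
}

fn main() {
    // `if` is fine as a key ...
    println!("{:?}", term(&["{", "if", ":", "1", "}"]));
    // ... but not as a term on its own.
    println!("{:?}", term(&["if"]));
}
```

In this sketch the token stream itself carries no notion of "keyword"; only the term-position branch consults the reserved list.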

If you are using jaq as a library, you can either continue using the old parser (in jaq-parse) or transition to the new parser (in jaq-syn). See jaq/src/main.rs for how to use the new parser.

01mf02 added 30 commits May 7, 2024 18:29

First prototype of lexer with parcours.

566e9cc

Rewrite tokeniser without library support.

cd5d735

Bare metal string parser.

32d0fcd

Finish lexer conversion.

6293c14

New Token type for simpler and faster lexing.

db2b78e

Remove parcours dependency.

6909174

Nicer handling of words.

0c671f1

Error reporting.

3bde326

Report Unicode errors.

8ac7319

Correct lexing of incorrect string escapes, e.g. "\0".

9c1a3e9

Remove unused function.

4f001de

Make lexer an object.

03df662

Improve string handling.

0f91ca2

Document.

c596dd9

Documentation, a bit of refactoring.

3928fc8

'"' is a delimiter, too.

fd85d66

Enable new lexer!

9ec392f

Compress strings.

8eb8ed6

Removed the old lexer.

0845350

Make Punct derive Eq.

0a126d1

Format.

195f694

Work on new term parser.

03ea6f8

Variable bindings.

50232d8

Fix a typo.

378539b

Parse definitions, less verbose error reporting.

bbf75f1

Properly report next token in blocks.

b31f014

Lex ?// operator.

28a4994

Simplify definitions.

fef8322

Definitions inside terms.

022d913

Remove Main; restrict object construction.

458333a

01mf02 added 7 commits July 17, 2024 09:10

Document.

9968bf6

Allow .key and {key} where key is a keyword, and disallow . key.

6f2478f

Remove def_head.

abd09d5

Clippy.

f30c3fe

Document.

1b82d32

Merge branch 'main' into faster-lexer

6729198

New test.

44e5851

denosaurtrain mentioned this pull request Jul 17, 2024

JSON with Comments (jsonc) support #197

Closed

01mf02 added 16 commits July 17, 2024 19:26

Document.

c379734

Report lex errors for characters c like 💣 where c.len_utf8() != 1.

b46c2b1

Clippy.

9cde4e4

Document.

2772166

Document term type.

1174607

Example for parse function.

2f970c0

Report unsupported operator.

88d55f0

Document.

39c3429

Remove KEYWORDS.

cd8085c

Document expectation.

33f65ab

Document.

e3df12f

Do not attempt to support destructuring alternative operator for now.

dce1814

Correctly compute {$k}.

27798d4

Atomicity tests.

7a74ea7

Make test more meaningful.

b410289

Document.

cf3fc71

01mf02 linked an issue Jul 22, 2024 that may be closed by this pull request

Rewrite parser to reduce release build times #2

Closed

01mf02 merged commit afc2af6 into main Jul 22, 2024
1 check passed

01mf02 deleted the faster-lexer branch July 22, 2024 16:36

01mf02 linked an issue Jul 22, 2024 that may be closed by this pull request

ER: ignore module directive (for the time being) #156

Closed

8e8b2c mentioned this pull request Aug 2, 2024

Upgrade jaq holochain-open-dev/holoom#66

Open

01mf02 mentioned this pull request Nov 11, 2024

ER: ignore module directive (for the time being) #156

Closed
