This repository has been archived by the owner on Jul 27, 2023. It is now read-only.

perf: Cursor based lexer #38

Merged
MichaReiser merged 6 commits into main from cursor-based-lexer on Jul 26, 2023

Conversation

@MichaReiser (Member) commented on Jul 23, 2023

This PR rewrites the lexer to use the Cursor abstraction that we use for lexing in Ruff. I aimed to port some of the improvements from #36 with (almost) no breaking changes to the public API.

The PR includes some further performance improvements that also simplified the refactoring:

  • Use Cursor instead of CharWindow: It has better ergonomics in my view, and it avoids keeping a 4-character lookahead buffer. I'm also fairly certain that Cursor is easier for LLVM to optimize than CharWindow (a sketch of the cursor, including the ASCII fast path from the next bullet, follows this list).
  • Use a fast path for ASCII-only characters: The old implementation checked, for every character, whether it is a valid Unicode identifier start. Performing the Unicode table lookup is unnecessary if we know that the character is ASCII-only.
  • The existing implementation uses a pending buffer to which it writes every token before returning it. The buffer always contains at least the next token, but it can contain all tokens from the end of the previous logical line up to the first token on the next logical line (newline, dedents/indents, comments, token). This results in an unnecessary write and read in the simple case where the buffer holds a single token, and in a shift of all remaining elements when it holds multiple, because the implementation appends at the back and removes from the front. The new implementation removes the Vec entirely and instead keeps an Option to track pending Dedents.
  • The existing Lexer supports lexing from an offset. I removed this and instead introduced a new wrapper Iterator that offsets the returned tokens and errors by the given start offset. This simplifies the Lexer because it no longer has to deal with both relative offsets (to index into the source string) and absolute offsets, and only callers that lex from an offset pay for the translation; lexing from the start has no overhead (a sketch of the wrapper also follows this list).
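
To make the cursor and fast-path bullets concrete, here is a minimal sketch (not this PR's actual code), assuming a rustc-style cursor API; `eat_char` and the `unicode-ident` dependency are illustrative stand-ins:

```rust
use std::str::Chars;

const EOF_CHAR: char = '\0';

pub struct Cursor<'a> {
    chars: Chars<'a>,
}

impl<'a> Cursor<'a> {
    pub fn new(source: &'a str) -> Self {
        Self { chars: source.chars() }
    }

    /// Peeks at the next character without consuming it. Cloning `Chars` is
    /// cheap (it copies two pointers), so no lookahead buffer is needed.
    pub fn first(&self) -> char {
        self.chars.clone().next().unwrap_or(EOF_CHAR)
    }

    /// Consumes and returns the next character.
    pub fn bump(&mut self) -> Option<char> {
        self.chars.next()
    }

    /// Consumes the next character if it matches `expected`.
    pub fn eat_char(&mut self, expected: char) -> bool {
        if self.first() == expected {
            self.bump();
            true
        } else {
            false
        }
    }
}

/// ASCII fast path for identifier starts: the Unicode table lookup only
/// happens for non-ASCII characters. `unicode_ident::is_xid_start` stands
/// in here for whatever Unicode lookup the lexer actually uses.
fn is_identifier_start(c: char) -> bool {
    if c.is_ascii() {
        c.is_ascii_alphabetic() || c == '_'
    } else {
        unicode_ident::is_xid_start(c)
    }
}
```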
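And a sketch of the offsetting wrapper from the last bullet, assuming the `text-size` crate's TextRange/TextSize types; the struct name and the simplified token/error types are made up for illustration (the real error type carries a location that would be shifted the same way):

```rust
use text_size::{TextRange, TextSize};

/// Simplified stand-ins for the lexer's real token and error types.
type Tok = ();
type LexResult = Result<(Tok, TextRange), String>;

/// Wraps a token iterator and shifts every returned range by a fixed start
/// offset, so the inner lexer only ever works with relative offsets.
struct SourceOffsetIterator<I> {
    inner: I,
    start_offset: TextSize,
}

impl<I> Iterator for SourceOffsetIterator<I>
where
    I: Iterator<Item = LexResult>,
{
    type Item = LexResult;

    fn next(&mut self) -> Option<Self::Item> {
        // Shift the token's range; lexing from the start never constructs
        // this wrapper and therefore pays no overhead.
        self.inner
            .next()
            .map(|result| result.map(|(tok, range)| (tok, range + self.start_offset)))
    }
}
```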

Public API changes

  • The existing implementation returns an error token, followed by the closing parenthesis, if the parentheses are unbalanced. This is no longer possible in the new implementation because I removed the pending buffer. I removed the error for unbalanced parentheses; this is a parser problem in my view. The error also came too late, because the lexer incorrectly assumed that it was in an unparenthesized context, handling and checking indents/dedents. But there isn't really anything we can do about that now... Ruff relies on the closing parenthesis being returned, e.g. when lexing what comes after the as keyword in with (a as ex): (the lexer only sees ex):, which has unbalanced parentheses).
  • The existing implementation normalized newlines to \n inside strings. Re-introducing the normalization shouldn't be hard, and it is kind of neat. However, it means that round-trip parsing changes the line breaks in strings... I don't think we want this.
  • The implementation now supports mixed tab and space indentation, using (hopefully) the same algorithm as CPython; a sketch of that algorithm follows this list.
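
For reference, a sketch of the indentation comparison CPython's tokenizer performs (names here are illustrative; CPython keeps the equivalent pair of widths as tabsize 8 and alttabsize 1):

```rust
use std::cmp::Ordering;

/// Each indentation is measured twice: `column` with tabs advancing to the
/// next multiple of 8, and `alt_column` with a tab counting as one column.
#[derive(Copy, Clone, Debug, Default, PartialEq, Eq)]
struct Indentation {
    column: u32,
    alt_column: u32,
}

impl Indentation {
    /// Accounts for one indentation character (only ' ' and '\t' occur here).
    fn add_char(self, c: char) -> Self {
        match c {
            '\t' => Self {
                column: (self.column / 8 + 1) * 8,
                alt_column: self.alt_column + 1,
            },
            _ => Self {
                column: self.column + 1,
                alt_column: self.alt_column + 1,
            },
        }
    }

    /// Two indentations compare consistently only if both measurements agree.
    /// `None` corresponds to CPython's "inconsistent use of tabs and spaces
    /// in indentation" (a TabError).
    fn try_compare(self, other: Self) -> Option<Ordering> {
        let ord = self.column.cmp(&other.column);
        if ord == self.alt_column.cmp(&other.alt_column) {
            Some(ord)
        } else {
            None
        }
    }
}
```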

Future improvements

  • Use &str for Comment, String, and Identifier: The lexer doesn't perform any normalization on strings, comments, and identifiers. We can therefore just store a &str referencing the content in the source instead of allocating a String for each of them. This should greatly improve performance because identifiers are probably the most common tokens besides whitespace. A later step could then be to propagate this change to the AST too (see the sketch below).
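
A sketch of what those borrowed tokens could look like (variant shapes are illustrative, not this PR's code):

```rust
/// Borrowed token kinds: since the lexer performs no normalization on these,
/// they can borrow a slice of the source text instead of allocating a
/// `String` per token.
enum Tok<'src> {
    Identifier(&'src str),
    Comment(&'src str),
    String { value: &'src str },
    // ... remaining (owned) variants unchanged
}
```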

Performance

I used Ruff's benchmarks, which measure more than just lexing; even the parser benchmark measures lexing, parsing, and traversing the AST. The parser benchmark improvement ranges from 20% to 36% in wall time. This is huge, considering that the lexer is only a small portion of what the benchmark measures.

group                                      cursor                                   main
-----                                      ----                                   -----
formatter/large/dataset.py                 1.00      4.1±0.03ms     9.9 MB/sec    1.09      4.5±0.02ms     9.1 MB/sec
formatter/numpy/ctypeslib.py               1.00    808.7±1.41µs    20.6 MB/sec    1.12    905.0±3.66µs    18.4 MB/sec
formatter/numpy/globals.py                 1.00     76.2±0.33µs    38.7 MB/sec    1.19     90.5±0.24µs    32.6 MB/sec
formatter/pydantic/types.py                1.00   1735.7±2.46µs    14.7 MB/sec    1.11   1929.0±3.60µs    13.2 MB/sec
linter/all-rules/large/dataset.py          1.00      5.7±0.03ms     7.1 MB/sec    1.07      6.1±0.03ms     6.7 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00   1447.3±4.28µs    11.5 MB/sec    1.07   1544.0±2.48µs    10.8 MB/sec
linter/all-rules/numpy/globals.py          1.00    135.1±0.81µs    21.8 MB/sec    1.09    146.8±0.58µs    20.1 MB/sec
linter/all-rules/pydantic/types.py         1.00      2.6±0.01ms     9.9 MB/sec    1.09      2.8±0.02ms     9.1 MB/sec
linter/default-rules/large/dataset.py      1.00      2.9±0.00ms    14.2 MB/sec    1.13      3.2±0.02ms    12.5 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00    557.5±6.95µs    29.9 MB/sec    1.18   658.2±15.02µs    25.3 MB/sec
linter/default-rules/numpy/globals.py      1.00     58.6±0.29µs    50.4 MB/sec    1.24     72.6±0.97µs    40.6 MB/sec
linter/default-rules/pydantic/types.py     1.00   1193.2±5.02µs    21.4 MB/sec    1.18  1410.9±26.96µs    18.1 MB/sec
parser/large/dataset.py                    1.00      2.2±0.01ms    18.7 MB/sec    1.19      2.6±0.01ms    15.7 MB/sec
parser/numpy/ctypeslib.py                  1.00    398.4±0.55µs    41.8 MB/sec    1.28    509.6±1.48µs    32.7 MB/sec
parser/numpy/globals.py                    1.00     40.6±0.54µs    72.6 MB/sec    1.36     55.3±0.08µs    53.4 MB/sec
parser/pydantic/types.py                   1.00    858.2±4.58µs    29.7 MB/sec    1.27   1088.0±2.35µs    23.4 MB/sec

Roll out

This is the part I'm most scared about. I definitely want to do another careful review of the changes myself. Does the ecosystem check report new syntax errors?

I also need to incorporate the changes from the other open PRs (magic commands).

Other changes

I used this as a chance to remove the unused features... They got in my way ;)

@MichaReiser (Member, Author)

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

@MichaReiser (Member, Author)

I'm opening this for review. It's not done done but I'm interested in getting some feedback.

@MichaReiser MichaReiser marked this pull request as ready for review July 23, 2023 10:08
@MichaReiser MichaReiser requested a review from dhruvmanila July 23, 2023 10:17
@MichaReiser MichaReiser force-pushed the cursor-based-lexer branch 2 times, most recently from c6fca41 to 930324e on July 23, 2023 10:48
parser/src/gen/parse.rs (outdated, resolved)
parser/src/lexer/cursor.rs (outdated, resolved)
parser/src/lexer/indentation.rs (resolved)
parser/src/lexer/indentation.rs (outdated, resolved)
parser/src/lexer/indentation.rs (outdated, resolved)
parser/src/lexer.rs (outdated, resolved)
Ok((Tok::Newline, TextRange::empty(self.offset())))
}
// Next, flush the indentation stack to zero.
else if self.indentations.pop().is_some() {
Member
Is it intentional that this is not a while loop anymore?

Member Author
Yes, because the pending stack is gone.

The existing implementation writes any pending newline and dedents to the pending vector and then pops them off one by one in the next method.

The new implementation relies on the loop around next. It first returns the newline (and updates its internal state). The lexer is still at the end of the file on the next call and returns the dedents one at a time until the stack is empty. It then finally returns the end-of-file token (forever).
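
A sketch of that end-of-file control flow with simplified stand-in types (not the PR's actual code):

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    Newline,
    Dedent,
    EndOfFile,
}

struct EofState {
    /// Remaining indentation levels still to be closed.
    indentations: Vec<u32>,
    newline_emitted: bool,
}

impl EofState {
    /// Called once per `next()` while the lexer sits at end of file: first
    /// the trailing logical newline, then one dedent per call until the
    /// indentation stack is empty, then `EndOfFile` forever.
    fn next(&mut self) -> Tok {
        if !self.newline_emitted {
            self.newline_emitted = true;
            Tok::Newline
        } else if self.indentations.pop().is_some() {
            Tok::Dedent
        } else {
            Tok::EndOfFile
        }
    }
}
```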

parser/src/token.rs (outdated, resolved)
parser/src/token.rs (outdated, resolved)
parser/build.rs (resolved)
@charliermarsh (Member)

Wow, that is a ridiculously good PR summary (both well-written, and the change itself has clear, enormous benefits).

@charliermarsh (Member)

> Does the ecosystem check report new syntax errors?

It should, because we track a diagnostic for those (E999), and we'd expect to see some other diagnostics "disappear".

@charliermarsh (Member)

I'm planning to leave the in-depth lexer review to @konstin since it looks like he's already read through it in detail, but can you tag me if there are any specific things you want input on? And/or if you feel another close review is needed?

@dhruvmanila (Member) left a comment

This is amazing work! I mainly have a few comments around the Jupyter magic command handling, but otherwise this looks good.

parser/src/lexer/indentation.rs (outdated, resolved)
parser/src/lexer/cursor.rs (outdated, resolved)
parser/src/lexer.rs (outdated, resolved)
parser/src/lexer.rs (outdated, resolved)
@dhruvmanila (Member)

> Use &str for Comment, String, and Identifier: The lexer doesn't perform any normalization on Strings, Comments, and Identifiers.

Might be able to use &str for MagicCommand as well ;)

@MichaReiser (Member, Author)

> > Use &str for Comment, String, and Identifier: The lexer doesn't perform any normalization on Strings, Comments, and Identifiers.
>
> Might be able to use &str for MagicCommand as well ;)

There's a subtle difference in how our lexer handles MagicCommand and String with respect to line continuations:

  • Magic command: omits the line continuation and the newline in the AST
  • String: the value retains the line continuation and the newline

To me, the behavior of magic commands seems more in line with how the lexer handles line continuations elsewhere (it omits them). But it does mean we need to use a Cow, because it is sometimes necessary to drop a few characters (see the sketch below).
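
A hypothetical sketch of what that Cow-based handling could look like (the helper name and exact behavior are illustrative, not this PR's code):

```rust
use std::borrow::Cow;

/// The value can usually borrow from the source, but dropping the characters
/// of a line continuation (backslash + newline) forces an owned copy.
fn magic_command_value(raw: &str) -> Cow<'_, str> {
    if !raw.contains('\\') {
        // Common case: no continuation, no allocation.
        return Cow::Borrowed(raw);
    }

    let mut value = String::with_capacity(raw.len());
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' && matches!(chars.peek(), Some(&'\n') | Some(&'\r')) {
            // Skip the backslash and the newline; treat `\r\n` as one break.
            if chars.next() == Some('\r') && chars.peek() == Some(&'\n') {
                chars.next();
            }
        } else {
            // A backslash not followed by a newline is copied through.
            value.push(c);
        }
    }
    Cow::Owned(value)
}
```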

@MichaReiser (Member, Author) commented on Jul 24, 2023

> I'm planning to leave the in-depth lexer review to @konstin since it looks like he's already read through it in detail, but can you tag me if there are any specific things you want input on? And/or if you feel another close review is needed?

I would be interested in your perspective on whether String should normalize newlines, and on changing the lexer to no longer return an Err for unbalanced parentheses (at least when there are more closing than opening parentheses).

@MichaReiser MichaReiser force-pushed the cursor-based-lexer branch 2 times, most recently from 86b8c39 to 7c74343 on July 24, 2023 13:46
@zanieb (Member) commented on Jul 24, 2023

Thanks for the killer summary! Looking forward to reading through this :)

@dhruvmanila (Member) left a comment

This is really good! All of the Jupyter-related logic is handled well!

parser/src/lexer.rs (resolved)
parser/src/lexer.rs (resolved)
@charliermarsh (Member)

> I would be interested in your perspective on whether String should normalize newlines, and on changing the lexer to no longer return an Err for unbalanced parentheses (at least when there are more closing than opening parentheses).

FWIW I support both of these changes (as long as the parser errors correctly on unbalanced parentheses as expected).

@MichaReiser MichaReiser merged commit 593b46b into main Jul 26, 2023
@MichaReiser MichaReiser deleted the cursor-based-lexer branch July 26, 2023 05:50