This repository has been archived by the owner on Jul 27, 2023. It is now read-only.

perf: Cursor based lexer #38

Merged
MichaReiser merged 6 commits into main from cursor-based-lexer on Jul 26, 2023

Conversation

@MichaReiser (Member) commented on Jul 23, 2023

This PR rewrites the lexer to use the Cursor abstraction that we use for lexing in Ruff. I aimed to port some of the improvements from #36 with (almost) no breaking changes to the public API.

The PR includes some further performance improvements that also simplified the refactoring:

  • Use Cursor instead of CharWindow: It has better ergonomics in my view, and it avoids keeping a 4-character lookahead buffer. I'm also fairly certain that Cursor is easier for LLVM to optimize than CharWindow (a sketch of the cursor, including the ASCII fast path from the next bullet, follows this list).
  • Use a fast path for ASCII-only characters: The old implementation checked, for every character, whether it is a valid Unicode identifier start. Performing the Unicode table lookup is unnecessary if we know that the character is ASCII-only.
  • The existing implementation uses a pending buffer to which it writes every token before returning it. The buffer always contains at least the next token, but it can contain all tokens from the end of the previous logical line up to the first token on the next logical line (newline, dedents/indents, comments, token). This results in an unnecessary write and read in the simple case where the buffer holds a single token, and in a shift of all remaining elements when it holds multiple, because the implementation appends at the back and removes from the front. The new implementation removes the Vec entirely and instead keeps an Option to track pending Dedents.
  • The existing Lexer supports lexing from an offset. I removed this and instead introduced a new wrapper Iterator that offsets the returned tokens and errors by the given start offset. This simplifies the Lexer because it no longer has to deal with both relative offsets (to index into the source string) and absolute offsets, and only callers that lex from an offset pay for the translation; lexing from the start has no overhead (a sketch of the wrapper also follows this list).
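
To make the cursor and fast-path bullets concrete, here is a minimal sketch (not this PR's actual code), assuming a rustc-style cursor API; `eat_char` and the `unicode-ident` dependency are illustrative stand-ins:

```rust
use std::str::Chars;

const EOF_CHAR: char = '\0';

pub struct Cursor<'a> {
    chars: Chars<'a>,
}

impl<'a> Cursor<'a> {
    pub fn new(source: &'a str) -> Self {
        Self { chars: source.chars() }
    }

    /// Peeks at the next character without consuming it. Cloning `Chars` is
    /// cheap (it copies two pointers), so no lookahead buffer is needed.
    pub fn first(&self) -> char {
        self.chars.clone().next().unwrap_or(EOF_CHAR)
    }

    /// Consumes and returns the next character.
    pub fn bump(&mut self) -> Option<char> {
        self.chars.next()
    }

    /// Consumes the next character if it matches `expected`.
    pub fn eat_char(&mut self, expected: char) -> bool {
        if self.first() == expected {
            self.bump();
            true
        } else {
            false
        }
    }
}

/// ASCII fast path for identifier starts: the Unicode table lookup only
/// happens for non-ASCII characters. `unicode_ident::is_xid_start` stands
/// in here for whatever Unicode lookup the lexer actually uses.
fn is_identifier_start(c: char) -> bool {
    if c.is_ascii() {
        c.is_ascii_alphabetic() || c == '_'
    } else {
        unicode_ident::is_xid_start(c)
    }
}
```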
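And a sketch of the offsetting wrapper from the last bullet, assuming the `text-size` crate's TextRange/TextSize types; the struct name and the simplified token/error types are made up for illustration (the real error type carries a location that would be shifted the same way):

```rust
use text_size::{TextRange, TextSize};

/// Simplified stand-ins for the lexer's real token and error types.
type Tok = ();
type LexResult = Result<(Tok, TextRange), String>;

/// Wraps a token iterator and shifts every returned range by a fixed start
/// offset, so the inner lexer only ever works with relative offsets.
struct SourceOffsetIterator<I> {
    inner: I,
    start_offset: TextSize,
}

impl<I> Iterator for SourceOffsetIterator<I>
where
    I: Iterator<Item = LexResult>,
{
    type Item = LexResult;

    fn next(&mut self) -> Option<Self::Item> {
        // Shift the token's range; lexing from the start never constructs
        // this wrapper and therefore pays no overhead.
        self.inner
            .next()
            .map(|result| result.map(|(tok, range)| (tok, range + self.start_offset)))
    }
}
```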

Public API changes

  • The existing implementation returns an error token, followed by the closing parenthesis, if the parentheses are unbalanced. This is no longer possible in the new implementation because I removed the pending buffer. I removed the error for unbalanced parentheses; this is a parser problem in my view. The error also came too late, because the lexer incorrectly assumed that it was in an unparenthesized context, handling and checking indents/dedents. But there isn't really anything we can do about that now... Ruff relies on the closing parenthesis being returned, e.g. when lexing what comes after the as keyword in with (a as ex): (the lexer only sees ex):, which has unbalanced parentheses).
  • The existing implementation normalized newlines to \n inside strings. Re-introducing the normalization shouldn't be hard, and it is kind of neat. However, it means that round-trip parsing changes the line breaks in strings... I don't think we want this.
  • The implementation now supports mixed tab and space indentation, using (hopefully) the same algorithm as CPython; a sketch of that algorithm follows this list.
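
For reference, a sketch of the indentation comparison CPython's tokenizer performs (names here are illustrative; CPython keeps the equivalent pair of widths as tabsize 8 and alttabsize 1):

```rust
use std::cmp::Ordering;

/// Each indentation is measured twice: `column` with tabs advancing to the
/// next multiple of 8, and `alt_column` with a tab counting as one column.
#[derive(Copy, Clone, Debug, Default, PartialEq, Eq)]
struct Indentation {
    column: u32,
    alt_column: u32,
}

impl Indentation {
    /// Accounts for one indentation character (only ' ' and '\t' occur here).
    fn add_char(self, c: char) -> Self {
        match c {
            '\t' => Self {
                column: (self.column / 8 + 1) * 8,
                alt_column: self.alt_column + 1,
            },
            _ => Self {
                column: self.column + 1,
                alt_column: self.alt_column + 1,
            },
        }
    }

    /// Two indentations compare consistently only if both measurements agree.
    /// `None` corresponds to CPython's "inconsistent use of tabs and spaces
    /// in indentation" (a TabError).
    fn try_compare(self, other: Self) -> Option<Ordering> {
        let ord = self.column.cmp(&other.column);
        if ord == self.alt_column.cmp(&other.alt_column) {
            Some(ord)
        } else {
            None
        }
    }
}
```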

Future improvements

  • Use &str for Comment, String, and Identifier: The lexer doesn't perform any normalization on strings, comments, and identifiers. We can therefore just store a &str referencing the content in the source instead of allocating a String for each of them. This should greatly improve performance because identifiers are probably the most common tokens besides whitespace. A later step could then be to propagate this change to the AST too (see the sketch below).
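
A sketch of what those borrowed tokens could look like (variant shapes are illustrative, not this PR's code):

```rust
/// Borrowed token kinds: since the lexer performs no normalization on these,
/// they can borrow a slice of the source text instead of allocating a
/// `String` per token.
enum Tok<'src> {
    Identifier(&'src str),
    Comment(&'src str),
    String { value: &'src str },
    // ... remaining (owned) variants unchanged
}
```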

Performance

I used Ruff's benchmarks, which measure more than just lexing; even the parser benchmark measures lexing, parsing, and traversing the AST. The parser benchmark improvement ranges from 20% to 36% in wall time. This is huge, considering that the lexer is only a small portion of what the benchmark measures.

group                                      cursor                                   main
-----                                      ----                                   -----
formatter/large/dataset.py                 1.00      4.1±0.03ms     9.9 MB/sec    1.09      4.5±0.02ms     9.1 MB/sec
formatter/numpy/ctypeslib.py               1.00    808.7±1.41µs    20.6 MB/sec    1.12    905.0±3.66µs    18.4 MB/sec
formatter/numpy/globals.py                 1.00     76.2±0.33µs    38.7 MB/sec    1.19     90.5±0.24µs    32.6 MB/sec
formatter/pydantic/types.py                1.00   1735.7±2.46µs    14.7 MB/sec    1.11   1929.0±3.60µs    13.2 MB/sec
linter/all-rules/large/dataset.py          1.00      5.7±0.03ms     7.1 MB/sec    1.07      6.1±0.03ms     6.7 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00   1447.3±4.28µs    11.5 MB/sec    1.07   1544.0±2.48µs    10.8 MB/sec
linter/all-rules/numpy/globals.py          1.00    135.1±0.81µs    21.8 MB/sec    1.09    146.8±0.58µs    20.1 MB/sec
linter/all-rules/pydantic/types.py         1.00      2.6±0.01ms     9.9 MB/sec    1.09      2.8±0.02ms     9.1 MB/sec
linter/default-rules/large/dataset.py      1.00      2.9±0.00ms    14.2 MB/sec    1.13      3.2±0.02ms    12.5 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00    557.5±6.95µs    29.9 MB/sec    1.18   658.2±15.02µs    25.3 MB/sec
linter/default-rules/numpy/globals.py      1.00     58.6±0.29µs    50.4 MB/sec    1.24     72.6±0.97µs    40.6 MB/sec
linter/default-rules/pydantic/types.py     1.00   1193.2±5.02µs    21.4 MB/sec    1.18  1410.9±26.96µs    18.1 MB/sec
parser/large/dataset.py                    1.00      2.2±0.01ms    18.7 MB/sec    1.19      2.6±0.01ms    15.7 MB/sec
parser/numpy/ctypeslib.py                  1.00    398.4±0.55µs    41.8 MB/sec    1.28    509.6±1.48µs    32.7 MB/sec
parser/numpy/globals.py                    1.00     40.6±0.54µs    72.6 MB/sec    1.36     55.3±0.08µs    53.4 MB/sec
parser/pydantic/types.py                   1.00    858.2±4.58µs    29.7 MB/sec    1.27   1088.0±2.35µs    23.4 MB/sec

Roll out

This is the part I'm most scared about. I definitely want to do another careful review of the changes myself. Does the ecosystem check report new syntax errors?

I also need to incorporate the changes from the other open PRs (magic commands).

Other changes

I used this as a chance to remove the unused features... They got in my way ;)

@MichaReiser (Member, Author)

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

@MichaReiser (Member, Author)

I'm opening this for review. It's not done done but I'm interested in getting some feedback.

@MichaReiser MichaReiser marked this pull request as ready for review July 23, 2023 10:08
@MichaReiser MichaReiser requested a review from dhruvmanila July 23, 2023 10:17
@MichaReiser MichaReiser force-pushed the cursor-based-lexer branch 2 times, most recently from c6fca41 to 930324e on July 23, 2023 10:48
parser/src/gen/parse.rs (outdated, resolved)
parser/src/lexer/cursor.rs (outdated, resolved)
parser/src/lexer/indentation.rs (resolved)
parser/src/lexer/indentation.rs (outdated, resolved)
parser/src/lexer/indentation.rs (outdated, resolved)
parser/src/lexer.rs (outdated, resolved)
Ok((Tok::Newline, TextRange::empty(self.offset())))
}
// Next, flush the indentation stack to zero.
else if self.indentations.pop().is_some() {
Member
Is it intentional that this is not a while loop anymore?

Member Author
Yes, because the pending stack is gone.

The existing implementation writes any pending newline and dedents to the pending vector and then pops them off one by one in the next method.

The new implementation relies on the loop around next. It first returns the newline (and updates its internal state). The lexer is still at the end of the file on the next call and returns the dedents one at a time until the stack is empty. It then finally returns the end-of-file token (forever).
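
A sketch of that end-of-file control flow with simplified stand-in types (not the PR's actual code):

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    Newline,
    Dedent,
    EndOfFile,
}

struct EofState {
    /// Remaining indentation levels still to be closed.
    indentations: Vec<u32>,
    newline_emitted: bool,
}

impl EofState {
    /// Called once per `next()` while the lexer sits at end of file: first
    /// the trailing logical newline, then one dedent per call until the
    /// indentation stack is empty, then `EndOfFile` forever.
    fn next(&mut self) -> Tok {
        if !self.newline_emitted {
            self.newline_emitted = true;
            Tok::Newline
        } else if self.indentations.pop().is_some() {
            Tok::Dedent
        } else {
            Tok::EndOfFile
        }
    }
}
```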

parser/src/token.rs (outdated, resolved)
parser/src/token.rs (outdated, resolved)
parser/build.rs (resolved)
@charliermarsh (Member)

Wow, that is a ridiculously good PR summary (both well-written, and the change itself has clear, enormous benefits).

@charliermarsh (Member)

> Does the ecosystem check report new syntax errors?

It should, because we track a diagnostic for those (E999), and we'd expect to see some other diagnostics "disappear".

@charliermarsh (Member)

I'm planning to leave the in-depth lexer review to @konstin since it looks like he's already read through it in detail, but can you tag me if there are any specific things you want input on? And/or if you feel another close review is needed?

@dhruvmanila (Member) left a comment

This is amazing work! I mainly have a few comments around the Jupyter magic command handling, but otherwise this looks good.

parser/src/lexer/indentation.rs (outdated, resolved)
parser/src/lexer/cursor.rs (outdated, resolved)
parser/src/lexer.rs (outdated, resolved)
parser/src/lexer.rs (outdated, resolved)
@dhruvmanila (Member)

> Use &str for Comment, String, and Identifier: The lexer doesn't perform any normalization on Strings, Comments, and Identifiers.

Might be able to use &str for MagicCommand as well ;)

@MichaReiser (Member, Author)

> > Use &str for Comment, String, and Identifier: The lexer doesn't perform any normalization on Strings, Comments, and Identifiers.
>
> Might be able to use &str for MagicCommand as well ;)

There's a subtle difference in how our lexer handles MagicCommand and String with respect to line continuations:

  • Magic command: omits the line continuation and the newline in the AST
  • String: the value retains the line continuation and the newline

To me, the behavior of magic commands seems more in line with how the lexer handles line continuations elsewhere (it omits them). But it does mean we need to use a Cow, because it is sometimes necessary to drop a few characters (see the sketch below).
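
A hypothetical sketch of what that Cow-based handling could look like (the helper name and exact behavior are illustrative, not this PR's code):

```rust
use std::borrow::Cow;

/// The value can usually borrow from the source, but dropping the characters
/// of a line continuation (backslash + newline) forces an owned copy.
fn magic_command_value(raw: &str) -> Cow<'_, str> {
    if !raw.contains('\\') {
        // Common case: no continuation, no allocation.
        return Cow::Borrowed(raw);
    }

    let mut value = String::with_capacity(raw.len());
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' && matches!(chars.peek(), Some(&'\n') | Some(&'\r')) {
            // Skip the backslash and the newline; treat `\r\n` as one break.
            if chars.next() == Some('\r') && chars.peek() == Some(&'\n') {
                chars.next();
            }
        } else {
            // A backslash not followed by a newline is copied through.
            value.push(c);
        }
    }
    Cow::Owned(value)
}
```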

@MichaReiser (Member, Author) commented on Jul 24, 2023

> I'm planning to leave the in-depth lexer review to @konstin since it looks like he's already read through it in detail, but can you tag me if there are any specific things you want input on? And/or if you feel another close review is needed?

I would be interested in your perspective on whether String should normalize newlines, and on changing the lexer to no longer return an Err for unbalanced parentheses (at least when there are more closing than opening parentheses).

@MichaReiser MichaReiser force-pushed the cursor-based-lexer branch 2 times, most recently from 86b8c39 to 7c74343 on July 24, 2023 13:46
@zanieb (Member) commented on Jul 24, 2023

Thanks for the killer summary! Looking forward to reading through this :)

@dhruvmanila (Member) left a comment

This is really good! All of the Jupyter-related logic is handled well!

parser/src/lexer.rs (resolved)
parser/src/lexer.rs (resolved)
@charliermarsh (Member)

> I would be interested in your perspective on whether String should normalize newlines, and on changing the lexer to no longer return an Err for unbalanced parentheses (at least when there are more closing than opening parentheses).

FWIW I support both of these changes (as long as the parser errors correctly on unbalanced parentheses as expected).

@MichaReiser MichaReiser merged commit 593b46b into main Jul 26, 2023
@MichaReiser MichaReiser deleted the cursor-based-lexer branch July 26, 2023 05:50