Conversation
Force-pushed 7cbec09 to 9a4f47f
I'm opening this for review. It's not done done but I'm interested in getting some feedback.
Force-pushed 9a4f47f to 287b4c2
Force-pushed c6fca41 to 930324e
```rust
Ok((Tok::Newline, TextRange::empty(self.offset())))
}
// Next, flush the indentation stack to zero.
else if self.indentations.pop().is_some() {
```
is it intentional that this is not a while loop anymore?
Yes, because the pending stack is gone.
The existing implementation writes any pending newline and dedents to the pending vector and then pops them one by one in the `next` method.
The new implementation relies on the loop around `next`: it first returns the newline (and changes its internal state). On the next call, the lexer is still at the end of the file, so it returns the dedents one at a time until the stack is empty. It then finally returns the end-of-file token (forever). A sketch of this shape is below.
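For illustration, a minimal sketch of that control flow; the names (`Tok`, `State`, `next_at_eof`) are illustrative, not the PR's actual API:

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    Newline,
    Dedent,
    EndOfFile,
}

enum State {
    AtEof,     // first call at end of file: emit a single Newline
    Dedenting, // then pop one indentation level per call
}

struct Lexer {
    indentations: Vec<u32>, // the indentation stack
    state: State,
}

impl Lexer {
    fn next_at_eof(&mut self) -> Tok {
        match self.state {
            State::AtEof => {
                // First call: return the trailing newline and change state.
                self.state = State::Dedenting;
                Tok::Newline
            }
            State::Dedenting => {
                // One Dedent per call until the stack is empty,
                // then EndOfFile forever.
                if self.indentations.pop().is_some() {
                    Tok::Dedent
                } else {
                    Tok::EndOfFile
                }
            }
        }
    }
}

fn main() {
    let mut lexer = Lexer { indentations: vec![4, 8], state: State::AtEof };
    assert_eq!(lexer.next_at_eof(), Tok::Newline);
    assert_eq!(lexer.next_at_eof(), Tok::Dedent);
    assert_eq!(lexer.next_at_eof(), Tok::Dedent);
    assert_eq!(lexer.next_at_eof(), Tok::EndOfFile);
    assert_eq!(lexer.next_at_eof(), Tok::EndOfFile); // forever
}
```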
Wow, that is a ridiculously good PR summary (both well-written, and the enormous benefits of the change itself come through clearly).
It should because we track a diagnostic for those (E999), and we'd expect to see some other diagnostics "disappear".
I'm planning to leave the in-depth lexer review to @konstin since it looks like he's already read through it in detail, but can you tag me if there are any specific things you want input on? And/or if you feel another close review is needed?
This is amazing work! I mainly have a few comments around Jupyter magic commands, but otherwise this looks good.
Might be able to use …
There's a subtle difference between how our lexer handles …
To me, the behavior of magic commands seems more in line with how the lexer handles line continuations (it omits them). But it does mean we need to use a `Cow`, because it is sometimes necessary to drop a few characters. A sketch of the idea is below.
I would be interested in your perspective on whether …
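For illustration, a minimal sketch of the `Cow` approach being discussed; `strip_continuations` is a hypothetical helper (not the PR's API) that only handles backslash-newline continuations:

```rust
use std::borrow::Cow;

// If the lexed text contains characters that must be dropped (here, `\`
// line continuations), we have to allocate; otherwise we can borrow
// straight from the source without allocating.
fn strip_continuations(source: &str) -> Cow<'_, str> {
    if !source.contains("\\\n") {
        // Common case: nothing to drop, no allocation.
        return Cow::Borrowed(source);
    }
    // Rare case: allocate and drop the continuation characters.
    Cow::Owned(source.replace("\\\n", ""))
}

fn main() {
    assert!(matches!(strip_continuations("abc"), Cow::Borrowed(_)));
    assert_eq!(strip_continuations("ab\\\ncd"), "abcd");
}
```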
Force-pushed e21cc87 to e456660
Force-pushed 86b8c39 to 7c74343
Thanks for the killer summary! Looking forward to reading through this :)
Force-pushed 7c74343 to 46d940f
This is really good! All of the Jupyter-related logic is handled well!
FWIW I support both of these changes (as long as the parser errors correctly on unbalanced parentheses as expected).
Force-pushed 46d940f to 36e75e1
This PR rewrites the lexer to use the `Cursor` abstraction that we use for lexing in Ruff. I aimed to port some improvements from #36 without (almost) breaking the public API. The PR includes some further performance improvements that simplified the refactoring:

- Use `Cursor` instead of `CharWindow`: It has better ergonomics in my view, and it avoids keeping a 4-character lookahead buffer. I'm also fairly certain that `Cursor` is easier for LLVM to optimize than `CharWindow`.
- Remove the `pending` buffer: The old lexer writes every token to a `pending` buffer before returning it. The pending buffer always contained at least the next token, but could contain all tokens from the end of the previous logical line to the first token on the next logical line (newline, dedents/indents, comments, token). This resulted in an unnecessary write and read in the simple case where the buffer contained a single token, and in a move of the whole vec when it contained multiple elements, because the implementation appends to the back and removes from the front. The new implementation removes the `Vec` entirely and instead keeps an `Option` to track pending `Dedent`s.
- Lexing from a start offset is now an `Iterator` that offsets the returned tokens and errors by the given start offset (see the sketch after this list). This simplifies the `Lexer` because it doesn't have to deal with both relative (to index into the source string) and absolute offsets. Paying the cost of offsetting only in that wrapper has the advantage that lexing from the start doesn't pay the overhead.
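A minimal sketch of such an offset-adjusting iterator; `SpannedToken` and `OffsetIter` are illustrative names, not the actual API:

```rust
// A stand-in for the real spanned token type.
#[derive(Debug)]
struct SpannedToken {
    kind: &'static str,
    start: u32,
    end: u32,
}

// Wraps an inner lexer that only produces relative offsets and shifts
// them into absolute positions.
struct OffsetIter<I> {
    inner: I,
    start_offset: u32,
}

impl<I: Iterator<Item = SpannedToken>> Iterator for OffsetIter<I> {
    type Item = SpannedToken;

    fn next(&mut self) -> Option<SpannedToken> {
        // The inner lexer never sees absolute offsets; we add them here once.
        self.inner.next().map(|tok| SpannedToken {
            start: tok.start + self.start_offset,
            end: tok.end + self.start_offset,
            ..tok
        })
    }
}

fn main() {
    let tokens = vec![SpannedToken { kind: "Def", start: 0, end: 3 }];
    let mut lexed = OffsetIter { inner: tokens.into_iter(), start_offset: 10 };
    let tok = lexed.next().unwrap();
    assert_eq!((tok.start, tok.end), (10, 13));
}
```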
Public API changes

- The `as` keyword in `with (a as ex):` is handled differently when lexing starts mid-statement (the lexer only sees `ex):`, which has unbalanced parentheses).
- Line breaks are no longer normalized to `\n` inside of strings. Re-introducing the normalization shouldn't be hard, and it is kind of neat. However, it means that roundtrip parsing changes the line breaks in strings... I don't think we want this.
Future improvements

- Use `&str` for Comment, String, and Identifier: The lexer doesn't perform any normalization on strings, comments, and identifiers. We can, therefore, just store a `&str` referencing the content in the source instead of allocating a string for each of them. This should greatly improve performance because identifiers are probably the most common token besides whitespace. A later step could then be to propagate this change to the AST too. (A sketch of the idea follows.)
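A minimal sketch of what borrowed token payloads could look like; the lifetime-parameterized `Tok` and `lex_identifier` here are illustrative, not the crate's actual types:

```rust
// Token payloads borrow from the source instead of owning a String.
#[derive(Debug)]
enum Tok<'src> {
    Identifier(&'src str), // borrows from the source, no allocation
    Comment(&'src str),
    String(&'src str),
}

fn lex_identifier<'src>(source: &'src str, start: usize, end: usize) -> Tok<'src> {
    // Slicing the source is free; an owned String would allocate per token.
    Tok::Identifier(&source[start..end])
}

fn main() {
    let source = "print(hello)";
    let tok = lex_identifier(source, 6, 11);
    println!("{tok:?}"); // Identifier("hello")
}
```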
Performance

I used Ruff's benchmarks, which measure more than just lexing. Even the parser benchmark measures lexing, parsing, and traversing an AST. The parser benchmark improvement ranges from a 20% to a 36% wall-time improvement. This is huge, considering that the lexer is only a small portion of what the benchmark measures.
Roll out
This is the part I'm most scared about. I definitely want to make another careful review of the changes myself. Does the ecosystem check report new syntax errors?
I also need to incorporate the changes from the other open PRs (magic commands).
Other changes
I used this as a chance to remove the unused features... They got in my way ;)