-
Notifications
You must be signed in to change notification settings - Fork 2
Conversation
As a compromise, one thing we could do is avoid storing ranges on the internal |
## Summary At present, when we store a binding, we include a `TextRange` alongside it. The `TextRange` _sometimes_ matches the exact range of the identifier to which the `Binding` is linked, but... not always. For example, given: ```python x = 1 ``` The binding we create _will_ use the range of `x`, because the left-hand side is an `Expr::Name`, which has a valid range on it. However, given: ```python try: pass except ValueError as e: pass ``` When we create a binding for `e`, we don't have a `TextRange`... The AST doesn't give us one. So we end up extracting it via lexing. This PR extends that pattern to the rest of the binding kinds, to ensure that whenever we create a binding, we always use the range of the bound name. This leads to better diagnostics in cases like pattern matching, whereby the diagnostic for "unused variable `x`" here used to include `*x`, instead of just `x`: ```python def f(provided: int) -> int: match provided: case [_, *x]: pass ``` This is _also_ required for symbol renames, since we track writes as bindings -- so we need to know the ranges of the bound symbols. By storing these bindings precisely, we can also remove the `binding.trimmed_range` abstraction -- since bindings already use the "trimmed range". To implement this behavior, I took some of our existing utilities (like the code we had for `except ValueError as e` above), migrated them from a full lexer to a zero-allocation lexer that _only_ identifies "identifiers", and moved the behavior into a trait, so we can now do `stmt.identifier(locator)` to get the range for the identifier. Honestly, we might end up discarding much of this if we decide to put ranges on all identifiers (astral-sh/RustPython-Parser#8). But even if we do, this will _still_ be a good change, because the lexer introduced here is useful beyond names (e.g., we use it find the `except` keyword in an exception handler, to find the `else` after a `for` loop, and so on). So, I'm fine committing this even if we end up changing our minds about the right approach. Closes #5090. ## Benchmarks No significant change, with one statistically significant improvement (-2.1654% on `linter/all-rules/large/dataset.py`): ``` linter/default-rules/numpy/globals.py time: [73.922 µs 73.955 µs 73.986 µs] thrpt: [39.882 MiB/s 39.898 MiB/s 39.916 MiB/s] change: time: [-0.5579% -0.4732% -0.3980%] (p = 0.00 < 0.05) thrpt: [+0.3996% +0.4755% +0.5611%] Change within noise threshold. Found 6 outliers among 100 measurements (6.00%) 4 (4.00%) low severe 1 (1.00%) low mild 1 (1.00%) high mild linter/default-rules/pydantic/types.py time: [1.4909 ms 1.4917 ms 1.4926 ms] thrpt: [17.087 MiB/s 17.096 MiB/s 17.106 MiB/s] change: time: [+0.2140% +0.2741% +0.3392%] (p = 0.00 < 0.05) thrpt: [-0.3380% -0.2734% -0.2136%] Change within noise threshold. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe linter/default-rules/numpy/ctypeslib.py time: [688.97 µs 691.34 µs 694.15 µs] thrpt: [23.988 MiB/s 24.085 MiB/s 24.168 MiB/s] change: time: [-1.3282% -0.7298% -0.1466%] (p = 0.02 < 0.05) thrpt: [+0.1468% +0.7351% +1.3461%] Change within noise threshold. Found 15 outliers among 100 measurements (15.00%) 1 (1.00%) low mild 2 (2.00%) high mild 12 (12.00%) high severe linter/default-rules/large/dataset.py time: [3.3872 ms 3.4032 ms 3.4191 ms] thrpt: [11.899 MiB/s 11.954 MiB/s 12.011 MiB/s] change: time: [-0.6427% -0.2635% +0.0906%] (p = 0.17 > 0.05) thrpt: [-0.0905% +0.2642% +0.6469%] No change in performance detected. Found 20 outliers among 100 measurements (20.00%) 1 (1.00%) low severe 2 (2.00%) low mild 4 (4.00%) high mild 13 (13.00%) high severe linter/all-rules/numpy/globals.py time: [148.99 µs 149.21 µs 149.42 µs] thrpt: [19.748 MiB/s 19.776 MiB/s 19.805 MiB/s] change: time: [-0.7340% -0.5068% -0.2778%] (p = 0.00 < 0.05) thrpt: [+0.2785% +0.5094% +0.7395%] Change within noise threshold. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high severe linter/all-rules/pydantic/types.py time: [3.0362 ms 3.0396 ms 3.0441 ms] thrpt: [8.3779 MiB/s 8.3903 MiB/s 8.3997 MiB/s] change: time: [-0.0957% +0.0618% +0.2125%] (p = 0.45 > 0.05) thrpt: [-0.2121% -0.0618% +0.0958%] No change in performance detected. Found 11 outliers among 100 measurements (11.00%) 1 (1.00%) low severe 3 (3.00%) low mild 5 (5.00%) high mild 2 (2.00%) high severe linter/all-rules/numpy/ctypeslib.py time: [1.6879 ms 1.6894 ms 1.6909 ms] thrpt: [9.8478 MiB/s 9.8562 MiB/s 9.8652 MiB/s] change: time: [-0.2279% -0.0888% +0.0436%] (p = 0.18 > 0.05) thrpt: [-0.0435% +0.0889% +0.2284%] No change in performance detected. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) low mild 1 (1.00%) high severe linter/all-rules/large/dataset.py time: [7.1520 ms 7.1586 ms 7.1654 ms] thrpt: [5.6777 MiB/s 5.6831 MiB/s 5.6883 MiB/s] change: time: [-2.5626% -2.1654% -1.7780%] (p = 0.00 < 0.05) thrpt: [+1.8102% +2.2133% +2.6300%] Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high mild ```
Having the ranges on the identifier will simplify the formatter (and improve its performance) because it allows us to use the source text slice without any lexing for classes, functions, and potentially other nodes. I'm only worried about any potential performance penalty because it happens to increase the size of An alternative could be to have a |
parser/src/function.rs
Outdated
@@ -21,7 +21,13 @@ type ParameterDef = (ast::Arg, Option<ast::Expr>); | |||
pub(crate) fn validate_arguments( | |||
arguments: ast::Arguments, | |||
) -> Result<ast::Arguments, LexicalError> { | |||
let mut all_args: Vec<&ast::Arg> = vec![]; | |||
let mut all_args: Vec<&ast::Arg> = Vec::with_capacity( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sneaky changes ;)
let mut r = n.to_string(); | ||
for x in n2 { | ||
r.push('.'); | ||
r.push_str(x.1.as_str()); | ||
} | ||
ast::Identifier::new(r) | ||
ast::Identifier::new(r, (location..end_location).into()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry what... It's weird that a.b.c
gets parsed as a single identifier instead of a MemberName
node that has an object
and property
node where object
is a MemberName
or Identifier
and property
is an Identifier
.
Yeah, I think we should (at the very least) only use this for identifiers that aren't already ranged. I'll make that change. I believe I'm aware of every case:
|
abf3ef9
to
c767f10
Compare
c767f10
to
a23c23b
Compare
I change the implementation to intentionally exclude |
4c88e59
to
344a2c3
Compare
ab13b1a
to
9c43be0
Compare
ea1106a
to
f1433f7
Compare
f1433f7
to
871f369
Compare
## Summary In astral-sh/RustPython-Parser#8, we modified RustPython to include ranges for any identifiers that aren't `Expr::Name` (which already has an identifier). For example, the `e` in `except ValueError as e` was previously un-ranged. To extract its range, we had to do some lexing of our own. This change should improve performance and let us remove a bunch of code. ## Test Plan `cargo test`
This PR adds `TextRange` to `Identifier`. Right now, the AST only includes ranges for identifiers in certain cases (`Expr::Name`, `Keyword`, etc.), namely when the identifier comprises an entire AST node. In Ruff, we do additional ad-hoc lexing to extract identifiers from source code. One frequent example: given a function `def f(): ...`, we lex to find the range of `f`, for use in diagnostics. Another: `except ValueError as e`, for which the AST doesn't include a range for `e`. Note that, as an optimization, we avoid storing the `TextRange` for `Expr::Name`, since it's already included.
This PR adds `TextRange` to `Identifier`. Right now, the AST only includes ranges for identifiers in certain cases (`Expr::Name`, `Keyword`, etc.), namely when the identifier comprises an entire AST node. In Ruff, we do additional ad-hoc lexing to extract identifiers from source code. One frequent example: given a function `def f(): ...`, we lex to find the range of `f`, for use in diagnostics. Another: `except ValueError as e`, for which the AST doesn't include a range for `e`. Note that, as an optimization, we avoid storing the `TextRange` for `Expr::Name`, since it's already included.
This PR adds `TextRange` to `Identifier`. Right now, the AST only includes ranges for identifiers in certain cases (`Expr::Name`, `Keyword`, etc.), namely when the identifier comprises an entire AST node. In Ruff, we do additional ad-hoc lexing to extract identifiers from source code. One frequent example: given a function `def f(): ...`, we lex to find the range of `f`, for use in diagnostics. Another: `except ValueError as e`, for which the AST doesn't include a range for `e`. Note that, as an optimization, we avoid storing the `TextRange` for `Expr::Name`, since it's already included.
Summary
This PR adds
TextRange
toIdentifier
. Right now, the AST only includes ranges for identifiers in certain cases (Expr::Name
,Keyword
, etc.), namely when the identifier comprises an entire AST node. In Ruff, we do additional ad-hoc lexing to extract identifiers from source code.One frequent example: given a function
def f(): ...
, we lex to find the range off
, for use in diagnostics.Another:
except ValueError as e
, for which the AST doesn't include a range fore
.Note that, as an optimization, we avoid storing the
TextRange
forExpr::Name
, since it's already included.