Add `TextRange` to `Identifier` #8

charliermarsh · 2023-06-14T21:59:49Z

Summary

This PR adds TextRange to Identifier. Right now, the AST only includes ranges for identifiers in certain cases (Expr::Name, Keyword, etc.), namely when the identifier comprises an entire AST node. In Ruff, we do additional ad-hoc lexing to extract identifiers from source code.

One frequent example: given a function def f(): ..., we lex to find the range of f, for use in diagnostics.

Another: except ValueError as e, for which the AST doesn't include a range for e.

Note that, as an optimization, we avoid storing the TextRange for Expr::Name, since it's already included.

charliermarsh · 2023-06-15T15:29:39Z

As a compromise, one thing we could do is avoid storing ranges on the internal Identifier in Expr::Name, since that always matches the range of the expression. So we'd store ranges for (e.g.) the Identifier in except ValueError as e, but not on the Identifier in x = 1. That would avoid any duplication.

## Summary At present, when we store a binding, we include a `TextRange` alongside it. The `TextRange` _sometimes_ matches the exact range of the identifier to which the `Binding` is linked, but... not always. For example, given: ```python x = 1 ``` The binding we create _will_ use the range of `x`, because the left-hand side is an `Expr::Name`, which has a valid range on it. However, given: ```python try: pass except ValueError as e: pass ``` When we create a binding for `e`, we don't have a `TextRange`... The AST doesn't give us one. So we end up extracting it via lexing. This PR extends that pattern to the rest of the binding kinds, to ensure that whenever we create a binding, we always use the range of the bound name. This leads to better diagnostics in cases like pattern matching, whereby the diagnostic for "unused variable `x`" here used to include `*x`, instead of just `x`: ```python def f(provided: int) -> int: match provided: case [_, *x]: pass ``` This is _also_ required for symbol renames, since we track writes as bindings -- so we need to know the ranges of the bound symbols. By storing these bindings precisely, we can also remove the `binding.trimmed_range` abstraction -- since bindings already use the "trimmed range". To implement this behavior, I took some of our existing utilities (like the code we had for `except ValueError as e` above), migrated them from a full lexer to a zero-allocation lexer that _only_ identifies "identifiers", and moved the behavior into a trait, so we can now do `stmt.identifier(locator)` to get the range for the identifier. Honestly, we might end up discarding much of this if we decide to put ranges on all identifiers (astral-sh/RustPython-Parser#8). But even if we do, this will _still_ be a good change, because the lexer introduced here is useful beyond names (e.g., we use it find the `except` keyword in an exception handler, to find the `else` after a `for` loop, and so on). So, I'm fine committing this even if we end up changing our minds about the right approach. Closes #5090. ## Benchmarks No significant change, with one statistically significant improvement (-2.1654% on `linter/all-rules/large/dataset.py`): ``` linter/default-rules/numpy/globals.py time: [73.922 µs 73.955 µs 73.986 µs] thrpt: [39.882 MiB/s 39.898 MiB/s 39.916 MiB/s] change: time: [-0.5579% -0.4732% -0.3980%] (p = 0.00 < 0.05) thrpt: [+0.3996% +0.4755% +0.5611%] Change within noise threshold. Found 6 outliers among 100 measurements (6.00%) 4 (4.00%) low severe 1 (1.00%) low mild 1 (1.00%) high mild linter/default-rules/pydantic/types.py time: [1.4909 ms 1.4917 ms 1.4926 ms] thrpt: [17.087 MiB/s 17.096 MiB/s 17.106 MiB/s] change: time: [+0.2140% +0.2741% +0.3392%] (p = 0.00 < 0.05) thrpt: [-0.3380% -0.2734% -0.2136%] Change within noise threshold. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe linter/default-rules/numpy/ctypeslib.py time: [688.97 µs 691.34 µs 694.15 µs] thrpt: [23.988 MiB/s 24.085 MiB/s 24.168 MiB/s] change: time: [-1.3282% -0.7298% -0.1466%] (p = 0.02 < 0.05) thrpt: [+0.1468% +0.7351% +1.3461%] Change within noise threshold. Found 15 outliers among 100 measurements (15.00%) 1 (1.00%) low mild 2 (2.00%) high mild 12 (12.00%) high severe linter/default-rules/large/dataset.py time: [3.3872 ms 3.4032 ms 3.4191 ms] thrpt: [11.899 MiB/s 11.954 MiB/s 12.011 MiB/s] change: time: [-0.6427% -0.2635% +0.0906%] (p = 0.17 > 0.05) thrpt: [-0.0905% +0.2642% +0.6469%] No change in performance detected. Found 20 outliers among 100 measurements (20.00%) 1 (1.00%) low severe 2 (2.00%) low mild 4 (4.00%) high mild 13 (13.00%) high severe linter/all-rules/numpy/globals.py time: [148.99 µs 149.21 µs 149.42 µs] thrpt: [19.748 MiB/s 19.776 MiB/s 19.805 MiB/s] change: time: [-0.7340% -0.5068% -0.2778%] (p = 0.00 < 0.05) thrpt: [+0.2785% +0.5094% +0.7395%] Change within noise threshold. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high severe linter/all-rules/pydantic/types.py time: [3.0362 ms 3.0396 ms 3.0441 ms] thrpt: [8.3779 MiB/s 8.3903 MiB/s 8.3997 MiB/s] change: time: [-0.0957% +0.0618% +0.2125%] (p = 0.45 > 0.05) thrpt: [-0.2121% -0.0618% +0.0958%] No change in performance detected. Found 11 outliers among 100 measurements (11.00%) 1 (1.00%) low severe 3 (3.00%) low mild 5 (5.00%) high mild 2 (2.00%) high severe linter/all-rules/numpy/ctypeslib.py time: [1.6879 ms 1.6894 ms 1.6909 ms] thrpt: [9.8478 MiB/s 9.8562 MiB/s 9.8652 MiB/s] change: time: [-0.2279% -0.0888% +0.0436%] (p = 0.18 > 0.05) thrpt: [-0.0435% +0.0889% +0.2284%] No change in performance detected. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) low mild 1 (1.00%) high severe linter/all-rules/large/dataset.py time: [7.1520 ms 7.1586 ms 7.1654 ms] thrpt: [5.6777 MiB/s 5.6831 MiB/s 5.6883 MiB/s] change: time: [-2.5626% -2.1654% -1.7780%] (p = 0.00 < 0.05) thrpt: [+1.8102% +2.2133% +2.6300%] Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high mild ```

MichaReiser · 2023-06-19T07:36:04Z

Having the ranges on the identifier will simplify the formatter (and improve its performance) because it allows us to use the source text slice without any lexing for classes, functions, and potentially other nodes.

I'm only worried about any potential performance penalty because it happens to increase the size of Stmt by multiple bytes.

An alternative could be to have a Span<T> type that adds a range to a type. We could use it in the few places where the Identifier's range isn't known from its parent node.

MichaReiser · 2023-06-19T07:23:57Z

parser/src/function.rs

@@ -21,7 +21,13 @@ type ParameterDef = (ast::Arg, Option<ast::Expr>);
 pub(crate) fn validate_arguments(
    arguments: ast::Arguments,
 ) -> Result<ast::Arguments, LexicalError> {
-    let mut all_args: Vec<&ast::Arg> = vec![];
+    let mut all_args: Vec<&ast::Arg> = Vec::with_capacity(


Sneaky changes ;)

MichaReiser · 2023-06-19T07:27:44Z

parser/src/python.lalrpop

        let mut r = n.to_string();
        for x in n2 {
            r.push('.');
            r.push_str(x.1.as_str());
        }
-        ast::Identifier::new(r)
+        ast::Identifier::new(r, (location..end_location).into())


I'm sorry what... It's weird that a.b.c gets parsed as a single identifier instead of a MemberName node that has an object and property node where object is a MemberName or Identifier and property is an Identifier.

charliermarsh · 2023-06-19T13:09:37Z

Yeah, I think we should (at the very least) only use this for identifiers that aren't already ranged. I'll make that change. I believe I'm aware of every case:

Class names
Function names
Named exceptions (except ValueError as e)
Aliases (import foo as bar, where foo as bar has a range, but foo and bar do not)
Capture variants in pattern-matching (case *x:)

charliermarsh · 2023-06-19T15:31:34Z

I change the implementation to intentionally exclude Expr::Name which I think is the only case in which we don't need this.

## Summary In astral-sh/RustPython-Parser#8, we modified RustPython to include ranges for any identifiers that aren't `Expr::Name` (which already has an identifier). For example, the `e` in `except ValueError as e` was previously un-ranged. To extract its range, we had to do some lexing of our own. This change should improve performance and let us remove a bunch of code. ## Test Plan `cargo test`

This PR adds `TextRange` to `Identifier`. Right now, the AST only includes ranges for identifiers in certain cases (`Expr::Name`, `Keyword`, etc.), namely when the identifier comprises an entire AST node. In Ruff, we do additional ad-hoc lexing to extract identifiers from source code. One frequent example: given a function `def f(): ...`, we lex to find the range of `f`, for use in diagnostics. Another: `except ValueError as e`, for which the AST doesn't include a range for `e`. Note that, as an optimization, we avoid storing the `TextRange` for `Expr::Name`, since it's already included.

charliermarsh requested a review from MichaReiser June 14, 2023 22:00

charliermarsh mentioned this pull request Jun 15, 2023

Always use identifier ranges to store bindings astral-sh/ruff#5110

Merged

MichaReiser approved these changes Jun 19, 2023

View reviewed changes

charliermarsh force-pushed the charlie/atom branch from abf3ef9 to c767f10 Compare June 19, 2023 15:25

charliermarsh changed the base branch from main to charlie/fold June 19, 2023 15:25

charliermarsh force-pushed the charlie/atom branch from c767f10 to a23c23b Compare June 19, 2023 15:30

charliermarsh force-pushed the charlie/fold branch 3 times, most recently from 4c88e59 to 344a2c3 Compare June 19, 2023 21:20

Base automatically changed from charlie/fold to main June 19, 2023 21:26

charliermarsh force-pushed the charlie/atom branch 2 times, most recently from ab13b1a to 9c43be0 Compare June 19, 2023 21:30

charliermarsh mentioned this pull request Jun 19, 2023

Upgrade RustPython to access ranged names astral-sh/ruff#5194

Merged

charliermarsh force-pushed the charlie/atom branch 2 times, most recently from ea1106a to f1433f7 Compare June 19, 2023 22:31

Add TextRange to Identifier

871f369

charliermarsh force-pushed the charlie/atom branch from f1433f7 to 871f369 Compare June 20, 2023 15:10

charliermarsh merged commit ed3b4eb into main Jun 20, 2023

charliermarsh deleted the charlie/atom branch June 20, 2023 15:19

zanieb mentioned this pull request Jul 10, 2023

Sync with upstream repository #24

Closed

zanieb mentioned this pull request Jul 17, 2023

Merge upstream changes for PEP695 support (rebase) #25

Closed

zanieb mentioned this pull request Jul 17, 2023

Merge upstream changes for PEP695 support (cherry-pick ours) #26

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `TextRange` to `Identifier` #8

Add `TextRange` to `Identifier` #8

charliermarsh commented Jun 14, 2023 •

edited

Loading

charliermarsh commented Jun 15, 2023

MichaReiser commented Jun 19, 2023

MichaReiser Jun 19, 2023

MichaReiser Jun 19, 2023

charliermarsh commented Jun 19, 2023

charliermarsh commented Jun 19, 2023

Add TextRange to Identifier #8

Add TextRange to Identifier #8

Conversation

charliermarsh commented Jun 14, 2023 • edited Loading

Summary

charliermarsh commented Jun 15, 2023

MichaReiser commented Jun 19, 2023

MichaReiser Jun 19, 2023

Choose a reason for hiding this comment

MichaReiser Jun 19, 2023

Choose a reason for hiding this comment

charliermarsh commented Jun 19, 2023

charliermarsh commented Jun 19, 2023

Add `TextRange` to `Identifier` #8

Add `TextRange` to `Identifier` #8

charliermarsh commented Jun 14, 2023 •

edited

Loading