Skip to content
This repository has been archived by the owner on Jul 27, 2023. It is now read-only.

Add TextRange to Identifier #8

Merged
merged 1 commit into from
Jun 20, 2023
Merged

Add TextRange to Identifier #8

merged 1 commit into from
Jun 20, 2023

Conversation

charliermarsh
Copy link
Member

@charliermarsh charliermarsh commented Jun 14, 2023

Summary

This PR adds TextRange to Identifier. Right now, the AST only includes ranges for identifiers in certain cases (Expr::Name, Keyword, etc.), namely when the identifier comprises an entire AST node. In Ruff, we do additional ad-hoc lexing to extract identifiers from source code.

One frequent example: given a function def f(): ..., we lex to find the range of f, for use in diagnostics.

Another: except ValueError as e, for which the AST doesn't include a range for e.

Note that, as an optimization, we avoid storing the TextRange for Expr::Name, since it's already included.

@charliermarsh
Copy link
Member Author

As a compromise, one thing we could do is avoid storing ranges on the internal Identifier in Expr::Name, since that always matches the range of the expression. So we'd store ranges for (e.g.) the Identifier in except ValueError as e, but not on the Identifier in x = 1. That would avoid any duplication.

charliermarsh added a commit to astral-sh/ruff that referenced this pull request Jun 15, 2023
## Summary

At present, when we store a binding, we include a `TextRange` alongside
it. The `TextRange` _sometimes_ matches the exact range of the
identifier to which the `Binding` is linked, but... not always.

For example, given:

```python
x = 1
```

The binding we create _will_ use the range of `x`, because the left-hand
side is an `Expr::Name`, which has a valid range on it.

However, given:

```python
try:
  pass
except ValueError as e:
  pass
```

When we create a binding for `e`, we don't have a `TextRange`... The AST
doesn't give us one. So we end up extracting it via lexing.

This PR extends that pattern to the rest of the binding kinds, to ensure
that whenever we create a binding, we always use the range of the bound
name. This leads to better diagnostics in cases like pattern matching,
whereby the diagnostic for "unused variable `x`" here used to include
`*x`, instead of just `x`:

```python
def f(provided: int) -> int:
    match provided:
        case [_, *x]:
            pass
```

This is _also_ required for symbol renames, since we track writes as
bindings -- so we need to know the ranges of the bound symbols.

By storing these bindings precisely, we can also remove the
`binding.trimmed_range` abstraction -- since bindings already use the
"trimmed range".

To implement this behavior, I took some of our existing utilities (like
the code we had for `except ValueError as e` above), migrated them from
a full lexer to a zero-allocation lexer that _only_ identifies
"identifiers", and moved the behavior into a trait, so we can now do
`stmt.identifier(locator)` to get the range for the identifier.

Honestly, we might end up discarding much of this if we decide to put
ranges on all identifiers
(astral-sh/RustPython-Parser#8). But even if we
do, this will _still_ be a good change, because the lexer introduced
here is useful beyond names (e.g., we use it find the `except` keyword
in an exception handler, to find the `else` after a `for` loop, and so
on). So, I'm fine committing this even if we end up changing our minds
about the right approach.

Closes #5090.

## Benchmarks

No significant change, with one statistically significant improvement
(-2.1654% on `linter/all-rules/large/dataset.py`):

```
linter/default-rules/numpy/globals.py
                        time:   [73.922 µs 73.955 µs 73.986 µs]
                        thrpt:  [39.882 MiB/s 39.898 MiB/s 39.916 MiB/s]
                 change:
                        time:   [-0.5579% -0.4732% -0.3980%] (p = 0.00 < 0.05)
                        thrpt:  [+0.3996% +0.4755% +0.5611%]
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
linter/default-rules/pydantic/types.py
                        time:   [1.4909 ms 1.4917 ms 1.4926 ms]
                        thrpt:  [17.087 MiB/s 17.096 MiB/s 17.106 MiB/s]
                 change:
                        time:   [+0.2140% +0.2741% +0.3392%] (p = 0.00 < 0.05)
                        thrpt:  [-0.3380% -0.2734% -0.2136%]
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
linter/default-rules/numpy/ctypeslib.py
                        time:   [688.97 µs 691.34 µs 694.15 µs]
                        thrpt:  [23.988 MiB/s 24.085 MiB/s 24.168 MiB/s]
                 change:
                        time:   [-1.3282% -0.7298% -0.1466%] (p = 0.02 < 0.05)
                        thrpt:  [+0.1468% +0.7351% +1.3461%]
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  12 (12.00%) high severe
linter/default-rules/large/dataset.py
                        time:   [3.3872 ms 3.4032 ms 3.4191 ms]
                        thrpt:  [11.899 MiB/s 11.954 MiB/s 12.011 MiB/s]
                 change:
                        time:   [-0.6427% -0.2635% +0.0906%] (p = 0.17 > 0.05)
                        thrpt:  [-0.0905% +0.2642% +0.6469%]
                        No change in performance detected.
Found 20 outliers among 100 measurements (20.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  13 (13.00%) high severe

linter/all-rules/numpy/globals.py
                        time:   [148.99 µs 149.21 µs 149.42 µs]
                        thrpt:  [19.748 MiB/s 19.776 MiB/s 19.805 MiB/s]
                 change:
                        time:   [-0.7340% -0.5068% -0.2778%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2785% +0.5094% +0.7395%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe
linter/all-rules/pydantic/types.py
                        time:   [3.0362 ms 3.0396 ms 3.0441 ms]
                        thrpt:  [8.3779 MiB/s 8.3903 MiB/s 8.3997 MiB/s]
                 change:
                        time:   [-0.0957% +0.0618% +0.2125%] (p = 0.45 > 0.05)
                        thrpt:  [-0.2121% -0.0618% +0.0958%]
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
linter/all-rules/numpy/ctypeslib.py
                        time:   [1.6879 ms 1.6894 ms 1.6909 ms]
                        thrpt:  [9.8478 MiB/s 9.8562 MiB/s 9.8652 MiB/s]
                 change:
                        time:   [-0.2279% -0.0888% +0.0436%] (p = 0.18 > 0.05)
                        thrpt:  [-0.0435% +0.0889% +0.2284%]
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) low mild
  1 (1.00%) high severe
linter/all-rules/large/dataset.py
                        time:   [7.1520 ms 7.1586 ms 7.1654 ms]
                        thrpt:  [5.6777 MiB/s 5.6831 MiB/s 5.6883 MiB/s]
                 change:
                        time:   [-2.5626% -2.1654% -1.7780%] (p = 0.00 < 0.05)
                        thrpt:  [+1.8102% +2.2133% +2.6300%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
```
@MichaReiser
Copy link
Member

Having the ranges on the identifier will simplify the formatter (and improve its performance) because it allows us to use the source text slice without any lexing for classes, functions, and potentially other nodes.

I'm only worried about any potential performance penalty because it happens to increase the size of Stmt by multiple bytes.

An alternative could be to have a Span<T> type that adds a range to a type. We could use it in the few places where the Identifier's range isn't known from its parent node.

@@ -21,7 +21,13 @@ type ParameterDef = (ast::Arg, Option<ast::Expr>);
pub(crate) fn validate_arguments(
arguments: ast::Arguments,
) -> Result<ast::Arguments, LexicalError> {
let mut all_args: Vec<&ast::Arg> = vec![];
let mut all_args: Vec<&ast::Arg> = Vec::with_capacity(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sneaky changes ;)

let mut r = n.to_string();
for x in n2 {
r.push('.');
r.push_str(x.1.as_str());
}
ast::Identifier::new(r)
ast::Identifier::new(r, (location..end_location).into())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry what... It's weird that a.b.c gets parsed as a single identifier instead of a MemberName node that has an object and property node where object is a MemberName or Identifier and property is an Identifier.

@charliermarsh
Copy link
Member Author

Yeah, I think we should (at the very least) only use this for identifiers that aren't already ranged. I'll make that change. I believe I'm aware of every case:

  • Class names
  • Function names
  • Named exceptions (except ValueError as e)
  • Aliases (import foo as bar, where foo as bar has a range, but foo and bar do not)
  • Capture variants in pattern-matching (case *x:)

@charliermarsh charliermarsh changed the base branch from main to charlie/fold June 19, 2023 15:25
@charliermarsh
Copy link
Member Author

I change the implementation to intentionally exclude Expr::Name which I think is the only case in which we don't need this.

@charliermarsh charliermarsh force-pushed the charlie/fold branch 3 times, most recently from 4c88e59 to 344a2c3 Compare June 19, 2023 21:20
Base automatically changed from charlie/fold to main June 19, 2023 21:26
@charliermarsh charliermarsh force-pushed the charlie/atom branch 2 times, most recently from ab13b1a to 9c43be0 Compare June 19, 2023 21:30
@charliermarsh charliermarsh force-pushed the charlie/atom branch 2 times, most recently from ea1106a to f1433f7 Compare June 19, 2023 22:31
@charliermarsh charliermarsh merged commit ed3b4eb into main Jun 20, 2023
@charliermarsh charliermarsh deleted the charlie/atom branch June 20, 2023 15:19
charliermarsh added a commit to astral-sh/ruff that referenced this pull request Jun 20, 2023
## Summary

In astral-sh/RustPython-Parser#8, we modified
RustPython to include ranges for any identifiers that aren't
`Expr::Name` (which already has an identifier).

For example, the `e` in `except ValueError as e` was previously
un-ranged. To extract its range, we had to do some lexing of our own.
This change should improve performance and let us remove a bunch of
code.

## Test Plan

`cargo test`
zanieb added a commit that referenced this pull request Jul 10, 2023
This PR adds `TextRange` to `Identifier`. Right now, the AST only
includes ranges for identifiers in certain cases (`Expr::Name`,
`Keyword`, etc.), namely when the identifier comprises an entire AST
node. In Ruff, we do additional ad-hoc lexing to extract identifiers
from source code.

One frequent example: given a function `def f(): ...`, we lex to find
the range of `f`, for use in diagnostics.

Another: `except ValueError as e`, for which the AST doesn't include a
range for `e`.

Note that, as an optimization, we avoid storing the `TextRange` for
`Expr::Name`, since it's already included.
zanieb added a commit that referenced this pull request Jul 17, 2023
This PR adds `TextRange` to `Identifier`. Right now, the AST only
includes ranges for identifiers in certain cases (`Expr::Name`,
`Keyword`, etc.), namely when the identifier comprises an entire AST
node. In Ruff, we do additional ad-hoc lexing to extract identifiers
from source code.

One frequent example: given a function `def f(): ...`, we lex to find
the range of `f`, for use in diagnostics.

Another: `except ValueError as e`, for which the AST doesn't include a
range for `e`.

Note that, as an optimization, we avoid storing the `TextRange` for
`Expr::Name`, since it's already included.
zanieb pushed a commit that referenced this pull request Jul 17, 2023
This PR adds `TextRange` to `Identifier`. Right now, the AST only
includes ranges for identifiers in certain cases (`Expr::Name`,
`Keyword`, etc.), namely when the identifier comprises an entire AST
node. In Ruff, we do additional ad-hoc lexing to extract identifiers
from source code.

One frequent example: given a function `def f(): ...`, we lex to find
the range of `f`, for use in diagnostics.

Another: `except ValueError as e`, for which the AST doesn't include a
range for `e`.

Note that, as an optimization, we avoid storing the `TextRange` for
`Expr::Name`, since it's already included.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants