Always use identifier ranges to store bindings #5110

charliermarsh · 2023-06-15T03:56:56Z

Summary

At present, when we store a binding, we include a TextRange alongside it. The TextRange sometimes matches the exact range of the identifier to which the Binding is linked, but... not always.

For example, given:

x = 1

The binding we create will use the range of x, because the left-hand side is an Expr::Name, which has a valid range on it.

However, given:

try:
  pass
except ValueError as e:
  pass

When we create a binding for e, we don't have a TextRange... The AST doesn't give us one. So we end up extracting it via lexing.

This PR extends that pattern to the rest of the binding kinds, to ensure that whenever we create a binding, we always use the range of the bound name. This leads to better diagnostics in cases like pattern matching, whereby the diagnostic for "unused variable x" here used to include *x, instead of just x:

def f(provided: int) -> int:
    match provided:
        case [_, *x]:
            pass

This is also required for symbol renames, since we track writes as bindings -- so we need to know the ranges of the bound symbols.

By storing these bindings precisely, we can also remove the binding.trimmed_range abstraction -- since bindings already use the "trimmed range".

To implement this behavior, I took some of our existing utilities (like the code we had for except ValueError as e above), migrated them from a full lexer to a zero-allocation lexer that only identifies "identifiers", and moved the behavior into a trait, so we can now do stmt.identifier(locator) to get the range for the identifier.

Honestly, we might end up discarding much of this if we decide to put ranges on all identifiers (astral-sh/RustPython-Parser#8). But even if we do, this will still be a good change, because the lexer introduced here is useful beyond names (e.g., we use it find the except keyword in an exception handler, to find the else after a for loop, and so on). So, I'm fine committing this even if we end up changing our minds about the right approach.

Closes #5090.

Benchmarks

No significant change, with one statistically significant improvement (-2.1654% on linter/all-rules/large/dataset.py):

linter/default-rules/numpy/globals.py
                        time:   [73.922 µs 73.955 µs 73.986 µs]
                        thrpt:  [39.882 MiB/s 39.898 MiB/s 39.916 MiB/s]
                 change:
                        time:   [-0.5579% -0.4732% -0.3980%] (p = 0.00 < 0.05)
                        thrpt:  [+0.3996% +0.4755% +0.5611%]
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
linter/default-rules/pydantic/types.py
                        time:   [1.4909 ms 1.4917 ms 1.4926 ms]
                        thrpt:  [17.087 MiB/s 17.096 MiB/s 17.106 MiB/s]
                 change:
                        time:   [+0.2140% +0.2741% +0.3392%] (p = 0.00 < 0.05)
                        thrpt:  [-0.3380% -0.2734% -0.2136%]
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
linter/default-rules/numpy/ctypeslib.py
                        time:   [688.97 µs 691.34 µs 694.15 µs]
                        thrpt:  [23.988 MiB/s 24.085 MiB/s 24.168 MiB/s]
                 change:
                        time:   [-1.3282% -0.7298% -0.1466%] (p = 0.02 < 0.05)
                        thrpt:  [+0.1468% +0.7351% +1.3461%]
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  12 (12.00%) high severe
linter/default-rules/large/dataset.py
                        time:   [3.3872 ms 3.4032 ms 3.4191 ms]
                        thrpt:  [11.899 MiB/s 11.954 MiB/s 12.011 MiB/s]
                 change:
                        time:   [-0.6427% -0.2635% +0.0906%] (p = 0.17 > 0.05)
                        thrpt:  [-0.0905% +0.2642% +0.6469%]
                        No change in performance detected.
Found 20 outliers among 100 measurements (20.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  13 (13.00%) high severe

linter/all-rules/numpy/globals.py
                        time:   [148.99 µs 149.21 µs 149.42 µs]
                        thrpt:  [19.748 MiB/s 19.776 MiB/s 19.805 MiB/s]
                 change:
                        time:   [-0.7340% -0.5068% -0.2778%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2785% +0.5094% +0.7395%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe
linter/all-rules/pydantic/types.py
                        time:   [3.0362 ms 3.0396 ms 3.0441 ms]
                        thrpt:  [8.3779 MiB/s 8.3903 MiB/s 8.3997 MiB/s]
                 change:
                        time:   [-0.0957% +0.0618% +0.2125%] (p = 0.45 > 0.05)
                        thrpt:  [-0.2121% -0.0618% +0.0958%]
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe
linter/all-rules/numpy/ctypeslib.py
                        time:   [1.6879 ms 1.6894 ms 1.6909 ms]
                        thrpt:  [9.8478 MiB/s 9.8562 MiB/s 9.8652 MiB/s]
                 change:
                        time:   [-0.2279% -0.0888% +0.0436%] (p = 0.18 > 0.05)
                        thrpt:  [-0.0435% +0.0889% +0.2284%]
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) low mild
  1 (1.00%) high severe
linter/all-rules/large/dataset.py
                        time:   [7.1520 ms 7.1586 ms 7.1654 ms]
                        thrpt:  [5.6777 MiB/s 5.6831 MiB/s 5.6883 MiB/s]
                 change:
                        time:   [-2.5626% -2.1654% -1.7780%] (p = 0.00 < 0.05)
                        thrpt:  [+1.8102% +2.2133% +2.6300%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

github-actions · 2023-06-15T04:31:12Z

PR Check Results

Ecosystem

ℹ️ ecosystem check detected changes. (+12, -12, 0 error(s))

airflow (+11, -11)

- airflow/callbacks/pipe_callback_sink.py:20:40: TCH003 [*] Move standard library import `multiprocessing.connection.Connection` into a type-checking block
+ airflow/callbacks/pipe_callback_sink.py:20:54: TCH003 [*] Move standard library import `multiprocessing.connection.Connection` into a type-checking block
- airflow/dag_processing/manager.py:36:40: TCH003 [*] Move standard library import `multiprocessing.connection.Connection` into a type-checking block
+ airflow/dag_processing/manager.py:36:54: TCH003 [*] Move standard library import `multiprocessing.connection.Connection` into a type-checking block
- airflow/dag_processing/processor.py:29:40: TCH003 [*] Move standard library import `multiprocessing.connection.Connection` into a type-checking block
+ airflow/dag_processing/processor.py:29:54: TCH003 [*] Move standard library import `multiprocessing.connection.Connection` into a type-checking block
- airflow/executors/kubernetes_executor.py:36:46: TCH002 [*] Move third-party import `kubernetes.client.models` into a type-checking block
+ airflow/executors/kubernetes_executor.py:36:56: TCH002 [*] Move third-party import `kubernetes.client.models` into a type-checking block
- airflow/kubernetes/k8s_model.py:23:31: TCH002 [*] Move third-party import `kubernetes.client.models` into a type-checking block
+ airflow/kubernetes/k8s_model.py:23:41: TCH002 [*] Move third-party import `kubernetes.client.models` into a type-checking block
- airflow/providers/elasticsearch/hooks/elasticsearch.py:29:39: TCH001 [*] Move application import `airflow.models.connection.Connection` into a type-checking block
+ airflow/providers/elasticsearch/hooks/elasticsearch.py:29:53: TCH001 [*] Move application import `airflow.models.connection.Connection` into a type-checking block
+ airflow/providers/exasol/hooks/exasol.py:23:18: TCH002 [*] Move third-party import `pandas` into a type-checking block
- airflow/providers/exasol/hooks/exasol.py:23:8: TCH002 [*] Move third-party import `pandas` into a type-checking block
- airflow/providers/google/cloud/transfers/presto_to_gcs.py:23:28: TCH002 [*] Move third-party import `prestodb.dbapi.Cursor` into a type-checking block
+ airflow/providers/google/cloud/transfers/presto_to_gcs.py:23:38: TCH002 [*] Move third-party import `prestodb.dbapi.Cursor` into a type-checking block
- airflow/providers/google/cloud/transfers/trino_to_gcs.py:23:25: TCH002 [*] Move third-party import `trino.dbapi.Cursor` into a type-checking block
+ airflow/providers/google/cloud/transfers/trino_to_gcs.py:23:35: TCH002 [*] Move third-party import `trino.dbapi.Cursor` into a type-checking block
+ airflow/providers/influxdb/hooks/influxdb.py:27:18: TCH002 [*] Move third-party import `pandas` into a type-checking block
- airflow/providers/influxdb/hooks/influxdb.py:27:8: TCH002 [*] Move third-party import `pandas` into a type-checking block
- airflow/serialization/serialized_objects.py:74:39: TCH004 [*] Move import `kubernetes.client.models` out of type-checking block. Import is used for more than type hinting.
+ airflow/serialization/serialized_objects.py:74:49: TCH004 [*] Move import `kubernetes.client.models` out of type-checking block. Import is used for more than type hinting.

bokeh (+1, -1)

+ src/bokeh/plotting/_figure.py:25:17: TCH002 [*] Move third-party import `numpy` into a type-checking block
- src/bokeh/plotting/_figure.py:25:8: TCH002 [*] Move third-party import `numpy` into a type-checking block

Rules changed: 4

Rule	Changes	Additions	Removals
TCH002	14	7	7
TCH003	6	3	3
TCH001	2	1	1
TCH004	2	1	1

Benchmark

Linux

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.01      6.3±0.04ms     6.4 MB/sec    1.00      6.3±0.03ms     6.5 MB/sec
formatter/numpy/ctypeslib.py               1.00   1320.8±4.02µs    12.6 MB/sec    1.00   1318.0±5.29µs    12.6 MB/sec
formatter/numpy/globals.py                 1.00    127.5±0.22µs    23.1 MB/sec    1.00    126.9±0.54µs    23.2 MB/sec
formatter/pydantic/types.py                1.01      2.6±0.03ms     9.9 MB/sec    1.00      2.6±0.01ms    10.0 MB/sec
linter/all-rules/large/dataset.py          1.01     13.7±0.16ms     3.0 MB/sec    1.00     13.5±0.15ms     3.0 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.01      3.3±0.02ms     5.0 MB/sec    1.00      3.3±0.02ms     5.1 MB/sec
linter/all-rules/numpy/globals.py          1.01    414.3±3.33µs     7.1 MB/sec    1.00    411.5±0.60µs     7.2 MB/sec
linter/all-rules/pydantic/types.py         1.02      5.8±0.06ms     4.4 MB/sec    1.00      5.7±0.03ms     4.5 MB/sec
linter/default-rules/large/dataset.py      1.00      6.6±0.04ms     6.2 MB/sec    1.00      6.6±0.05ms     6.2 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.01   1431.3±3.52µs    11.6 MB/sec    1.00   1412.8±2.88µs    11.8 MB/sec
linter/default-rules/numpy/globals.py      1.02    160.6±0.26µs    18.4 MB/sec    1.00    158.0±0.81µs    18.7 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.0±0.02ms     8.5 MB/sec    1.00      3.0±0.02ms     8.5 MB/sec

Windows

group                                      main                                   pr
-----                                      ----                                   --
formatter/large/dataset.py                 1.00      7.8±0.06ms     5.2 MB/sec    1.10      8.5±0.02ms     4.8 MB/sec
formatter/numpy/ctypeslib.py               1.00   1560.8±6.07µs    10.7 MB/sec    1.08  1693.2±10.24µs     9.8 MB/sec
formatter/numpy/globals.py                 1.00    155.6±3.80µs    19.0 MB/sec    1.06    164.6±1.46µs    17.9 MB/sec
formatter/pydantic/types.py                1.00      3.2±0.02ms     8.1 MB/sec    1.10      3.5±0.03ms     7.4 MB/sec
linter/all-rules/large/dataset.py          1.00     16.0±0.04ms     2.5 MB/sec    1.01     16.1±0.04ms     2.5 MB/sec
linter/all-rules/numpy/ctypeslib.py        1.00      4.2±0.02ms     4.0 MB/sec    1.00      4.2±0.02ms     4.0 MB/sec
linter/all-rules/numpy/globals.py          1.00    429.7±5.47µs     6.9 MB/sec    1.01    433.0±7.27µs     6.8 MB/sec
linter/all-rules/pydantic/types.py         1.00      7.0±0.02ms     3.6 MB/sec    1.01      7.1±0.02ms     3.6 MB/sec
linter/default-rules/large/dataset.py      1.00      8.3±0.01ms     4.9 MB/sec    1.00      8.3±0.03ms     4.9 MB/sec
linter/default-rules/numpy/ctypeslib.py    1.00   1716.7±6.48µs     9.7 MB/sec    1.00   1719.1±8.34µs     9.7 MB/sec
linter/default-rules/numpy/globals.py      1.00    185.2±1.16µs    15.9 MB/sec    1.01    186.5±0.86µs    15.8 MB/sec
linter/default-rules/pydantic/types.py     1.00      3.7±0.01ms     6.8 MB/sec    1.00      3.8±0.01ms     6.8 MB/sec

charliermarsh · 2023-06-15T04:39:59Z

crates/ruff_python_ast/src/identifier.rs

+//! The statement defining `f` has a range, but the identifier `f` does not.
+//!
+//! This module assists with extracting [`TextRange`] ranges from AST nodes
+//! via manual lexical analysis.


This file got collapsed, but it's really the one that needs to be reviewed. Most of the rest is just API migration.

crates/ruff_python_ast/src/helpers.rs

konstin · 2023-06-15T09:23:19Z

crates/ruff_python_ast/src/identifier.rs

+                rest: Some(_),
+                ..
+            }) => {
+                if let Some(pattern) = patterns.last() {


Suggested change

if let Some(pattern) = patterns.last() {

// From https://docs.python.org/3/reference/compound_stmts.html#grammar-token-python-grammar-mapping_pattern

// > At most one double star pattern may be in a mapping pattern.

// > The double star pattern must be the last subpattern in the mapping pattern.

if let Some(pattern) = patterns.last() {

You might want to change the exact comment, but i expect many readers won't be familiar with what exactly the PatternMatchMapping node entails

crates/ruff_python_ast/src/identifier.rs

konstin · 2023-06-15T09:35:54Z

crates/ruff_python_ast/src/identifier.rs

+
+    /// Peeks the next character from the input stream without consuming it.
+    /// Returns [`EOF_CHAR`] if the file is at the end of the file.
+    fn first(&self) -> char {


Why does this return EOF while bump returns an option

This is consistent with how Micha designed it in the trivia.rs (from which this is derived), so I'm just gonna leave it for now.

charliermarsh force-pushed the charlie/binding-identifier branch 2 times, most recently from 9c36a04 to 970107a Compare June 15, 2023 04:34

charliermarsh requested review from MichaReiser and konstin June 15, 2023 04:39

charliermarsh commented Jun 15, 2023

View reviewed changes

charliermarsh force-pushed the charlie/binding-identifier branch from 970107a to 3a7b3f0 Compare June 15, 2023 05:16

konstin approved these changes Jun 15, 2023

View reviewed changes

charliermarsh added 4 commits June 15, 2023 09:28

Always use identifier range for binding

843b4ce

Move utilities into their own crates

1c2942d

Add a lexer

d2fbf7c

Use a trait

b11d062

charliermarsh force-pushed the charlie/binding-identifier branch from 3a7b3f0 to 7945a9f Compare June 15, 2023 16:48

Remove temporary comments

43a26ca

charliermarsh force-pushed the charlie/binding-identifier branch from 7945a9f to 43a26ca Compare June 15, 2023 16:50

charliermarsh enabled auto-merge (squash) June 15, 2023 16:50

charliermarsh merged commit 5ea3e42 into main Jun 15, 2023

charliermarsh deleted the charlie/binding-identifier branch June 15, 2023 18:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Always use identifier ranges to store bindings #5110

Always use identifier ranges to store bindings #5110

charliermarsh commented Jun 15, 2023 •

edited

Loading

github-actions bot commented Jun 15, 2023 •

edited

Loading

charliermarsh Jun 15, 2023

konstin Jun 15, 2023

konstin Jun 15, 2023

charliermarsh Jun 15, 2023

-                if let Some(pattern) = patterns.last() {
+                // From https://docs.python.org/3/reference/compound_stmts.html#grammar-token-python-grammar-mapping_pattern
+                // > At most one double star pattern may be in a mapping pattern.
+                // > The double star pattern must be the last subpattern in the mapping pattern.
+                if let Some(pattern) = patterns.last() {

Always use identifier ranges to store bindings #5110

Always use identifier ranges to store bindings #5110

Conversation

charliermarsh commented Jun 15, 2023 • edited Loading

Summary

Benchmarks

github-actions bot commented Jun 15, 2023 • edited Loading

PR Check Results

Ecosystem

Benchmark

Linux

Windows

charliermarsh Jun 15, 2023

Choose a reason for hiding this comment

konstin Jun 15, 2023

Choose a reason for hiding this comment

konstin Jun 15, 2023

Choose a reason for hiding this comment

charliermarsh Jun 15, 2023

Choose a reason for hiding this comment

charliermarsh commented Jun 15, 2023 •

edited

Loading

github-actions bot commented Jun 15, 2023 •

edited

Loading