-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create an internal lint for detecting "Unicode-unaware" BytePos
& Span
manipulations
#128790
Comments
BytePos
manipulationsBytePos
& Span
manipulations
Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of the edge cases is having to account for an arg that has a closing delimiter `)` following it. However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)` character. This assumption was proven incorrect, because we recover from Unicode-confusable delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a codepoint boundary assertion. This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries. Fixes rust-lang#128717. cc `@fmease` and rust-lang#128790
Ensure let stmt compound assignment removal suggestion respect codepoint boundaries Previously we would try to issue a suggestion for `let x <op>= 1`, i.e. a compound assignment within a `let` binding, to remove the `<op>`. The suggestion code unfortunately incorrectly assumed that the `<op>` is an exactly-1-byte ASCII character, but this assumption is incorrect because we also recover Unicode-confusables like `➖=` as `-=`. In this example, the suggestion code used a `+ BytePos(1)` to calculate the span of the `<op>` codepoint that looks like `-` but the mult-byte Unicode look-alike would cause the suggested removal span to be inside a multi-byte codepoint boundary, triggering a codepoint boundary assertion. The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries. Fixes rust-lang#128845. cc rust-lang#128790 r? `@fmease`
Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of the edge cases is having to account for an arg that has a closing delimiter `)` following it. However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)` character. This assumption was proven incorrect, because we recover from Unicode-confusable delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a codepoint boundary assertion. This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries. Fixes rust-lang#128717. cc ``@fmease`` and rust-lang#128790
Ensure let stmt compound assignment removal suggestion respect codepoint boundaries Previously we would try to issue a suggestion for `let x <op>= 1`, i.e. a compound assignment within a `let` binding, to remove the `<op>`. The suggestion code unfortunately incorrectly assumed that the `<op>` is an exactly-1-byte ASCII character, but this assumption is incorrect because we also recover Unicode-confusables like `➖=` as `-=`. In this example, the suggestion code used a `+ BytePos(1)` to calculate the span of the `<op>` codepoint that looks like `-` but the mult-byte Unicode look-alike would cause the suggested removal span to be inside a multi-byte codepoint boundary, triggering a codepoint boundary assertion. The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries. Fixes rust-lang#128845. cc rust-lang#128790 r? ``@fmease``
Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of the edge cases is having to account for an arg that has a closing delimiter `)` following it. However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)` character. This assumption was proven incorrect, because we recover from Unicode-confusable delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a codepoint boundary assertion. This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries. Fixes rust-lang#128717. cc ```@fmease``` and rust-lang#128790
Ensure let stmt compound assignment removal suggestion respect codepoint boundaries Previously we would try to issue a suggestion for `let x <op>= 1`, i.e. a compound assignment within a `let` binding, to remove the `<op>`. The suggestion code unfortunately incorrectly assumed that the `<op>` is an exactly-1-byte ASCII character, but this assumption is incorrect because we also recover Unicode-confusables like `➖=` as `-=`. In this example, the suggestion code used a `+ BytePos(1)` to calculate the span of the `<op>` codepoint that looks like `-` but the mult-byte Unicode look-alike would cause the suggested removal span to be inside a multi-byte codepoint boundary, triggering a codepoint boundary assertion. The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries. Fixes rust-lang#128845. cc rust-lang#128790 r? ```@fmease```
Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of the edge cases is having to account for an arg that has a closing delimiter `)` following it. However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)` character. This assumption was proven incorrect, because we recover from Unicode-confusable delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a codepoint boundary assertion. This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries. Fixes rust-lang#128717. cc ````@fmease```` and rust-lang#128790
Ensure let stmt compound assignment removal suggestion respect codepoint boundaries Previously we would try to issue a suggestion for `let x <op>= 1`, i.e. a compound assignment within a `let` binding, to remove the `<op>`. The suggestion code unfortunately incorrectly assumed that the `<op>` is an exactly-1-byte ASCII character, but this assumption is incorrect because we also recover Unicode-confusables like `➖=` as `-=`. In this example, the suggestion code used a `+ BytePos(1)` to calculate the span of the `<op>` codepoint that looks like `-` but the mult-byte Unicode look-alike would cause the suggested removal span to be inside a multi-byte codepoint boundary, triggering a codepoint boundary assertion. The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries. Fixes rust-lang#128845. cc rust-lang#128790 r? ````@fmease````
Rollup merge of rust-lang#128865 - jieyouxu:unicurd, r=Urgau Ensure let stmt compound assignment removal suggestion respect codepoint boundaries Previously we would try to issue a suggestion for `let x <op>= 1`, i.e. a compound assignment within a `let` binding, to remove the `<op>`. The suggestion code unfortunately incorrectly assumed that the `<op>` is an exactly-1-byte ASCII character, but this assumption is incorrect because we also recover Unicode-confusables like `➖=` as `-=`. In this example, the suggestion code used a `+ BytePos(1)` to calculate the span of the `<op>` codepoint that looks like `-` but the mult-byte Unicode look-alike would cause the suggested removal span to be inside a multi-byte codepoint boundary, triggering a codepoint boundary assertion. The fix is to use `SourceMap::start_point(token_span)` which properly accounts for codepoint boundaries. Fixes rust-lang#128845. cc rust-lang#128790 r? ````@fmease````
Rollup merge of rust-lang#128864 - jieyouxu:funnicode, r=Urgau Use `SourceMap::end_point` instead of `- BytePos(1)` in arg removal suggestion Previously, we tried to remove extra arg commas when providing extra arg removal suggestions. One of the edge cases is having to account for an arg that has a closing delimiter `)` following it. However, the previous suggestion code assumed that the delimiter is in fact exactly the 1-byte `)` character. This assumption was proven incorrect, because we recover from Unicode-confusable delimiters in the parser, which means that the ending delimiter could be a multi-byte codepoint that looks *like* a `)`. Subtracing 1 byte could land us in the middle of a codepoint, triggering a codepoint boundary assertion. This is fixed by using `SourceMap::end_point` which properly accounts for codepoint boundaries. Fixes rust-lang#128717. cc ````@fmease```` and rust-lang#128790
As for suggestions:
|
Alternatively, if suggestions are often finding that a span needs to be recreated to match some subset of the AST (e.g. a keyword's span like |
Inspired by #128717 (comment). CC @jieyouxu.
Since we recover from lexically invalid tokens that are Unicode-confusable with tokens that are lexically valid (e.g., U+037E Greek Question Mark → U+003B Semicolon; U+066B Arabic Decimal Separator → U+002C Comma), (suggestion) diagnostic code down the line generally ought not make too many assumptions about the length and/or position in bytes that the
Span
of a supposed Rust token/lexeme "maps to" .In reality however, all too often (suggestion) diagnostic code doesn't follow this 'rule' when performing low-level "span manipulations" defaulting to hard-coded lengths and/or positions. The compiler contains a bunch of snippets like
- BytePos(1)
or+ BytePos(1)
where the code guesses that the code point before/after corresponds to a certain token/lexeme like,
,;
,)
. However, such code doesn't account for the aforesaid recovery which may have mapped a UTF-8 code unit with byte length > 1 to an ASCII character of length 1 which can lead to ICEs (internal assertions or indexing/slicing at non-char boundaries).So it might be worth linting against these error-prone
BytePos
&Span
manipulations.I don't know how feasible it'd be to implement such a lint well (i.e., low false positive rate) or how the exact rules should look like.
This issue may serve a dual purpose as a tracking issue for eliminating this 'pattern' from the code base.
Uplifted from #128717 (comment):
It's so easy to find these kinds of ICEs:
/(\+|-) BytePos\(\d+\)/
insidecompiler/
,.len_utf8()
is >1 and 💥Example ICE: #128717.
Example ICE: I just found this a minute ago while reviewing an unrelated PR:
This code uses a Medium Right Parenthesis Ornament (U+2769) which is confusable with Right Parenthesis.
Leads to:
Code in question:
rust/compiler/rustc_hir_typeck/src/fn_ctxt/checks.rs
Line 1144 in c9687a9
Discussions
Related issues
The text was updated successfully, but these errors were encountered: