-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update string nodes for implicit concatenation #7927
Conversation
Current dependencies on/for this PR:
This stack of pull requests is managed by Graphite. |
7c763c5
to
02ff747
Compare
5b20992
to
66ae3fc
Compare
02ff747
to
ada537e
Compare
66ae3fc
to
fd9f090
Compare
ada537e
to
1ae9b42
Compare
fd9f090
to
7111576
Compare
fc67e03
to
56e27ac
Compare
7111576
to
f544b3f
Compare
13534ce
to
382e85d
Compare
crates/ruff_linter/src/rules/pyupgrade/rules/use_pep604_annotation.rs
Outdated
Show resolved
Hide resolved
cf8ee21
to
aaf65a1
Compare
07e49e4
to
8a392a2
Compare
As highlighted in the review: > If you have two `ConcatenatedStringLiteral` values where both have > equivalent values for `strings` but where one has `value` initialized > and the other does not, would you expect them to compare equal? > Semantically I think I would, since the alternative is that equality is > dependent on whether `as_str()` has been called, which seems incidental. #7927 (comment)
This commit implements the AST design to account for implicit concatenation in string nodes, specifically the `ExprFString`, `ExprStringLiteral`, and `ExprBytesLiteral` nodes.
This commit adds the new variants for the string parts to `AnyNode` and `AnyNodeRef` enums. These parts are `StringLiteral` (part of `ExprStringLiteral`), `BytesLiteral` (part of `ExprBytesLiteral`), and `FString` (part of `ExprFString`). The reason for this is to add visitor methods for these parts. This is done in the following commit. So, the visitor would visit the string as a whole first and then visit each part. ``` ExprStringLiteral - "foo" "bar" |- StringLiteral - "foo" |- StringLiteral - "bar" ``` The above tree helps understand the way visitor would work.
The visitor implementations are updated to visit each part nodes for the respective string nodes. The following example better highlights this: ``` ExprStringLiteral - "foo" "bar" |- StringLiteral - "foo" |- StringLiteral - "bar" ``` The `visit_expr` method would be use to visit the `ExprStringLiteral` while the `visit_string_literal` method would be use for the `StringLiteral` node. Similar methods are added for bytes and f-strings.
The generator is basically improved. Earlier, for an implicitly concatenated string we would produce the joined form. So, ```python "foo" "bar" "baz" ``` For the above example, the generator would give us: ```python "foobarbaz" ``` Now, as we have the information for each part, we will be producing the exact code back.
`Expr` is a general type for all expressions while `LiteralExpressionRef` is a type which includes only the literal expressions. The method is suited more for this type instead. This will also help in the formatter change.
As highlighted in the review: > If you have two `ConcatenatedStringLiteral` values where both have > equivalent values for `strings` but where one has `value` initialized > and the other does not, would you expect them to compare equal? > Semantically I think I would, since the alternative is that equality is > dependent on whether `as_str()` has been called, which seems incidental. #7927 (comment)
inner(&mut expr.left, &mut formatted_expressions); | ||
inner(&mut expr.right, &mut formatted_expressions); | ||
formatted_expressions | ||
.map(|output| Fix::safe_edit(Edit::range_replacement(output, f_string.range()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, how did this get so much simpler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's the magic of this refactor 😉
Jokes aside, this basically reverts this commit ac4a4da which fixed the bug where this rule wasn't producing the correct fix for an implicitly concatenated string. But, now that we work on individual parts, that isn't required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is impressive work. There's a huge surface area here. You did a great job of breaking down the individual PRs and walking us through it.
I left a few questions, which I'm happy to continued discussing, but erring on the side of approving.
#[derive(Clone, Debug, PartialEq)] | ||
pub struct FString { | ||
pub range: TextRange, | ||
pub values: Vec<Expr>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if these should be smallvec
... It's probably worth benchmarking in a separate PR. In practice, I'd bet the vast majority of concatenations are <= 5 elements.
@@ -130,7 +130,7 @@ pub(crate) fn unrecognized_platform(checker: &mut Checker, test: &Expr) { | |||
if !matches!(value.as_str(), "linux" | "win32" | "cygwin" | "darwin") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example where... we can probably do this without the concatenation? Imagine if we had a custom matcher that worked on implicit concatenations (i.e., a vector of strings). \cc @BurntSushi in case you have obvious ideas here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the closest "out of the box" experience you can get is stream searching with aho-corasick
, but you need to provide a std::io::Read
impl. Mildly annoying but doable in this context.
But I would say to write the simple/obvious code for now, and if this ends up popping up on a profile then we can revisit it and do something bespoke here or use aho-corasick
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would prefer to avoid the complexity for such simple cases but I do understand that this could be useful in similar cases where the strings are long.
baa24e7
to
8193ff9
Compare
Rebase of #6365 authored by @davidszotten. ## Summary This PR updates the AST structure for an f-string elements. The main **motivation** behind this change is to have a dedicated node for the string part of an f-string. Previously, the existing `ExprStringLiteral` node was used for this purpose which isn't exactly correct. The `ExprStringLiteral` node should include the quotes as well in the range but the f-string literal element doesn't include the quote as it's a specific part within an f-string. For example, ```python f"foo {x}" # ^^^^ # This is the literal part of an f-string ``` The introduction of `FStringElement` enum is helpful which represent either the literal part or the expression part of an f-string. ### Rule Updates This means that there'll be two nodes representing a string depending on the context. One for a normal string literal while the other is a string literal within an f-string. The AST checker is updated to accommodate this change. The rules which work on string literal are updated to check on the literal part of f-string as well. #### Notes 1. The `Expr::is_literal_expr` method would check for `ExprStringLiteral` and return true if so. But now that we don't represent the literal part of an f-string using that node, this improves the method's behavior and confines to the actual expression. We do have the `FStringElement::is_literal` method. 2. We avoid checking if we're in a f-string context before adding to `string_type_definitions` because the f-string literal is now a dedicated node and not part of `Expr`. 3. Annotations cannot use f-string so we avoid changing any rules which work on annotation and checks for `ExprStringLiteral`. ## Test Plan - All references of `Expr::StringLiteral` were checked to see if any of the rules require updating to account for the f-string literal element node. - New test cases are added for rules which check against the literal part of an f-string. - Check the ecosystem results and ensure it remains unchanged. ## Performance There's a performance penalty in the parser. The reason for this remains unknown as it seems that the generated assembly code is now different for the `__reduce154` function. The reduce function body is just popping the `ParenthesizedExpr` on top of the stack and pushing it with the new location. - The size of `FStringElement` enum is the same as `Expr` which is what it replaces in `FString::format_spec` - The size of `FStringExpressionElement` is the same as `ExprFormattedValue` which is what it replaces I tried reducing the `Expr` enum from 80 bytes to 72 bytes but it hardly resulted in any performance gain. The difference can be seen here: - Original profile: https://share.firefox.dev/3Taa7ES - Profile after boxing some node fields: https://share.firefox.dev/3GsNXpD ### Backtracking I tried backtracking the changes to see if any of the isolated change produced this regression. The problem here is that the overall change is so small that there's only a single checkpoint where I can backtrack and that checkpoint results in the same regression. This checkpoint is to revert using `Expr` to the `FString::format_spec` field. After this point, the change would revert back to the original implementation. ## Review process The review process is similar to #7927. The first set of commits update the node structure, parser, and related AST files. Then, further commits update the linter and formatter part to account for the AST change. --------- Co-authored-by: David Szotten <davidszotten@gmail.com>
Summary
This PR updates the string nodes (
ExprStringLiteral
,ExprBytesLiteral
, andExprFString
) to account for implicit string concatenation.Motivation
In Python, implicit string concatenation are joined while parsing because the interpreter doesn't require the information for each part. While that's feasible for an interpreter, it falls short for a static analysis tool where having such information is more useful. Currently, various parts of the code uses the lexer to get the individual string parts.
One of the main challenge this solves is that of string formatting. Currently, the formatter relies on the lexer to get the individual string parts, and formats them including the comments accordingly. But, with PEP 701, f-string can also contain comments. Without this change, it becomes very difficult to add support for f-string formatting.
Implementation
The initial proposal was made in this discussion: #6183 (comment). There were various AST designs which were explored for this task which are available in the linked internal document1.
The selected variant was the one where the nodes were kept as it is except that the
implicit_concatenated
field was removed and instead a new struct was added to theExpr*
struct. This would be a private struct would contain the actual implementation of how the AST is designed for both single and implicitly concatenated strings.This implementation is achieved through an enum with two variants:
Single
andConcatenated
to avoid allocating a vector even for single strings. There are various public methods available on the value struct to query certain information regarding the node.The nodes are structured in the following way:
Visitor
The way the nodes are structured is that the entire string, including all the parts that are implicitly concatenation, is a single node containing individual nodes for the parts. The previous section has a representation of that tree for all the string nodes. This means that new visitor methods are added to visit the individual parts of string, bytes, and f-strings for
Visitor
,PreorderVisitor
, andTransformer
.Test Plan
cargo insta test --workspace --all-features --unreferenced reject
Footnotes
Internal document: https://www.notion.so/astral-sh/Implicit-String-Concatenation-e036345dc48943f89e416c087bf6f6d9?pvs=4 ↩