Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update string nodes for implicit concatenation #7927

Merged
merged 14 commits into from
Nov 24, 2023

Conversation

dhruvmanila
Copy link
Member

@dhruvmanila dhruvmanila commented Oct 12, 2023

Summary

This PR updates the string nodes (ExprStringLiteral, ExprBytesLiteral, and ExprFString) to account for implicit string concatenation.

Motivation

In Python, implicit string concatenation are joined while parsing because the interpreter doesn't require the information for each part. While that's feasible for an interpreter, it falls short for a static analysis tool where having such information is more useful. Currently, various parts of the code uses the lexer to get the individual string parts.

One of the main challenge this solves is that of string formatting. Currently, the formatter relies on the lexer to get the individual string parts, and formats them including the comments accordingly. But, with PEP 701, f-string can also contain comments. Without this change, it becomes very difficult to add support for f-string formatting.

Implementation

The initial proposal was made in this discussion: #6183 (comment). There were various AST designs which were explored for this task which are available in the linked internal document1.

The selected variant was the one where the nodes were kept as it is except that the implicit_concatenated field was removed and instead a new struct was added to the Expr* struct. This would be a private struct would contain the actual implementation of how the AST is designed for both single and implicitly concatenated strings.

This implementation is achieved through an enum with two variants: Single and Concatenated to avoid allocating a vector even for single strings. There are various public methods available on the value struct to query certain information regarding the node.

The nodes are structured in the following way:

ExprStringLiteral - "foo" "bar"
|- StringLiteral - "foo"
|- StringLiteral - "bar"

ExprBytesLiteral - b"foo" b"bar"
|- BytesLiteral - b"foo"
|- BytesLiteral - b"bar"

ExprFString - "foo" f"bar {x}"
|- FStringPart::Literal - "foo"
|- FStringPart::FString - f"bar {x}"
  |- StringLiteral - "bar "
  |- FormattedValue - "x"

Visitor

The way the nodes are structured is that the entire string, including all the parts that are implicitly concatenation, is a single node containing individual nodes for the parts. The previous section has a representation of that tree for all the string nodes. This means that new visitor methods are added to visit the individual parts of string, bytes, and f-strings for Visitor, PreorderVisitor, and Transformer.

Test Plan

  • cargo insta test --workspace --all-features --unreferenced reject
  • Verify that the ecosystem results are unchanged

Footnotes

  1. Internal document: https://www.notion.so/astral-sh/Implicit-String-Concatenation-e036345dc48943f89e416c087bf6f6d9?pvs=4

@dhruvmanila dhruvmanila added parser Related to the parser internal An internal refactor or improvement and removed parser Related to the parser labels Oct 12, 2023
@dhruvmanila dhruvmanila force-pushed the dhruv/implicit-str-concat-node branch 2 times, most recently from 7c763c5 to 02ff747 Compare October 19, 2023 13:48
@dhruvmanila dhruvmanila changed the base branch from main to dhruv/constant-to-literal October 19, 2023 13:48
@dhruvmanila dhruvmanila changed the title POC: AST node for implicit string concatenation New AST node for implicit string concatenation Oct 19, 2023
@dhruvmanila dhruvmanila force-pushed the dhruv/constant-to-literal branch from 5b20992 to 66ae3fc Compare October 19, 2023 17:52
@dhruvmanila dhruvmanila force-pushed the dhruv/implicit-str-concat-node branch from 02ff747 to ada537e Compare October 19, 2023 17:53
@dhruvmanila dhruvmanila force-pushed the dhruv/constant-to-literal branch from 66ae3fc to fd9f090 Compare October 19, 2023 19:27
@dhruvmanila dhruvmanila force-pushed the dhruv/implicit-str-concat-node branch from ada537e to 1ae9b42 Compare October 19, 2023 19:27
@dhruvmanila dhruvmanila force-pushed the dhruv/constant-to-literal branch from fd9f090 to 7111576 Compare October 19, 2023 19:33
@dhruvmanila dhruvmanila force-pushed the dhruv/implicit-str-concat-node branch 2 times, most recently from fc67e03 to 56e27ac Compare October 20, 2023 06:21
@dhruvmanila dhruvmanila force-pushed the dhruv/constant-to-literal branch from 7111576 to f544b3f Compare October 20, 2023 10:47
@dhruvmanila dhruvmanila force-pushed the dhruv/implicit-str-concat-node branch 2 times, most recently from 13534ce to 382e85d Compare October 20, 2023 14:07
@dhruvmanila dhruvmanila force-pushed the dhruv/constant-to-literal branch 2 times, most recently from cf8ee21 to aaf65a1 Compare October 30, 2023 05:50
Base automatically changed from dhruv/constant-to-literal to main October 30, 2023 06:43
@dhruvmanila dhruvmanila force-pushed the dhruv/implicit-str-concat-node branch 3 times, most recently from 07e49e4 to 8a392a2 Compare November 14, 2023 13:38
@dhruvmanila dhruvmanila changed the title New AST node for implicit string concatenation Update string nodes for implicit concatenation Nov 14, 2023
dhruvmanila added a commit that referenced this pull request Nov 22, 2023
As highlighted in the review:

> If you have two `ConcatenatedStringLiteral` values where both have
> equivalent values for `strings` but where one has `value` initialized
> and the other does not, would you expect them to compare equal?
> Semantically I think I would, since the alternative is that equality is
> dependent on whether `as_str()` has been called, which seems incidental.

#7927 (comment)
This commit implements the AST design to account for implicit
concatenation in string nodes, specifically the `ExprFString`,
`ExprStringLiteral`, and `ExprBytesLiteral` nodes.
This commit adds the new variants for the string parts to `AnyNode` and
`AnyNodeRef` enums. These parts are `StringLiteral` (part of
`ExprStringLiteral`), `BytesLiteral` (part of `ExprBytesLiteral`), and
`FString` (part of `ExprFString`).

The reason for this is to add visitor methods for these parts. This is
done in the following commit. So, the visitor would visit the string as
a whole first and then visit each part.

```
ExprStringLiteral - "foo" "bar"
|- StringLiteral - "foo"
|- StringLiteral - "bar"
```

The above tree helps understand the way visitor would work.
The visitor implementations are updated to visit each part nodes for the
respective string nodes.

The following example better highlights this:
```
ExprStringLiteral - "foo" "bar"
|- StringLiteral - "foo"
|- StringLiteral - "bar"
```

The `visit_expr` method would be use to visit the `ExprStringLiteral`
while the `visit_string_literal` method would be use for the
`StringLiteral` node. Similar methods are added for bytes and f-strings.
The generator is basically improved. Earlier, for an implicitly
concatenated string we would produce the joined form. So,

```python
"foo" "bar" "baz"
```

For the above example, the generator would give us:
```python
"foobarbaz"
```

Now, as we have the information for each part, we will be producing the
exact code back.
`Expr` is a general type for all expressions while
`LiteralExpressionRef` is a type which includes only the literal
expressions. The method is suited more for this type instead.

This will also help in the formatter change.
As highlighted in the review:

> If you have two `ConcatenatedStringLiteral` values where both have
> equivalent values for `strings` but where one has `value` initialized
> and the other does not, would you expect them to compare equal?
> Semantically I think I would, since the alternative is that equality is
> dependent on whether `as_str()` has been called, which seems incidental.

#7927 (comment)
inner(&mut expr.left, &mut formatted_expressions);
inner(&mut expr.right, &mut formatted_expressions);
formatted_expressions
.map(|output| Fix::safe_edit(Edit::range_replacement(output, f_string.range())))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, how did this get so much simpler?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's the magic of this refactor 😉

Jokes aside, this basically reverts this commit ac4a4da which fixed the bug where this rule wasn't producing the correct fix for an implicitly concatenated string. But, now that we work on individual parts, that isn't required.

Copy link
Member

@charliermarsh charliermarsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is impressive work. There's a huge surface area here. You did a great job of breaking down the individual PRs and walking us through it.

I left a few questions, which I'm happy to continued discussing, but erring on the side of approving.

#[derive(Clone, Debug, PartialEq)]
pub struct FString {
pub range: TextRange,
pub values: Vec<Expr>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if these should be smallvec... It's probably worth benchmarking in a separate PR. In practice, I'd bet the vast majority of concatenations are <= 5 elements.

@@ -130,7 +130,7 @@ pub(crate) fn unrecognized_platform(checker: &mut Checker, test: &Expr) {
if !matches!(value.as_str(), "linux" | "win32" | "cygwin" | "darwin") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example where... we can probably do this without the concatenation? Imagine if we had a custom matcher that worked on implicit concatenations (i.e., a vector of strings). \cc @BurntSushi in case you have obvious ideas here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the closest "out of the box" experience you can get is stream searching with aho-corasick, but you need to provide a std::io::Read impl. Mildly annoying but doable in this context.

But I would say to write the simple/obvious code for now, and if this ends up popping up on a profile then we can revisit it and do something bespoke here or use aho-corasick.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer to avoid the complexity for such simple cases but I do understand that this could be useful in similar cases where the strings are long.

@dhruvmanila dhruvmanila merged commit 017e829 into main Nov 24, 2023
17 checks passed
@dhruvmanila dhruvmanila deleted the dhruv/implicit-str-concat-node branch November 24, 2023 23:55
dhruvmanila added a commit that referenced this pull request Dec 7, 2023
Rebase of #6365 authored by @davidszotten.

## Summary

This PR updates the AST structure for an f-string elements.

The main **motivation** behind this change is to have a dedicated node
for the string part of an f-string. Previously, the existing
`ExprStringLiteral` node was used for this purpose which isn't exactly
correct. The `ExprStringLiteral` node should include the quotes as well
in the range but the f-string literal element doesn't include the quote
as it's a specific part within an f-string. For example,

```python
f"foo {x}"
# ^^^^
# This is the literal part of an f-string
```

The introduction of `FStringElement` enum is helpful which represent
either the literal part or the expression part of an f-string.

### Rule Updates

This means that there'll be two nodes representing a string depending on
the context. One for a normal string literal while the other is a string
literal within an f-string. The AST checker is updated to accommodate
this change. The rules which work on string literal are updated to check
on the literal part of f-string as well.

#### Notes

1. The `Expr::is_literal_expr` method would check for
`ExprStringLiteral` and return true if so. But now that we don't
represent the literal part of an f-string using that node, this improves
the method's behavior and confines to the actual expression. We do have
the `FStringElement::is_literal` method.
2. We avoid checking if we're in a f-string context before adding to
`string_type_definitions` because the f-string literal is now a
dedicated node and not part of `Expr`.
3. Annotations cannot use f-string so we avoid changing any rules which
work on annotation and checks for `ExprStringLiteral`.

## Test Plan

- All references of `Expr::StringLiteral` were checked to see if any of
the rules require updating to account for the f-string literal element
node.
- New test cases are added for rules which check against the literal
part of an f-string.
- Check the ecosystem results and ensure it remains unchanged.

## Performance

There's a performance penalty in the parser. The reason for this remains
unknown as it seems that the generated assembly code is now different
for the `__reduce154` function. The reduce function body is just popping
the `ParenthesizedExpr` on top of the stack and pushing it with the new
location.

- The size of `FStringElement` enum is the same as `Expr` which is what
it replaces in `FString::format_spec`
- The size of `FStringExpressionElement` is the same as
`ExprFormattedValue` which is what it replaces

I tried reducing the `Expr` enum from 80 bytes to 72 bytes but it hardly
resulted in any performance gain. The difference can be seen here:
- Original profile: https://share.firefox.dev/3Taa7ES
- Profile after boxing some node fields:
https://share.firefox.dev/3GsNXpD

### Backtracking

I tried backtracking the changes to see if any of the isolated change
produced this regression. The problem here is that the overall change is
so small that there's only a single checkpoint where I can backtrack and
that checkpoint results in the same regression. This checkpoint is to
revert using `Expr` to the `FString::format_spec` field. After this
point, the change would revert back to the original implementation.

## Review process

The review process is similar to #7927. The first set of commits update
the node structure, parser, and related AST files. Then, further commits
update the linter and formatter part to account for the AST change.

---------

Co-authored-by: David Szotten <davidszotten@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants