[ruff] Extend unnecessary-regular-expression to non-literal strings (RUF055) #14679

Merged on Dec 3, 2024 (22 commits)
Changes from 16 commits
17 changes: 17 additions & 0 deletions crates/ruff_linter/resources/test/fixtures/ruff/RUF055_1.py
@@ -0,0 +1,17 @@
"""Test that RUF055 can follow a single str assignment for both the pattern and
the replacement argument to re.sub
"""

import re

pat1 = "needle"

re.sub(pat1, "", haystack)

# aliases are not followed, so this one should not trigger the rule
if pat4 := pat1:
    re.sub(pat4, "", haystack)

# also works for the `repl` argument in sub
repl = "new"
re.sub(r"abc", repl, haystack)
3 changes: 2 additions & 1 deletion crates/ruff_linter/src/rules/ruff/mod.rs
@@ -409,7 +409,8 @@ mod tests {
     #[test_case(Rule::MapIntVersionParsing, Path::new("RUF048_1.py"))]
     #[test_case(Rule::UnrawRePattern, Path::new("RUF039.py"))]
     #[test_case(Rule::UnrawRePattern, Path::new("RUF039_concat.py"))]
-    #[test_case(Rule::UnnecessaryRegularExpression, Path::new("RUF055.py"))]
+    #[test_case(Rule::UnnecessaryRegularExpression, Path::new("RUF055_0.py"))]
+    #[test_case(Rule::UnnecessaryRegularExpression, Path::new("RUF055_1.py"))]
     fn preview_rules(rule_code: Rule, path: &Path) -> Result<()> {
         let snapshot = format!(
             "preview__{}_{}",
@@ -1,8 +1,10 @@
 use ruff_diagnostics::{AlwaysFixableViolation, Applicability, Diagnostic, Edit, Fix};
 use ruff_macros::{derive_message_formats, ViolationMetadata};
+use ruff_python_ast::ExprStringLiteral;
 use ruff_python_ast::{
     Arguments, CmpOp, Expr, ExprAttribute, ExprCall, ExprCompare, ExprContext, Identifier,
 };
+use ruff_python_semantic::analyze::typing::find_binding_value;
 use ruff_python_semantic::{Modules, SemanticModel};
 use ruff_text_size::TextRange;

@@ -90,8 +92,8 @@ pub(crate) fn unnecessary_regular_expression(checker: &mut Checker, call: &ExprC
         return;
     };

-    // For now, restrict this rule to string literals
-    let Some(string_lit) = re_func.pattern.as_string_literal_expr() else {
+    // For now, restrict this rule to string literals and variables that can be resolved to literals
+    let Some(string_lit) = resolve_string_literal(re_func.pattern, semantic) else {
         return;
     };

@@ -173,9 +175,8 @@ impl<'a> ReFunc<'a> {
             // version
             ("sub", 3) => {
                 let repl = call.arguments.find_argument("repl", 1)?;
-                if !repl.is_string_literal_expr() {
-                    return None;
-                }
+                // make sure repl can be resolved to a string literal
+                resolve_string_literal(repl, semantic)?;
Member

If I understand correctly, I think the need here is slightly different to the need on lines 95-96. On lines 95-96, we do need to know the value of the string in order to be able to check it doesn't have any metacharacters in it (so only a string literal will do, or something that we can resolve to a string literal). But here, we just need to know it's a string; any string will do, as long as the user isn't passing in a function.

Is that the case? If so, it might be worth adding back the is_str function you added in efcc4cf and using that here, rather than using resolve_string_literal in both places. The advantage of the is_str technique is that it also understands basic type hints, e.g. it would understand that re.sub() is being passed a string for the repl argument in something like this:

import re

def foo(input_str: str, repl: str):
    re.sub("foobar", repl, input_str)

Member
@AlexWaygood, Nov 29, 2024

Now that I look at it, maybe we do need to know (and analyze) the value of the repl string for a fully accurate analysis, though. For example, it seems like the initial version of the check that we merged yesterday emits a false-positive diagnostic (and incorrect autofix) for this:

import re

re.sub(r"a", r"\g<0>\g<0>\g<0>", "a")

Now, this is a massive edge case -- I had to work quite hard to find it! I believe the only way you get a false positive with the rule's current logic is if there's a \g in the replacement string but no backslashes or metacharacters in the pattern string, and it's almost impossible to think of a way you could plausibly have a re.sub() call with those characteristics. So maybe we shouldn't worry about this -- I'm interested in your thoughts and @MichaReiser's!
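
To make the edge case concrete, here is a small check (a sketch added for illustration, not part of the PR). It shows that `\g<0>` in the replacement refers back to the whole match, so `re.sub` and the suggested `str.replace` rewrite disagree even though the pattern itself is a plain string:

```python
import re

pattern = r"a"
repl = r"\g<0>\g<0>\g<0>"
haystack = "a"

# re.sub expands \g<0> to the entire match, so the single "a" is tripled.
via_re = re.sub(pattern, repl, haystack)

# str.replace inserts the replacement text literally, backslashes included.
via_str = haystack.replace(pattern, repl)

print(via_re)   # aaa
print(via_str)  # \g<0>\g<0>\g<0>
```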

Contributor Author

Oh wow, good catch! I was working on adding is_str back in, but maybe instead I need to check for metacharacters in repl too.

I thought we were safe from backreferences by avoiding ( in the pattern, but I overlooked \g<0>. That exact sequence seems like the only way to trigger this behavior?

Member
@AlexWaygood, Nov 29, 2024

> I thought we were safe from backreferences by avoiding ( in the pattern, but I overlooked \g<0>. That exact sequence seems like the only way to trigger this behavior?

I think so, yes! Although we also emit a RUF055 diagnostic on invalid re.sub() calls like this, and maybe we should just ignore them? It feels like it might be outside of this rule's purview to autofix invalid re.sub() calls into valid str.replace() calls. We probably don't really know what the user intended exactly if the re.sub() call is invalid:

>>> import re
>>> re.sub(r"a", r"\1", "a")
Traceback (most recent call last):
  File "<python-input-12>", line 1, in <module>
    re.sub(r"a", r"\1", "a")
    ~~~~~~^^^^^^^^^^^^^^^^^^
  File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/__init__.py", line 208, in sub
    return _compile(pattern, flags).sub(repl, string, count)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/__init__.py", line 377, in _compile_template
    return _sre.template(pattern, _parser.parse_template(repl, pattern))
                                  ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
  File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/_parser.py", line 1070, in parse_template
    addgroup(int(this[1:]), len(this) - 1)
    ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/_parser.py", line 1015, in addgroup
    raise s.error("invalid group reference %d" % index, pos)
re.PatternError: invalid group reference 1 at position 1

Contributor Author

I've added back is_str locally, along with your function argument test case. It's really nice to handle that case, but I'm a bit bothered by this edge case too, so I could go either way. I'm interested to hear which approach you and Micha think is best overall.

Member

Why don't you push the version with is_str to this PR, and we can see if it results in any more ecosystem hits? That might give us some more data on how useful it is to be able to detect that the repl argument is a string from the function annotation.

Member

Hmm, it doesn't look like it adds any new ecosystem hits :/

I guess in that case, I'd vote for removing is_str again, and fixing the false positives on \g<0> and \1 in repl arguments.

Thanks for putting up with my pernicketiness here!

Contributor Author

No problem, thanks for the thorough review! Should I reuse the other code to reject any metacharacters, or are references to named or numbered capture groups the only problems? I'm picturing checking for \ followed by g or 1 through 9. That seems a bit nicer than rejecting any metacharacter like I did for the patterns but possibly less safe.

Member

> I'm picturing checking for \ followed by g or 1 through 9. That seems a bit nicer than rejecting any metacharacter like I did for the patterns but possibly less safe.

I think actually we could check for \ followed by any ASCII character except one of abfnrtv. Other than 0-9 and g (which both have special behaviour in repl strings, as we've just been discussing!), I believe those are the only ASCII escapes that will be permitted in a repl string by re.sub(). Anything else causes re.PatternError to be raised -- meaning it's probably out of scope for us to emit this diagnostic on it:

>>> re.sub(r"a", r"\d", "a")
Traceback (most recent call last):
  File "<python-input-13>", line 1, in <module>
    re.sub(r"a", r"\d", "a")
    ~~~~~~^^^^^^^^^^^^^^^^^^
  File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/__init__.py", line 208, in sub
    return _compile(pattern, flags).sub(repl, string, count)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/__init__.py", line 377, in _compile_template
    return _sre.template(pattern, _parser.parse_template(repl, pattern))
                                  ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
  File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/_parser.py", line 1076, in parse_template
    raise s.error('bad escape %s' % this, len(this)) from None
re.PatternError: bad escape \d at position 0
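
The check proposed above could be sketched roughly as follows. This is a hypothetical Python illustration of the idea, not the code that landed in the PR (which is written in Rust); the name `repl_has_special_escape` and the choice to flag a trailing backslash are assumptions made for the example:

```python
# Characters that, per the comment above, may follow a backslash in a
# `repl` string without group-reference behaviour: \a \b \f \n \r \t \v.
ALLOWED_AFTER_BACKSLASH = set("abfnrtv")


def repl_has_special_escape(repl: str) -> bool:
    """Return True if `repl` contains a backslash followed by any ASCII
    character other than one of a, b, f, n, r, t, v (hypothetical helper)."""
    chars = iter(repl)
    for ch in chars:
        if ch != "\\":
            continue
        # Consume the escaped character so it is not re-examined.
        nxt = next(chars, None)
        if nxt is None:
            # Trailing backslash: be conservative and flag it (assumption).
            return True
        if nxt.isascii() and nxt not in ALLOWED_AFTER_BACKSLASH:
            return True
    return False


for candidate in ["new", r"\n", r"\g<0>", r"\1", r"\d"]:
    print(repr(candidate), repl_has_special_escape(candidate))
```

Under this sketch, a rule would skip the diagnostic whenever the helper returns True, which covers both the `\g<0>` false positive and the invalid `\1` and `\d` calls discussed in this thread.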

Contributor Author

Let me know what you think about this version. If it looks good, it might be nice to reuse this escape check for pattern as well instead of rejecting \ entirely.

                 Some(ReFunc {
                     kind: ReFuncKind::Sub { repl },
                     pattern: call.arguments.find_argument("pattern", 0)?,
@@ -248,3 +249,23 @@ impl<'a> ReFunc<'a> {
         })
     }
 }
+
+/// Try to resolve `name` to an [`ExprStringLiteral`] in `semantic`.
+fn resolve_string_literal<'a>(
+    name: &'a Expr,
+    semantic: &'a SemanticModel,
+) -> Option<&'a ExprStringLiteral> {
+    if name.is_string_literal_expr() {
+        return name.as_string_literal_expr();
+    }
+
+    if let Some(name_expr) = name.as_name_expr() {
+        let binding = semantic.binding(semantic.only_binding(name_expr)?);
+        let value = find_binding_value(binding, semantic)?;
+        if value.is_string_literal_expr() {
+            return value.as_string_literal_expr();
+        }
+    }
+
+    None
+}
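
For readers less familiar with Ruff's semantic model, the resolution strategy in `resolve_string_literal` can be loosely mirrored with Python's standard `ast` module. The sketch below is only an analogue under simplifying assumptions: it ignores scopes entirely and treats "exactly one store to the name in the module" as a stand-in for `only_binding`, so it is not equivalent to the Rust helper.

```python
import ast
from typing import Optional


def resolve_string_literal(node: ast.expr, module: ast.Module) -> Optional[str]:
    """Hypothetical, scope-unaware analogue of the Rust helper above."""
    # Case 1: the argument is already a string literal.
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value

    # Case 2: a bare name that is stored to exactly once in the module...
    if isinstance(node, ast.Name):
        stores = [
            n
            for n in ast.walk(module)
            if isinstance(n, ast.Name)
            and n.id == node.id
            and isinstance(n.ctx, ast.Store)
        ]
        if len(stores) != 1:
            return None
        # ...and that single store is a plain `name = "literal"` assignment.
        for stmt in ast.walk(module):
            if (
                isinstance(stmt, ast.Assign)
                and len(stmt.targets) == 1
                and stmt.targets[0] is stores[0]
                and isinstance(stmt.value, ast.Constant)
                and isinstance(stmt.value.value, str)
            ):
                return stmt.value.value
    return None


source = '''
import re

pat1 = "needle"
re.sub(pat1, "", haystack)

if pat4 := pat1:
    re.sub(pat4, "", haystack)
'''

module = ast.parse(source)
resolved = {
    call.args[0].id: resolve_string_literal(call.args[0], module)
    for call in ast.walk(module)
    if isinstance(call, ast.Call)
}
print(resolved)  # {'pat1': 'needle', 'pat4': None}
```

As in the `RUF055_1.py` fixture, the directly assigned `pat1` resolves to its literal, while the alias `pat4` does not, because its bound value is another name rather than a string literal.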
@@ -2,7 +2,7 @@
 source: crates/ruff_linter/src/rules/ruff/mod.rs
 snapshot_kind: text
 ---
-RUF055.py:6:1: RUF055 [*] Plain string pattern passed to `re` function
+RUF055_0.py:6:1: RUF055 [*] Plain string pattern passed to `re` function
 |
 5 | # this should be replaced with s.replace("abc", "")
 6 | re.sub("abc", "", s)
@@ -20,7 +20,7 @@ RUF055.py:6:1: RUF055 [*] Plain string pattern passed to `re` function
 8 8 |
 9 9 | # this example, adapted from https://docs.python.org/3/library/re.html#re.sub,

-RUF055.py:22:4: RUF055 [*] Plain string pattern passed to `re` function
+RUF055_0.py:22:4: RUF055 [*] Plain string pattern passed to `re` function
 |
 20 | # this one should be replaced with s.startswith("abc") because the Match is
 21 | # used in an if context for its truth value
@@ -41,7 +41,7 @@ RUF055.py:22:4: RUF055 [*] Plain string pattern passed to `re` function
 24 24 | if m := re.match("abc", s): # this should *not* be replaced
 25 25 | pass

-RUF055.py:29:4: RUF055 [*] Plain string pattern passed to `re` function
+RUF055_0.py:29:4: RUF055 [*] Plain string pattern passed to `re` function
 |
 28 | # this should be replaced with "abc" in s
 29 | if re.search("abc", s):
@@ -61,7 +61,7 @@ RUF055.py:29:4: RUF055 [*] Plain string pattern passed to `re` function
 31 31 | re.search("abc", s) # this should not be replaced
 32 32 |

-RUF055.py:34:4: RUF055 [*] Plain string pattern passed to `re` function
+RUF055_0.py:34:4: RUF055 [*] Plain string pattern passed to `re` function
 |
 33 | # this should be replaced with "abc" == s
 34 | if re.fullmatch("abc", s):
@@ -81,7 +81,7 @@ RUF055.py:34:4: RUF055 [*] Plain string pattern passed to `re` function
 36 36 | re.fullmatch("abc", s) # this should not be replaced
 37 37 |

-RUF055.py:39:1: RUF055 [*] Plain string pattern passed to `re` function
+RUF055_0.py:39:1: RUF055 [*] Plain string pattern passed to `re` function
 |
 38 | # this should be replaced with s.split("abc")
 39 | re.split("abc", s)
@@ -101,7 +101,7 @@ RUF055.py:39:1: RUF055 [*] Plain string pattern passed to `re` function
 41 41 | # these currently should not be modified because the patterns contain regex
 42 42 | # metacharacters

-RUF055.py:70:1: RUF055 [*] Plain string pattern passed to `re` function
+RUF055_0.py:70:1: RUF055 [*] Plain string pattern passed to `re` function
 |
 69 | # this should trigger an unsafe fix because of the presence of comments
 70 | / re.sub(
@@ -0,0 +1,40 @@
---
source: crates/ruff_linter/src/rules/ruff/mod.rs
snapshot_kind: text
---
RUF055_1.py:9:1: RUF055 [*] Plain string pattern passed to `re` function
|
7 | pat1 = "needle"
8 |
9 | re.sub(pat1, "", haystack)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF055
10 |
11 | # aliases are not followed, so this one should not trigger the rule
|
= help: Replace with `haystack.replace(pat1, "")`

ℹ Safe fix
6 6 |
7 7 | pat1 = "needle"
8 8 |
9 |-re.sub(pat1, "", haystack)
9 |+haystack.replace(pat1, "")
10 10 |
11 11 | # aliases are not followed, so this one should not trigger the rule
12 12 | if pat4 := pat1:

RUF055_1.py:17:1: RUF055 [*] Plain string pattern passed to `re` function
|
15 | # also works for the `repl` argument in sub
16 | repl = "new"
17 | re.sub(r"abc", repl, haystack)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RUF055
|
= help: Replace with `haystack.replace("abc", repl)`

ℹ Safe fix
14 14 |
15 15 | # also works for the `repl` argument in sub
16 16 | repl = "new"
17 |-re.sub(r"abc", repl, haystack)
17 |+haystack.replace("abc", repl)
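
The rewrites named in the snapshot comments above (`s.replace`, `s.startswith`, `"abc" in s`, `"abc" == s`, `s.split`) can be sanity-checked for a metacharacter-free pattern with a few lines of Python. This is an illustrative check on one sample string, assumed for the example, rather than a proof of equivalence:

```python
import re

s = "abc-xyz-abc"  # sample input chosen for illustration

# Each entry pairs a `re` call with the plain-string form the fix suggests.
checks = {
    "sub": re.sub("abc", "", s) == s.replace("abc", ""),
    "match": bool(re.match("abc", s)) == s.startswith("abc"),
    "search": bool(re.search("abc", s)) == ("abc" in s),
    "fullmatch": bool(re.fullmatch("abc", s)) == ("abc" == s),
    "split": re.split("abc", s) == s.split("abc"),
}

for name, agrees in checks.items():
    print(f"{name}: {agrees}")
```

Note that the `match`, `search`, and `fullmatch` forms only agree on truth value, which is why the snapshot leaves calls alone when the `Match` object itself is used.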