Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ruff] Implement unnecessary-regular-expression (RUF055) #14659

Merged
merged 35 commits into from
Nov 28, 2024

Conversation

ntBre
Copy link
Contributor

@ntBre ntBre commented Nov 28, 2024

Summary

This is a limited implementation of the rule idea from #12283 to replace some uses of the re module with str method calls. A few of the examples given there:

re.sub("abc", "", s)  # => s.replace("abc", "")
re.match("abc", s)  # => s.startswith("abc")
re.search("abc", s)  # => "abc" in s
re.fullmatch("abc", s) # => "abc" == s
re.split("abc", s)  # => s.split("abc")

For this initial implementation, I've restricted the rule to string literals in the pattern argument to each of the re functions and further restricted these string literals to exclude any re metacharacters. Each of the re functions takes additional kwargs that change their behavior, so the rule doesn't apply when these are present either. re.sub can also take a function as the replacement argument (unlike str.replace, which expects another str), so the rule is also restricted to cases where that argument is also a string literal. Finally, match, search, and fullmatch return Match objects unlike the proposed fixes, so the rule only applies when these are used in a boolean test for their truth values. For example,

if re.match("abc", s):
    pass

would trigger the rule, but the plain re.match("abc", s) call above would not because the returned Match could be used. I think this is probably a fairly common use case, so the rule can still be useful even with these restrictions.

The limitations around Match seem necessary, but some of the other restrictions can probably be loosened. For example, the sub replacement doesn't have to be a string literal, but it does need to be a string or at the very least not a function. Similarly, the patterns themselves could be plain str variables, but we need to inspect them for regex metacharacters. I didn't find a way to do that for non-literal strings, but if I missed it, that would be an easy improvement.

I think these checks can also be directly extended to the regex package. I saw unraw-re-pattern (RUF039), for example, handles both re and regex, but I only handled re for now.

Test Plan

cargo test with new RUF055.py snapshot test.

Possible related rule

Right before submitting this, I tried running RUF055.py with python, and it crashed with a ValueError: cannot use LOCALE flag with a str pattern. That would be an easy thing to check with very similar code to what I have here.

@Skylion007
Copy link
Contributor

Skylion007 commented Nov 28, 2024

@dosisod This seems like it would be a good 'refurb' rule for your linter

Copy link
Contributor

github-actions bot commented Nov 28, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

ℹ️ ecosystem check detected linter changes. (+50 -0 violations, +0 -0 fixes in 3 projects; 52 projects unchanged)

apache/airflow (+6 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview --select ALL

+ helm_tests/airflow_aux/test_pod_template_file.py:358:16: RUF055 [*] Plain string pattern passed to `re` function
+ helm_tests/airflow_aux/test_pod_template_file.py:370:16: RUF055 [*] Plain string pattern passed to `re` function
+ helm_tests/airflow_aux/test_pod_template_file.py:407:16: RUF055 [*] Plain string pattern passed to `re` function
+ helm_tests/airflow_aux/test_pod_template_file.py:59:16: RUF055 [*] Plain string pattern passed to `re` function
+ helm_tests/airflow_aux/test_pod_template_file.py:97:16: RUF055 [*] Plain string pattern passed to `re` function
+ providers/tests/cncf/kubernetes/log_handlers/test_log_handlers.py:158:16: RUF055 [*] Plain string pattern passed to `re` function

zulip/zulip (+40 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview --select ALL

+ zerver/lib/test_classes.py:2208:20: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_delete_unclaimed_attachments.py:36:23: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_markdown_thumbnail.py:140:23: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_realm.py:2401:23: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_realm.py:2476:64: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_scheduled_messages.py:655:20: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_scheduled_messages.py:661:20: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_thumbnail.py:315:23: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_thumbnail.py:414:23: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_thumbnail.py:465:23: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_thumbnail.py:480:23: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_thumbnail.py:494:23: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_thumbnail.py:601:27: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_thumbnail.py:651:32: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_thumbnail.py:706:27: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_thumbnail.py:747:23: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload.py:509:22: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload.py:528:22: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload.py:588:22: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload.py:592:22: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload.py:603:22: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload.py:691:22: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload.py:751:22: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload.py:797:22: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload.py:849:22: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload.py:918:22: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload.py:963:22: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload.py:988:26: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload_local.py:115:23: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload_local.py:39:19: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload_local.py:52:19: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload_local.py:62:19: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload_local.py:96:19: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload_s3.py:126:19: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload_s3.py:140:23: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload_s3.py:173:29: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload_s3.py:180:29: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload_s3.py:62:19: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload_s3.py:79:19: RUF055 [*] Plain string pattern passed to `re` function
+ zerver/tests/test_upload_s3.py:91:19: RUF055 [*] Plain string pattern passed to `re` function

astropy/astropy (+4 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

+ astropy/coordinates/tests/test_frames.py:364:11: RUF055 Plain string pattern passed to `re` function
+ astropy/io/fits/card.py:771:21: RUF055 [*] Plain string pattern passed to `re` function
+ astropy/io/registry/base.py:446:25: RUF055 [*] Plain string pattern passed to `re` function
+ astropy/time/tests/test_fast_parser.py:16:15: RUF055 [*] Plain string pattern passed to `re` function

Changes by rule (1 rules affected)

code total + violation - violation + fix - fix
RUF055 50 50 0 0 0

@AlexWaygood AlexWaygood added the great writeup A wonderful example of a quality contribution label Nov 28, 2024
Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overall looks great. You made this look simple.

The only thing that I notice we miss is raw-string support (or, at least, tests for it). Raw strings are the recommended way to write regex patterns in python because it avoids the need for double escaping.

Comment on lines 211 to 218
// For now, reject any regex metacharacters. Compare to the complete list
// from https://docs.python.org/3/howto/regex.html#matching-characters
let has_metacharacters = string_lit.value.chars().any(|c| {
matches!(
c,
'.' | '^' | '$' | '*' | '+' | '?' | '{' | '}' | '[' | ']' | '\\' | '|' | '(' | ')'
)
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how you intentionally excluded meta-characters. So consider this an extension, and I think it's totally fine to do this as a follow-up pr (or not at all).

It would be nice if the rule only skips replacement for characters that are different between regex expressions and regular strings. For example, \n matches \n in a regex and a string.

Copy link
Member

@AlexWaygood AlexWaygood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the excellent PR writeup -- it made reviewing this really easy! This looks great overall.

The limitations around Match seem necessary, but some of the other restrictions can probably be loosened. For example, the sub replacement doesn't have to be a string literal, but it does need to be a string or at the very least not a function. Similarly, the patterns themselves could be plain str variables, but we need to inspect them for regex metacharacters. I didn't find a way to do that for non-literal strings, but if I missed it, that would be an easy improvement.

We don't have an out-of-the-box way of doing this for strings right now, so I wouldn't try to tackle it in this PR. But if you're interested, a followup might be to add an is_str() function to ruff_python_semantic::analyze::typing that looks similar to this is_list function:

/// Test whether the given binding can be considered a list.
///
/// For this, we check what value might be associated with it through it's initialization and
/// what annotation it has (we consider `list` and `typing.List`)
pub fn is_list(binding: &Binding, semantic: &SemanticModel) -> bool {
check_type::<ListChecker>(binding, semantic)
}

And then you could use that in this rule for stronger type inference

ntBre and others added 2 commits November 28, 2024 11:49
Co-authored-by: Micha Reiser <micha@reiser.io>
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
@AlexWaygood AlexWaygood added rule Implementing or modifying a lint rule preview Related to preview mode features labels Nov 28, 2024
@ntBre
Copy link
Contributor Author

ntBre commented Nov 28, 2024

Thank you both for the great reviews! I think I've incorporated all of the suggestions, with the exception of handling simple escapes like \n. I want to search around a bit for how escapes are handled elsewhere in the code, but if nothing else, it shouldn't be that hard to allow a few common escapes like \n at least. I'm also happy to leave that as a follow-up though.

Similarly, I'm quite interested in the is_str idea, but I agree that that should be separate. I'll plan to look into that next.

Copy link
Member

@AlexWaygood AlexWaygood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks!

ntBre and others added 8 commits November 28, 2024 13:07
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Co-authored-by: Simon Brugman <sbrugman@users.noreply.github.com>
Co-authored-by: Simon Brugman <sbrugman@users.noreply.github.com>
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
@AlexWaygood AlexWaygood merged commit 224fe75 into astral-sh:main Nov 28, 2024
21 checks passed
32 32 |
33 33 | # this should be replaced with "abc" == s
34 |-if re.fullmatch("abc", s):
34 |+if "abc" == s:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a minor note, I think this should be s == "abc". It's a minor stylistic difference (in which I think x == VALUE is more idiomatic), but it can cause actual differences due to type checking implementations, e.g. microsoft/pyright#9093

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
great writeup A wonderful example of a quality contribution preview Related to preview mode features rule Implementing or modifying a lint rule
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants