Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[refurb] Parse more exotic decimal strings in verbose-decimal-constructor (FURB157) #14098

Merged
merged 6 commits into from
Nov 5, 2024

Conversation

dylwil3
Copy link
Collaborator

@dylwil3 dylwil3 commented Nov 4, 2024

FURB157 suggests replacing expressions like Decimal("123") with Decimal(123). This PR extends the rule to cover cases where the input string to Decimal can be easily transformed into an integer literal.

For example:

Decimal("1__000")   # fix: `Decimal(1000)`

Note: we do not implement the full decimal parsing logic from CPython on the grounds that certain acceptable string inputs to the Decimal constructor may be presumed purposeful on the part of the developer. For example, as in the linked issue, Decimal("١٢٣") is valid and equal to Decimal(123), but we do not suggest a replacement in this case.

Closes #13807

@dylwil3
Copy link
Collaborator Author

dylwil3 commented Nov 4, 2024

Happy to hear any suggestions around the strange mix of mutability and shadowing in the implementation. I was attempting to avoid allocating a String if we didn't have to, and to not change too much from the implementation that already existed...

Also happy to delete some comments if they are contributing more to clutter than understanding.

Copy link
Contributor

github-actions bot commented Nov 4, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

// _after_ trimming whitespace from the string and removing all occurrences of "_".
let mut trimmed = Cow::from(str_literal.to_str().trim_whitespace());
if memchr::memchr(b'_', trimmed.as_bytes()).is_some() {
trimmed = Cow::from(trimmed.replace('_', ""));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I don't think memchr gives us much here, considering that most literals are very short. We also don't have to use replace and allcoate a Cow::Owned. Instead, I would use Cow::from(trimmed.trim_start_matches('_')) for better readability (with the added benefit that it doesn't allocate)

Copy link
Collaborator Author

@dylwil3 dylwil3 Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine not using memchr, but I do think we need to use replace because underscores may appear anywhere in the string. For example- Decimal("1_____00") is valid.

Edit: whoops, didn't see that was already addressed below

// NB: We need not convert, e.g., `2e3` to `2000`.
// However, `2e+3` is a syntax error, so we remove
// the sign on the exponent.
rest = Cow::from(rest.replace('+', ""));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace is technically O(n). Not that it matters much here but I think I would find it slightly more readable if we explicitly tested for the + sign and the expected position.

if rest[index + 1] == b'+' {
    rest.into_owned().remove(index + 1)
}

The alternative is to use remove_matches which captures the semantic better than repalce with a empty string

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah dang. I guess replace is still the best option

// 'e' was the last character in `rest`, and then the right
// hand side will be `""`.
let exponent = rest[index + 1..]
.strip_prefix('+')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the exponential syntax also supports negative numbers:

[b'e' | b'E', b'0'..=b'9', ..] | [b'e' | b'E', b'-' | b'+', b'0'..=b'9', ..] => {
// 'e' | 'E'
number.push(self.cursor.bump().unwrap());
if let Some(sign) = self.cursor.eat_if(|c| matches!(c, '+' | '-')) {
number.push(sign);
}
self.radix_run(&mut number, Radix::Decimal);
true
}
_ => is_float,

Copy link
Contributor

@sharkdp sharkdp Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can have negative exponents. But this lint should not suggest replacing Decimal("1e-1") with Decimal(1e-1). They are not the same! Decimal offers a way to represent decimal numbers in an exact way, something that floating point representations do not support.

>>>> Decimal("1e-1") == Decimal(1e-1)  # 1e-1 = 0.1 can not be represented exactly as a float
False
>>>> Decimal("5e-1") == Decimal(5e-1)  # Okay, 5e-1 = 1/2 can be represented exactly
True

I think we need to be similarly careful with large numbers, even if they are integers:

>>>> Decimal("1e22") == Decimal(1e22)
True
>>>> Decimal("1e23") == Decimal(1e23)
False

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as @sharkdp points out, the omission of negative exponents was intentional. I didn't think about large exponents though, good point!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to figure out when this conversion is actually safe - parsing scientific notation is weird, and I can't seem to find a reference for the range where the representation is faithful. I think issues start to occur when numbers are larger than 1e16 but I'm having trouble finding a justification for that in code/documentation...

The smallest example I found so far was:

>>> Decimal(1801439850948199e1) == Decimal('1801439850948199e1')
False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FURB157 should not rewrite strings to float literals. Literals with exponents are float literals, so they should not be passed to the Decimal constructor, or they will trigger decimal-from-float-literal (RUF032).

$ cat ruf032.py                                                   
from decimal import Decimal
Decimal(1e16)

$ ruff check --isolated --preview --select RUF032 ruf032.py 
ruf032.py:2:9: RUF032 `Decimal()` called with float literal argument
  |
1 | from decimal import Decimal
2 | Decimal(1e16)
  |         ^^^^ RUF032
  |
  = help: Use a string literal instead

Found 1 error.
No fixes available (1 hidden fix can be enabled with the `--unsafe-fixes` option).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well that makes life easier! Thank you!

Comment on lines 126 to 129
// NB: We have already checked that there is at most one 'e'.
if !rest
.bytes()
.all(|c| c.is_ascii_digit() || c == b'e' || c == b'E')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this support 1_2_3_4? or 1.1_2_3_4?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea above was to replace all _s, not just leading ones.

Copy link
Member

@MichaReiser MichaReiser Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I don't think replacing all of them is correct because _ are not permitted in the exponent part.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@MichaReiser MichaReiser Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understanding your own code is always the hardest. So what's not allowed is Decimal(2._1)

if self.cursor.eat_char('_') {
return self.push_error(LexicalError::new(
LexicalErrorType::OtherError("Invalid Syntax".to_string().into_boxed_str()),
TextRange::new(self.offset() - TextSize::new(1), self.offset()),
));
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Leading/trailing underscores are also not supported in "plain" Python syntax: _10. But they are supported in Decimals syntax.

But I don't think there is a problem in this PR? All underscores are replaced. So if someone really has Decimal("_1_0_0__") in their codebase, we would suggest replacing that with Decimal(100), which is fine.

Comment on lines 116 to 118
// NB: We need not convert, e.g., `2e3` to `2000`.
// However, `2e+3` is a syntax error, so we remove
// the sign on the exponent.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the positive sign in the exponent is a syntax error?

>>> 2e+3
2000.0
>>> Decimal(2e+3)
Decimal('2000')

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right! Dunno why I thought it was...

24 | Decimal("__1____000")
25 | Decimal("2e4")
|
= help: Replace with `1000`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be nice to preserve the thousand separator in that case and suggest 1_000 instead of 1000. (If it's not too annoying to do)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good suggestion, and I think it would be a nice followup! I think one would want to trim leading/trailing underscores, and then replace every internal, contiguous sequence of underscores with a single underscore. At that point it may be worth a little refactoring to just walk through the string once and abort early as appropriate...

@@ -15,3 +15,29 @@
Decimal(0)
Decimal("Infinity")
decimal.Decimal(0)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please also include cases for Decimal("-inf") and Decimal("inf") (both should error).

@dylwil3
Copy link
Collaborator Author

dylwil3 commented Nov 5, 2024

Sorry for sending everyone down an exponential rabbit hole, and thanks @dscorbett for digging me out of it!

Exponent handling has been removed since it takes us out of the world of integer literals.

@MichaReiser
Copy link
Member

@dylwil3 feel free to merge at your own discretion

@dylwil3 dylwil3 merged commit 2b76fa8 into astral-sh:main Nov 5, 2024
20 checks passed
@dylwil3 dylwil3 deleted the decimal-parsing branch November 5, 2024 19:33
@dhruvmanila dhruvmanila added the bug Something isn't working label Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[verbose-decimal-constructor (FURB157)] doesn't trigger on Decimal("1_000")
7 participants