-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[refurb] Parse more exotic decimal strings in verbose-decimal-constructor (FURB157)
#14098
Conversation
Happy to hear any suggestions around the strange mix of mutability and shadowing in the implementation. I was attempting to avoid allocating a String if we didn't have to, and to not change too much from the implementation that already existed... Also happy to delete some comments if they are contributing more to clutter than understanding. |
|
// _after_ trimming whitespace from the string and removing all occurrences of "_". | ||
let mut trimmed = Cow::from(str_literal.to_str().trim_whitespace()); | ||
if memchr::memchr(b'_', trimmed.as_bytes()).is_some() { | ||
trimmed = Cow::from(trimmed.replace('_', "")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I don't think memchr
gives us much here, considering that most literals are very short. We also don't have to use replace
and allcoate a Cow::Owned
. Instead, I would use Cow::from(trimmed.trim_start_matches('_'))
for better readability (with the added benefit that it doesn't allocate)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine not using memchr, but I do think we need to use replace
because underscores may appear anywhere in the string. For example- Decimal("1_____00")
is valid.
Edit: whoops, didn't see that was already addressed below
// NB: We need not convert, e.g., `2e3` to `2000`. | ||
// However, `2e+3` is a syntax error, so we remove | ||
// the sign on the exponent. | ||
rest = Cow::from(rest.replace('+', "")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace
is technically O(n)
. Not that it matters much here but I think I would find it slightly more readable if we explicitly tested for the +
sign and the expected position.
if rest[index + 1] == b'+' {
rest.into_owned().remove(index + 1)
}
The alternative is to use remove_matches
which captures the semantic better than repalce
with a empty string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like remove_matches
but isn't it nightly only? https://doc.rust-lang.org/std/string/struct.String.html#method.remove_matches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah dang. I guess replace is still the best option
// 'e' was the last character in `rest`, and then the right | ||
// hand side will be `""`. | ||
let exponent = rest[index + 1..] | ||
.strip_prefix('+') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that the exponential syntax also supports negative numbers:
ruff/crates/ruff_python_parser/src/lexer.rs
Lines 1088 to 1100 in 51a9983
[b'e' | b'E', b'0'..=b'9', ..] | [b'e' | b'E', b'-' | b'+', b'0'..=b'9', ..] => { | |
// 'e' | 'E' | |
number.push(self.cursor.bump().unwrap()); | |
if let Some(sign) = self.cursor.eat_if(|c| matches!(c, '+' | '-')) { | |
number.push(sign); | |
} | |
self.radix_run(&mut number, Radix::Decimal); | |
true | |
} | |
_ => is_float, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you can have negative exponents. But this lint should not suggest replacing Decimal("1e-1")
with Decimal(1e-1)
. They are not the same! Decimal
offers a way to represent decimal numbers in an exact way, something that floating point representations do not support.
>>>> Decimal("1e-1") == Decimal(1e-1) # 1e-1 = 0.1 can not be represented exactly as a float
False
>>>> Decimal("5e-1") == Decimal(5e-1) # Okay, 5e-1 = 1/2 can be represented exactly
True
I think we need to be similarly careful with large numbers, even if they are integers:
>>>> Decimal("1e22") == Decimal(1e22)
True
>>>> Decimal("1e23") == Decimal(1e23)
False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as @sharkdp points out, the omission of negative exponents was intentional. I didn't think about large exponents though, good point!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard to figure out when this conversion is actually safe - parsing scientific notation is weird, and I can't seem to find a reference for the range where the representation is faithful. I think issues start to occur when numbers are larger than 1e16
but I'm having trouble finding a justification for that in code/documentation...
The smallest example I found so far was:
>>> Decimal(1801439850948199e1) == Decimal('1801439850948199e1')
False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FURB157 should not rewrite strings to float literals. Literals with exponents are float literals, so they should not be passed to the Decimal
constructor, or they will trigger decimal-from-float-literal
(RUF032).
$ cat ruf032.py
from decimal import Decimal
Decimal(1e16)
$ ruff check --isolated --preview --select RUF032 ruf032.py
ruf032.py:2:9: RUF032 `Decimal()` called with float literal argument
|
1 | from decimal import Decimal
2 | Decimal(1e16)
| ^^^^ RUF032
|
= help: Use a string literal instead
Found 1 error.
No fixes available (1 hidden fix can be enabled with the `--unsafe-fixes` option).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well that makes life easier! Thank you!
// NB: We have already checked that there is at most one 'e'. | ||
if !rest | ||
.bytes() | ||
.all(|c| c.is_ascii_digit() || c == b'e' || c == b'E') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this support 1_2_3_4
? or 1.1_2_3_4
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea above was to replace all _
s, not just leading ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I don't think replacing all of them is correct because _
are not permitted in the exponent part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they are:
>>> Decimal("2e1_0") == Decimal(2e1_0)
True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understanding your own code is always the hardest. So what's not allowed is Decimal(2._1)
ruff/crates/ruff_python_parser/src/lexer.rs
Lines 1072 to 1078 in 4323512
if self.cursor.eat_char('_') { | |
return self.push_error(LexicalError::new( | |
LexicalErrorType::OtherError("Invalid Syntax".to_string().into_boxed_str()), | |
TextRange::new(self.offset() - TextSize::new(1), self.offset()), | |
)); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Leading/trailing underscores are also not supported in "plain" Python syntax: _10
. But they are supported in Decimal
s syntax.
But I don't think there is a problem in this PR? All underscores are replaced. So if someone really has Decimal("_1_0_0__")
in their codebase, we would suggest replacing that with Decimal(100)
, which is fine.
// NB: We need not convert, e.g., `2e3` to `2000`. | ||
// However, `2e+3` is a syntax error, so we remove | ||
// the sign on the exponent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the positive sign in the exponent is a syntax error?
>>> 2e+3
2000.0
>>> Decimal(2e+3)
Decimal('2000')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right! Dunno why I thought it was...
24 | Decimal("__1____000") | ||
25 | Decimal("2e4") | ||
| | ||
= help: Replace with `1000` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be nice to preserve the thousand separator in that case and suggest 1_000
instead of 1000
. (If it's not too annoying to do)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good suggestion, and I think it would be a nice followup! I think one would want to trim leading/trailing underscores, and then replace every internal, contiguous sequence of underscores with a single underscore. At that point it may be worth a little refactoring to just walk through the string once and abort early as appropriate...
@@ -15,3 +15,29 @@ | |||
Decimal(0) | |||
Decimal("Infinity") | |||
decimal.Decimal(0) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we please also include cases for Decimal("-inf")
and Decimal("inf")
(both should error).
Sorry for sending everyone down an exponential rabbit hole, and thanks @dscorbett for digging me out of it! Exponent handling has been removed since it takes us out of the world of integer literals. |
@dylwil3 feel free to merge at your own discretion |
FURB157 suggests replacing expressions like
Decimal("123")
withDecimal(123)
. This PR extends the rule to cover cases where the input string toDecimal
can be easily transformed into an integer literal.For example:
Note: we do not implement the full decimal parsing logic from CPython on the grounds that certain acceptable string inputs to the
Decimal
constructor may be presumed purposeful on the part of the developer. For example, as in the linked issue,Decimal("١٢٣")
is valid and equal toDecimal(123)
, but we do not suggest a replacement in this case.Closes #13807