Fix long value parsing in jsonextractscalar #14337
Conversation
```diff
@@ -184,8 +184,12 @@ public long[] transformToLongValuesSV(ValueBlock valueBlock) {
       if (result instanceof Number) {
         _longValuesSV[i] = ((Number) result).longValue();
       } else {
-        // Handle scientific notation
-        _longValuesSV[i] = (long) Double.parseDouble(result.toString());
+        try {
```
@Jackie-Jiang @gortiz
Do we need to support scientific notation here?
If so, what do you think of copying the parseLong code and throwing a static exception instead, to curb potential allocations and the stack-trace overhead?
Seems like we are mixing CAST into this method. IMO the correct fix should be:
- Use Long.parseLong() here
- Check the result type, and fall back to the default when the result type doesn't match the method (see CastTransformFunction as an example)

Seems the other json extract functions also have this problem.
I think we could iterate over the string looking for digits or -. If we only find those, we optimistically call Long.parseLong. If that fails, or if we find anything other than - or digits, we call Double.parseDouble. Unless I'm missing something, this should work in almost all cases except very large numbers (which we can probably also check statically by comparing the length of the string).
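The scan-then-parse idea above could be sketched roughly like this (a sketch only; the class and method names are hypothetical, not Pinot code):

```java
// Sketch of the optimistic approach described above; names are made up.
public final class OptimisticLongParser {
  private OptimisticLongParser() {
  }

  public static long parse(String value) {
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);
      boolean longChar = (c >= '0' && c <= '9') || (c == '-' && i == 0);
      if (!longChar) {
        // Fractional part, exponent, or anything else: take the double path.
        return (long) Double.parseDouble(value);
      }
    }
    try {
      return Long.parseLong(value);
    } catch (NumberFormatException e) {
      // e.g. more than 19 digits: overflows a long, fall back to double semantics.
      return (long) Double.parseDouble(value);
    }
  }
}
```

The overflow fallback in the catch block is where the "compare the length of the string" check mentioned above could go instead, avoiding the exception entirely.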
IMO if the user specifies it to extract as LONG, the value should be either a number or a long value string.
I guess the reason why we parse it as double here is to handle the scenario where the upper level directly asks for a double value, e.g. SUM(JSON_EXTRACT_SCALAR()).
That's weird, because if we parse as double and then cast to long, we lose the fractional part anyway and lose precision for values longer than 15 digits. The only difference parsing as double makes is for numbers with a fractional part or in scientific notation.
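To illustrate the 15-digit claim: a double has a 53-bit significand, so longs above 2^53 (about 9.0e15) are no longer all representable. A minimal demonstration (run with `java -ea`):

```java
public final class DoublePrecisionDemo {
  public static void main(String[] args) {
    // 2^53 + 1 is the first positive long a double cannot represent exactly.
    assert Long.parseLong("9007199254740993") == 9007199254740993L;
    assert (long) Double.parseDouble("9007199254740993") == 9007199254740992L;
    // 15-digit values still survive the double round trip.
    assert (long) Double.parseDouble("999999999999999") == 999999999999999L;
  }
}
```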
My suggestion would be:
- Use Long.parseLong() without the extra try-catch
- Fall back to BaseTransformFunction if the result's stored type doesn't match the function name (e.g. the result's stored type is DOUBLE but the function name is transformToLongValuesSV)

To demonstrate the second point, you can try calling transformToLongValuesSV() on JSON_EXTRACT_SCALAR(..., 'DOUBLE'), which might fail if we only do 1 without 2.
Codecov Report

Additional details and impacted files:

```
@@             Coverage Diff              @@
##             master   #14337      +/-   ##
============================================
+ Coverage     61.75%   63.87%    +2.11%
- Complexity      207     1568    +1361
============================================
  Files          2436     2663     +227
  Lines        133233   146123   +12890
  Branches      20636    22393    +1757
============================================
+ Hits          82274    93330   +11056
- Misses        44911    45901     +990
- Partials       6048     6892     +844
```
```diff
@@ -184,8 +191,7 @@ public long[] transformToLongValuesSV(ValueBlock valueBlock) {
       if (result instanceof Number) {
         _longValuesSV[i] = ((Number) result).longValue();
       } else {
-        // Handle scientific notation
-        _longValuesSV[i] = (long) Double.parseDouble(result.toString());
+        _longValuesSV[i] = Long.parseLong(result.toString());
```
I don't think we can merge this as it is. In JSON standard numbers can be written using the scientific notation. I think we are going to need to add our own parser here.
Do you mean users could write long values in scientific notation? If so, should we consider using BigInteger to parse it? I'm also okay with using BigDecimal to parse all numeric values, since the overhead should be acceptable compared to the cost of parsing the json string.
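For reference, BigDecimal handles both plain integers and scientific notation exactly, which is presumably the point of the suggestion (a sketch, not part of the PR):

```java
import java.math.BigDecimal;

public final class BigDecimalParsingDemo {
  public static void main(String[] args) {
    // Scientific notation parses to an exact integral value.
    assert new BigDecimal("1.2e3").longValueExact() == 1200L;
    // No 53-bit precision limit, unlike the double path.
    assert new BigDecimal("9007199254740993").longValueExact() == 9007199254740993L;
    // longValue() truncates fractional values toward zero; longValueExact()
    // throws ArithmeticException instead.
    assert new BigDecimal("123.456").longValue() == 123L;
  }
}
```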
JSON standard does allow that for numbers but not for integers.
It's a question of whether we want to keep supporting conversion from a json floating-point string (scientific notation or a number with a fractional part) directly to long.
In terms of testing, one can simulate it by adding the following to JsonExtractScalarTransformFunctionTest.testJsonPathTransformFunctionDataProvider:

```java
testArguments.add(new Object[]{
    String.format("jsonExtractScalar(%s,'$.stringSV','LONG')", input), DataType.LONG, true
});
```
I'd really like to avoid allocating an object per row if the result is going to be of a primitive type.
> JSON standard does allow that for numbers but not for integers.

That is not precise. In JSON there is only one type of numeric literal: number. A number is defined as integer fraction exponent. Therefore you can write a number as 9007199254740993e0 and it denotes the value 9007199254740993. We have to interpret that as 9007199254740993, but (long) Double.parseDouble("9007199254740993e0") returns 9007199254740992.
Using BigInteger/BigDecimal should be correct, but IMHO pretty expensive. I think it shouldn't be that hard to write our own parser. Something like:

```java
int e = indexOfExponent(str);
if (e < 0) {
  return Long.parseLong(str);
}
long base = Long.parseLong(str, 0, e, 10);
int exp = Integer.parseInt(str, e + 1, str.length(), 10);
// powerOfTen as suggested in https://stackoverflow.com/questions/46983772/fastest-way-to-obtain-a-power-of-10
long result = verifyOverflow(base * powerOfTen(exp)); // Not sure if checking whether the value changed sign is good enough
```
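On the overflow question: Math.multiplyExact throws ArithmeticException on long overflow, which avoids the sign-change heuristic entirely. A self-contained version of that idea (the exponent-index and power-of-ten helpers are inlined here; this is a sketch of one possible shape, not the PR's code):

```java
public final class ExponentLongParser {
  // All powers of ten that fit in a long: 10^0 .. 10^18.
  private static final long[] POWERS_OF_TEN = new long[19];

  static {
    POWERS_OF_TEN[0] = 1L;
    for (int i = 1; i < POWERS_OF_TEN.length; i++) {
      POWERS_OF_TEN[i] = POWERS_OF_TEN[i - 1] * 10L;
    }
  }

  public static long parse(String str) {
    int e = str.indexOf('e');
    if (e < 0) {
      e = str.indexOf('E');
    }
    if (e < 0) {
      return Long.parseLong(str);
    }
    long base = Long.parseLong(str, 0, e, 10);
    int exp = Integer.parseInt(str, e + 1, str.length(), 10);
    if (exp < 0 || exp >= POWERS_OF_TEN.length) {
      throw new NumberFormatException("exponent out of long range: " + str);
    }
    // Throws ArithmeticException on overflow instead of silently wrapping.
    return Math.multiplyExact(base, POWERS_OF_TEN[exp]);
  }
}
```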
Alternatively, if we are not sure that algorithm is correct, we can do:

```java
int e = indexOfExponent(str);
if (e < 0) {
  return Long.parseLong(str);
}
if (str.indexOf('.') >= 0) {
  throw whatever;
}
return toLong(new BigDecimal(str)); // verifying we fail if the value is not representable as a long
```
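A runnable version of this second alternative might look as follows (again a sketch; indexOfExponent and toLong are inlined, and the names are assumptions):

```java
import java.math.BigDecimal;

public final class JsonLongParser {
  public static long parse(String str) {
    int e = str.indexOf('e');
    if (e < 0) {
      e = str.indexOf('E');
    }
    if (e < 0) {
      return Long.parseLong(str);
    }
    if (str.indexOf('.') >= 0) {
      throw new NumberFormatException("fractional value: " + str);
    }
    // longValueExact() throws ArithmeticException when the value does not fit
    // in a long, or when a negative exponent leaves a fractional part.
    return new BigDecimal(str).longValueExact();
  }
}
```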
You're right, I was looking only at the productions because the main images don't display properly for me.
What about the current behavior of parsing numbers with fractional parts? Both 123.456 and 122.11e1 are accepted (by the code in the main branch).
You are right, my algorithms are not correct: in order to keep backward compatibility we should round in the same way the double-to-long conversion does, but without losing precision. So 123.456 should be transformed into 123. I think it is acceptable to fail if we find a number that is too large or too small to be represented as a long.
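For reference, Java's narrowing double-to-long conversion (JLS 5.1.3) rounds toward zero, so "rounding the same way" means truncation:

```java
public final class DoubleToLongRoundingDemo {
  public static void main(String[] args) {
    // Narrowing primitive conversion rounds toward zero.
    assert (long) 123.456 == 123L;
    assert (long) -123.456 == -123L;
    assert (long) Double.parseDouble("122.11e1") == 1221L;
  }
}
```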
I added custom methods for long and 'jsonLong' parsing.
```java
public static long parseLong(CharSequence cs, int start, int end) {
  if (cs == null) {
    throw NULL_EXCEPTION;
  }
```
I think this is over-engineering. We should throw new exceptions by default.
In most cases performance is not going to be that problematic, given that the code either won't be on the hot path or an invalid value will abort the execution. In those cases throwing a new exception is more useful for diagnosing a possible bug or a problem in the dataset.
Instead I suggest having one public static long parseLongFast(CharSequence cs, int start, int end) throws OurCheckedNumberFormatException. Then optionally add a public static long parseLong(CharSequence cs, int start, int end) that catches OurCheckedNumberFormatException and re-throws it as a new NumberFormatException.
That way we have the ability to call the fast code that doesn't allocate, in which case we have to deal with the checked exception. Where performance is not so critical, or invalid values would abort the execution, we can call parseLong.
I think we can repeat that for all the other methods in this class.
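The two-method pattern proposed here could be sketched as follows. Only the method names come from the discussion; the stackless-exception implementation is my assumption about one way to do it:

```java
public final class NumberParser {

  /** Checked, preallocated, stackless: the fast path never allocates. */
  public static final class OurCheckedNumberFormatException extends Exception {
    static final OurCheckedNumberFormatException INSTANCE = new OurCheckedNumberFormatException();

    private OurCheckedNumberFormatException() {
      // writableStackTrace = false makes the shared static instance safe.
      super(null, null, false, false);
    }
  }

  public static long parseLongFast(CharSequence cs, int start, int end)
      throws OurCheckedNumberFormatException {
    try {
      return Long.parseLong(cs, start, end, 10);
    } catch (NumberFormatException e) {
      throw OurCheckedNumberFormatException.INSTANCE;
    }
  }

  // For call sites off the hot path: re-throw as a fresh NumberFormatException
  // that carries context and a real stack trace.
  public static long parseLong(CharSequence cs, int start, int end) {
    try {
      return parseLongFast(cs, start, end);
    } catch (OurCheckedNumberFormatException e) {
      throw new NumberFormatException("Invalid long value: " + cs.subSequence(start, end));
    }
  }
}
```

Because OurCheckedNumberFormatException is checked, callers of parseLongFast cannot silently forget to handle the failure, which addresses the "gates of bad code patterns" concern below.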
The default exception thrown by Long.parseLong() doesn't carry useful context information, only the value. A regular user won't follow the stack trace to understand what went wrong, so the really useful context (e.g. the name of the function parameter being parsed) must be provided by the call site, making the original exception's message and stack trace unnecessary. In addition, this approach makes it possible to ignore the exception (e.g. when it's ok to evaluate it to null) without paying for exception instantiation. For other cases I'd recommend Long.parseLong().
> A regular user won't follow the stack trace to understand what went wrong so the really useful context (e.g. name of function parameter that is being parsed) must be provided by the call site (making the original exception's message and stack trace unnecessary)

> In addition to that the approach makes it possible to ignore the exception (e.g. when it's ok to evaluate it to null) without having to pay for exception instantiation.

Precisely, that is why throwing a checked exception is better. Throwing a standard unchecked exception without context opens the gates to bad code patterns: devs will forget about this exception and we will end up with exceptions without context. The same happens right now in all the places where we do this trick.
I don't think throwing a stackless exception is a good idea, but I'm fine with it if you think it's better, as long as it is a non-standard checked exception.
I agree about the checked exception, but I don't feel that multiplying methods is necessary.
Throwing a stackless exception is not a generic solution; it's meant to help in specific cases. I've seen plenty of places in the codebase that ignore caught exceptions (either completely or by setting some flag), suggesting that an error code (which this stackless exception effectively is) is enough.
As for the hot path, I think we can safely assume anything appearing in e.g. a SQL function method is going to be executed often and thus is going to be hot.
When extracting very large LONG numbers, the json_extract_scalar() function loses some precision, even though the numbers fit into a LONG just fine. For example, if there's a JSON column called properties that looks like
then sql such as
returns 5514400327644544000.
That is because the function parses the value as a double and then casts it to long.
This PR first tries to parse the value as a long, and then falls back to double to support scientific notation.
Reported by @AlexanderKM
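A minimal reproduction of the reported behavior (the 19-digit literal below is illustrative, since the original JSON example is not included above):

```java
public final class PrecisionLossRepro {
  public static void main(String[] args) {
    String raw = "5514400327644544239"; // illustrative 19-digit value
    // Old behavior: parse as double, cast to long. Precision is lost because
    // a double has only 53 significand bits.
    long viaDouble = (long) Double.parseDouble(raw);
    // New behavior: parse directly as long. Exact.
    long viaLong = Long.parseLong(raw);
    assert viaLong == 5514400327644544239L;
    assert viaDouble != viaLong;
  }
}
```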