
Fix long value parsing in jsonextractscalar #14337

Merged

Conversation

bziobrowski (Contributor, PR author):

When extracting very large LONG numbers, the json_extract_scalar() function loses precision, even though the numbers fit into a LONG just fine.
For example, if there's a JSON column called properties that looks like

{
  "num_clicks": "5514400327644543899"
}

then SQL such as

select json_extract_scalar(properties, '$.num_clicks', 'LONG', 0) from tab

returns 5514400327644544000.
That is because the function parses the value as a double and then casts it to long.
This PR first tries to parse the value as a long, and only falls back to double parsing to support scientific notation.

Reported by @AlexanderKM
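The described ordering can be sketched as a standalone method (an illustration only, not the actual Pinot code):

```java
// Sketch of the fix: try an exact long parse first; only fall back to the
// lossy double path for input that isn't a plain integer (e.g. "1.2e3").
public final class LongFirstParse {
  public static long toLong(String value) {
    try {
      return Long.parseLong(value); // exact for every in-range long
    } catch (NumberFormatException e) {
      // scientific notation or fractional part: the cast truncates toward zero
      return (long) Double.parseDouble(value);
    }
  }

  public static void main(String[] args) {
    System.out.println(toLong("5514400327644543899")); // all 19 digits kept
    System.out.println(toLong("1.2e3"));
  }
}
```

With this ordering, a 19-digit value such as 5514400327644543899 is preserved exactly, while the old double-first path rounds it to the nearest representable double before casting.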

@@ -184,8 +184,12 @@ public long[] transformToLongValuesSV(ValueBlock valueBlock) {
       if (result instanceof Number) {
         _longValuesSV[i] = ((Number) result).longValue();
       } else {
-        // Handle scientific notation
-        _longValuesSV[i] = (long) Double.parseDouble(result.toString());
+        try {
bziobrowski (PR author) commented:
@Jackie-Jiang @gortiz
Do we need to support scientific notation?
If so, what do you think of copying the parseLong code and using a static exception instead, to curb the potential allocations and stack-trace overhead?

Reviewer (Contributor):

Seems like we are mixing CAST into this method.
IMO the correct fix should be:

  • Use Long.parseLong() here
  • Check the result type, and fall back to the default when the result type doesn't match the method (see CastTransformFunction as an example)

Other JSON extract functions seem to have this problem too.

Reviewer (Contributor):

I think we could iterate over the string looking for digits or -. If we only find those, we optimistically call Long.parseLong. If that fails, or if we find anything other than - or digits, we call Double.parseDouble. Unless I'm missing something, this should work in almost all cases except very large numbers (which we can probably also check statically by comparing the length of the string).
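That scan could look roughly like this (a hypothetical sketch; none of these names appear in the PR):

```java
// Hypothetical sketch of the digit-scan heuristic: attempt Long.parseLong only
// for strings that look like plain integers; anything else ('.', 'e', ...)
// goes straight to the double path.
public final class OptimisticParse {
  public static long parse(String s) {
    // Long.MIN_VALUE is "-9223372036854775808": at most 20 characters
    boolean longCandidate = !s.isEmpty() && s.length() <= 20;
    for (int i = 0; longCandidate && i < s.length(); i++) {
      char c = s.charAt(i);
      if ((c < '0' || c > '9') && !(i == 0 && c == '-')) {
        longCandidate = false;
      }
    }
    if (longCandidate) {
      try {
        return Long.parseLong(s); // exact
      } catch (NumberFormatException outOfLongRange) {
        // all digits but beyond the long range: fall through to double
      }
    }
    return (long) Double.parseDouble(s);
  }

  public static void main(String[] args) {
    System.out.println(parse("5514400327644543899")); // exact long path
    System.out.println(parse("1.5e2"));               // double path
  }
}
```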

Reviewer (Contributor):

IMO if the user asks to extract as LONG, the value should be either a number or a long value string.
I guess the reason we parse it as a double here is to handle the scenario where the upper level directly asks for a double value, e.g. SUM(JSON_EXTRACT_SCALAR()).

bziobrowski (PR author):

That's weird, because if we parse as a double and then cast to long, we lose the fractional part anyway, and we lose precision for values longer than 15 digits. The only difference parsing as a double makes is for numbers with a fractional part or in scientific notation.

Reviewer (Contributor):

My suggestion would be:

  1. Use Long.parseLong() without an extra try-catch
  2. Fall back to BaseTransformFunction if the result stored type doesn't match the function name (e.g. the result stored type is DOUBLE but the function is transformToLongValuesSV)

To demonstrate the second point, try calling transformToLongValuesSV() on JSON_EXTRACT_SCALAR(..., 'DOUBLE'), which might fail if we only do 1 without 2.

codecov-commenter commented Oct 30, 2024:

Codecov Report

Attention: Patch coverage is 89.90826% with 11 lines in your changes missing coverage. Please review.

Project coverage is 63.87%. Comparing base (59551e4) to head (b222158).
Report is 1310 commits behind head on master.

Files with missing lines Patch % Lines
...n/java/org/apache/pinot/core/util/NumberUtils.java 90.62% 4 Missing and 5 partials ⚠️
...m/function/JsonExtractScalarTransformFunction.java 75.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14337      +/-   ##
============================================
+ Coverage     61.75%   63.87%   +2.11%     
- Complexity      207     1568    +1361     
============================================
  Files          2436     2663     +227     
  Lines        133233   146123   +12890     
  Branches      20636    22393    +1757     
============================================
+ Hits          82274    93330   +11056     
- Misses        44911    45901     +990     
- Partials       6048     6892     +844     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.82% <89.90%> (+2.11%) ⬆️
java-21 63.71% <89.90%> (+2.08%) ⬆️
skip-bytebuffers-false 63.86% <89.90%> (+2.11%) ⬆️
skip-bytebuffers-true 63.68% <89.90%> (+35.96%) ⬆️
temurin 63.87% <89.90%> (+2.11%) ⬆️
unittests 63.86% <89.90%> (+2.11%) ⬆️
unittests1 55.59% <89.90%> (+8.70%) ⬆️
unittests2 34.17% <0.00%> (+6.44%) ⬆️

Flags with carried forward coverage won't be shown.


@@ -184,8 +191,7 @@ public long[] transformToLongValuesSV(ValueBlock valueBlock) {
       if (result instanceof Number) {
         _longValuesSV[i] = ((Number) result).longValue();
       } else {
-        // Handle scientific notation
-        _longValuesSV[i] = (long) Double.parseDouble(result.toString());
+        _longValuesSV[i] = Long.parseLong(result.toString());
gortiz (Contributor), Nov 6, 2024:

I don't think we can merge this as it is. In the JSON standard, numbers can be written in scientific notation. I think we are going to need to add our own parser here.

Reviewer (Contributor):

Do you mean a user could write long values in scientific notation? If so, should we consider using BigInteger to parse them? I'm also okay with using BigDecimal to parse all numeric values, since the overhead should be acceptable compared to the cost of parsing the JSON string.

bziobrowski (PR author), Nov 7, 2024:

The JSON standard does allow that for numbers, but not for integers.
It's a question of whether we want to keep supporting conversion from a JSON floating-point string (scientific notation or a number with a fractional part) directly to long.

In terms of testing, one can simulate it by adding the following to JsonExtractScalarTransformFunctionTest.testJsonPathTransformFunctionDataProvider:

testArguments.add(new Object[]{
    String.format("jsonExtractScalar(%s,'$.stringSV','LONG')", input), DataType.LONG, true
});

I'd really like to avoid allocating an object per row if the result is going to be of a primitive type.

Reviewer (Contributor):

> JSON standard does allow that for numbers but not for integers.

That is not precise. In JSON there is only one type of numeric literal: number. A number is defined as integer fraction exponent. Therefore you can write a number as 9007199254740993e0 and it denotes 9007199254740993. We have to interpret it as 9007199254740993, but (long) Double.parseDouble("9007199254740993e0") returns 9007199254740992.

Using BigInteger/BigDecimal should be correct, but IMHO pretty expensive. I think it shouldn't be that hard to write our own parser. Something like:

int e = indexOfExponent(str);
if (e < 0) {
  return Long.parseLong(str);
}
long base = Long.parseLong(str, 0, e, 10);
int exp = Integer.parseInt(str, e + 1, str.length(), 10);
long powerOfTen = powerOfTen(exp); // as suggested in https://stackoverflow.com/questions/46983772/fastest-way-to-obtain-a-power-of-10
long result = verifyOverflow(base * powerOfTen); // Not sure if checking whether the value changed sign is good enough

Alternatively, if we are not sure that algorithm is correct, we can do:

int e = indexOfExponent(str);
if (e < 0) {
  return Long.parseLong(str);
}
if (str.indexOf('.') >= 0) {
  throw whatever;
}
return toLong(new BigDecimal(str)); // verifying we fail if the value is not representable as a long
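A runnable version of the exponent-splitting idea might look like this (the helper behavior is my own assumption: fractions and negative exponents are rejected rather than handled, and overflow checking uses Math.multiplyExact):

```java
// Assumed sketch of a JSON-long parser that splits off the exponent and keeps
// full 64-bit precision, unlike the double round-trip.
public final class ExponentLongParser {
  public static long parseJsonLong(String str) {
    int e = str.indexOf('e');
    if (e < 0) {
      e = str.indexOf('E');
    }
    if (e < 0) {
      return Long.parseLong(str); // plain integer: exact
    }
    long base = Long.parseLong(str, 0, e, 10);              // Java 9+ overload
    int exp = Integer.parseInt(str, e + 1, str.length(), 10);
    if (exp < 0) {
      throw new NumberFormatException("Negative exponent not supported: " + str);
    }
    long result = base;
    for (int i = 0; i < exp; i++) {
      result = Math.multiplyExact(result, 10L); // throws ArithmeticException on overflow
    }
    return result;
  }

  public static void main(String[] args) {
    // 2^53 + 1 survives, unlike (long) Double.parseDouble("9007199254740993e0")
    System.out.println(parseJsonLong("9007199254740993e0")); // 9007199254740993
  }
}
```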

bziobrowski (PR author), Nov 7, 2024:

You're right; I was looking only at the grammar productions, because the diagrams don't display properly for me.
What about the current behavior of parsing numbers with fractional parts?
Both 123.456 and 122.11e1 are accepted (by the code in the main branch).

Reviewer (Contributor):

You are right, my algorithms are not correct: to keep backward compatibility we should round the same way the double-to-long cast does, but without losing precision.

So 123.456 should be transformed into 123. I think it is acceptable to fail if we find a number that is too large or too small to be represented as a long.
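For reference, the (long) cast truncates toward zero, which is the rounding a backward-compatible parser would need to reproduce:

```java
// Double-to-long casts truncate toward zero; a precision-safe parser must
// round fractional input the same way to stay backward compatible.
public final class CastRounding {
  public static void main(String[] args) {
    System.out.println((long) Double.parseDouble("123.456"));  // 123
    System.out.println((long) Double.parseDouble("122.11e1")); // 1221
    System.out.println((long) Double.parseDouble("-7.9"));     // -7 (toward zero, not floor)
  }
}
```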

bziobrowski (PR author):

I added custom methods for long and 'jsonLong' parsing.

Comment on lines 78 to 81:

public static long parseLong(CharSequence cs, int start, int end) {
  if (cs == null) {
    throw NULL_EXCEPTION;
  }
Reviewer (Contributor):

I think this is over-engineering. We should throw new exceptions by default.

In most cases performance is not going to be a problem, given that the code either won't be in the hot path or an invalid value will abort the execution. In those cases throwing a new exception is more useful for diagnosing a possible bug or a problem in the dataset.

Instead I suggest having one public static long parseLongFast(CharSequence cs, int start, int end) throws OurCheckedNumberFormatException. Then optionally add a public static long parseLong(CharSequence cs, int start, int end) that catches OurCheckedNumberFormatException and re-throws it as a new NumberFormatException.

That way we have fast code that doesn't allocate, at the cost of dealing with the checked exception at the call site. In cases where performance is not so critical, or invalid values abort the execution anyway, we can call parseLong.

I think we can repeat that with all other methods in this class.
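The two-method split being suggested, combined with the pre-allocated stackless exception the thread later settles on, could be sketched like this (class and method names follow the comment; the bodies are illustrative, not the PR's code, and the fast path omits overflow checking):

```java
// Sketch: a fast path that signals failure via a pre-allocated, stackless
// checked exception, plus a wrapper that re-throws a standard
// NumberFormatException with context.
class OurCheckedNumberFormatException extends Exception {
  static final OurCheckedNumberFormatException INSTANCE = new OurCheckedNumberFormatException();
  private OurCheckedNumberFormatException() {
    super(null, null, false, false); // no message, no cause, no stack trace
  }
}

public final class NumberUtilsSketch {
  // Fast path: no allocation on failure, but callers must handle the exception.
  public static long parseLongFast(CharSequence cs, int start, int end)
      throws OurCheckedNumberFormatException {
    boolean negative = false;
    int i = start;
    if (i < end && cs.charAt(i) == '-') {
      negative = true;
      i++;
    }
    if (i == end) {
      throw OurCheckedNumberFormatException.INSTANCE;
    }
    long value = 0;
    for (; i < end; i++) {
      char c = cs.charAt(i);
      if (c < '0' || c > '9') {
        throw OurCheckedNumberFormatException.INSTANCE;
      }
      value = value * 10 + (c - '0'); // overflow check omitted in this sketch
    }
    return negative ? -value : value;
  }

  // Convenience wrapper for call sites where allocation doesn't matter.
  public static long parseLong(CharSequence cs, int start, int end) {
    try {
      return parseLongFast(cs, start, end);
    } catch (OurCheckedNumberFormatException e) {
      throw new NumberFormatException("Invalid long: " + cs.subSequence(start, end));
    }
  }
}
```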

bziobrowski (PR author), Nov 12, 2024:

The default exception thrown by Long.parseLong() doesn't carry useful context, only the value. A regular user won't follow the stack trace to understand what went wrong, so the really useful context (e.g. the name of the function parameter being parsed) must be provided by the call site, which makes the original exception's message and stack trace unnecessary. In addition, this approach makes it possible to ignore the exception (e.g. when it's okay to evaluate to null) without paying for exception instantiation. For other cases I'd recommend Long.parseLong().

Reviewer (Contributor):

> A regular user won't follow the stack trace to understand what went wrong so the really useful context (e.g. name of function parameter that is being parsed) must be provided by the call site (making the original exception's message and stack trace unnecessary)

> In addition to that the approach makes it possible to ignore the exception (e.g. when it's ok to evaluate it to null) without having to pay for exception instantiation.

Precisely, and that is why throwing a checked exception is better. Throwing a standard, unchecked exception without context opens the gates to bad code patterns: devs will forget about this exception and we will end up with exceptions that lack context. The same already happens in all the places where we do this trick.

I don't think throwing a stackless exception is a good idea, but I'm fine with it if you think it's better, as long as it is a non-standard checked exception.

bziobrowski (PR author):

I agree about the checked exception, but I don't feel that multiplying methods is necessary.
Throwing a stackless exception is not a generic solution; it's meant to help in specific cases.
I've seen plenty of places in the codebase that ignore caught exceptions (either completely or by just setting a flag), suggesting that an error code (which this stackless exception effectively is) is enough.
As for the hot path, I think we can safely assume that anything appearing in e.g. an SQL function method is going to be executed often, and is thus hot.

gortiz merged commit 013e80f into apache:master on Nov 12, 2024
21 checks passed
davecromberge pushed a commit to davecromberge/pinot that referenced this pull request Nov 22, 2024
4 participants