
Fix long value parsing in jsonextractscalar #14337

Merged

Conversation

bziobrowski (Contributor, PR author):

When extracting very large LONG numbers, the json_extract_scalar() function loses precision, even though the numbers fit into a LONG just fine.
For example, if there's a JSON column called properties that looks like

{
  "num_clicks": "5514400327644543899"
}

then SQL such as

select json_extract_scalar(properties, '$.num_clicks', 'LONG', 0) from tab

returns 5514400327644544000.
That is because the function parses the value as a double and then casts it to long.
This PR first tries to parse the value as a long, and only falls back to double parsing to support scientific notation.

Reported by @AlexanderKM
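The described ordering can be sketched as a standalone method (an illustration only, not the actual Pinot code):

```java
// Sketch of the fix: try an exact long parse first; only fall back to the
// lossy double path for input that isn't a plain integer (e.g. "1.2e3").
public final class LongFirstParse {
  public static long toLong(String value) {
    try {
      return Long.parseLong(value); // exact for every in-range long
    } catch (NumberFormatException e) {
      // scientific notation or fractional part: the cast truncates toward zero
      return (long) Double.parseDouble(value);
    }
  }

  public static void main(String[] args) {
    System.out.println(toLong("5514400327644543899")); // all 19 digits kept
    System.out.println(toLong("1.2e3"));
  }
}
```

With this ordering, a 19-digit value such as 5514400327644543899 is preserved exactly, while the old double-first path rounds it to the nearest representable double before casting.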

@@ -184,8 +184,12 @@ public long[] transformToLongValuesSV(ValueBlock valueBlock) {
       if (result instanceof Number) {
         _longValuesSV[i] = ((Number) result).longValue();
       } else {
-        // Handle scientific notation
-        _longValuesSV[i] = (long) Double.parseDouble(result.toString());
+        try {
bziobrowski (PR author) commented:
@Jackie-Jiang @gortiz
Do we need to support scientific notation?
If so, what do you think of copying the parseLong code and using a static exception instead, to curb the potential allocations and stack-trace overhead?

Reviewer (Contributor):

Seems like we are mixing CAST into this method.
IMO the correct fix should be:

  • Use Long.parseLong() here
  • Check the result type, and fall back to the default when the result type doesn't match the method (see CastTransformFunction as an example)

Other JSON extract functions seem to have this problem too.

Reviewer (Contributor):

I think we could iterate over the string looking for digits or -. If we only find those, we optimistically call Long.parseLong. If that fails, or if we find anything other than - or digits, we call Double.parseDouble. Unless I'm missing something, this should work in almost all cases except very large numbers (which we can probably also check statically by comparing the length of the string).
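That scan could look roughly like this (a hypothetical sketch; none of these names appear in the PR):

```java
// Hypothetical sketch of the digit-scan heuristic: attempt Long.parseLong only
// for strings that look like plain integers; anything else ('.', 'e', ...)
// goes straight to the double path.
public final class OptimisticParse {
  public static long parse(String s) {
    // Long.MIN_VALUE is "-9223372036854775808": at most 20 characters
    boolean longCandidate = !s.isEmpty() && s.length() <= 20;
    for (int i = 0; longCandidate && i < s.length(); i++) {
      char c = s.charAt(i);
      if ((c < '0' || c > '9') && !(i == 0 && c == '-')) {
        longCandidate = false;
      }
    }
    if (longCandidate) {
      try {
        return Long.parseLong(s); // exact
      } catch (NumberFormatException outOfLongRange) {
        // all digits but beyond the long range: fall through to double
      }
    }
    return (long) Double.parseDouble(s);
  }

  public static void main(String[] args) {
    System.out.println(parse("5514400327644543899")); // exact long path
    System.out.println(parse("1.5e2"));               // double path
  }
}
```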

Reviewer (Contributor):

IMO if the user asks to extract as LONG, the value should be either a number or a long value string.
I guess the reason we parse it as a double here is to handle the scenario where the upper level directly asks for a double value, e.g. SUM(JSON_EXTRACT_SCALAR()).

bziobrowski (PR author):

That's weird, because if we parse as a double and then cast to long, we lose the fractional part anyway, and we lose precision for values longer than 15 digits. The only difference parsing as a double makes is for numbers with a fractional part or in scientific notation.

Reviewer (Contributor):

My suggestion would be:

  1. Use Long.parseLong() without an extra try-catch
  2. Fall back to BaseTransformFunction if the result stored type doesn't match the function name (e.g. the result stored type is DOUBLE but the function is transformToLongValuesSV)

To demonstrate the second point, try calling transformToLongValuesSV() on JSON_EXTRACT_SCALAR(..., 'DOUBLE'), which might fail if we only do 1 without 2.

codecov-commenter commented Oct 30, 2024:

Codecov Report

Attention: Patch coverage is 89.90826% with 11 lines in your changes missing coverage. Please review.

Project coverage is 63.87%. Comparing base (59551e4) to head (b222158).
Report is 1310 commits behind head on master.

Files with missing lines Patch % Lines
...n/java/org/apache/pinot/core/util/NumberUtils.java 90.62% 4 Missing and 5 partials ⚠️
...m/function/JsonExtractScalarTransformFunction.java 75.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14337      +/-   ##
============================================
+ Coverage     61.75%   63.87%   +2.11%     
- Complexity      207     1568    +1361     
============================================
  Files          2436     2663     +227     
  Lines        133233   146123   +12890     
  Branches      20636    22393    +1757     
============================================
+ Hits          82274    93330   +11056     
- Misses        44911    45901     +990     
- Partials       6048     6892     +844     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.82% <89.90%> (+2.11%) ⬆️
java-21 63.71% <89.90%> (+2.08%) ⬆️
skip-bytebuffers-false 63.86% <89.90%> (+2.11%) ⬆️
skip-bytebuffers-true 63.68% <89.90%> (+35.96%) ⬆️
temurin 63.87% <89.90%> (+2.11%) ⬆️
unittests 63.86% <89.90%> (+2.11%) ⬆️
unittests1 55.59% <89.90%> (+8.70%) ⬆️
unittests2 34.17% <0.00%> (+6.44%) ⬆️

Flags with carried forward coverage won't be shown.


@@ -184,8 +191,7 @@ public long[] transformToLongValuesSV(ValueBlock valueBlock) {
       if (result instanceof Number) {
         _longValuesSV[i] = ((Number) result).longValue();
       } else {
-        // Handle scientific notation
-        _longValuesSV[i] = (long) Double.parseDouble(result.toString());
+        _longValuesSV[i] = Long.parseLong(result.toString());
gortiz (Contributor), Nov 6, 2024:

I don't think we can merge this as it is. In the JSON standard, numbers can be written in scientific notation. I think we are going to need to add our own parser here.

Reviewer (Contributor):

Do you mean a user could write long values in scientific notation? If so, should we consider using BigInteger to parse them? I'm also okay with using BigDecimal to parse all numeric values, since the overhead should be acceptable compared to the cost of parsing the JSON string.

bziobrowski (PR author), Nov 7, 2024:

The JSON standard does allow that for numbers, but not for integers.
It's a question of whether we want to keep supporting conversion from a JSON floating-point string (scientific notation or a number with a fractional part) directly to long.

In terms of testing, one can simulate it by adding the following to JsonExtractScalarTransformFunctionTest.testJsonPathTransformFunctionDataProvider:

testArguments.add(new Object[]{
    String.format("jsonExtractScalar(%s,'$.stringSV','LONG')", input), DataType.LONG, true
});

I'd really like to avoid allocating an object per row if the result is going to be of a primitive type.

Reviewer (Contributor):

> JSON standard does allow that for numbers but not for integers.

That is not precise. In JSON there is only one type of numeric literal: number. A number is defined as integer fraction exponent. Therefore you can write a number as 9007199254740993e0 and it denotes 9007199254740993. We have to interpret it as 9007199254740993, but (long) Double.parseDouble("9007199254740993e0") returns 9007199254740992.

Using BigInteger/BigDecimal should be correct, but IMHO pretty expensive. I think it shouldn't be that hard to write our own parser. Something like:

int e = indexOfExponent(str);
if (e < 0) {
  return Long.parseLong(str);
}
long base = Long.parseLong(str, 0, e, 10);
int exp = Integer.parseInt(str, e + 1, str.length(), 10);
long powerOfTen = powerOfTen(exp); // as suggested in https://stackoverflow.com/questions/46983772/fastest-way-to-obtain-a-power-of-10
long result = verifyOverflow(base * powerOfTen); // Not sure if checking whether the value changed sign is good enough

Alternatively, if we are not sure that algorithm is correct, we can do:

int e = indexOfExponent(str);
if (e < 0) {
  return Long.parseLong(str);
}
if (str.indexOf('.') >= 0) {
  throw whatever;
}
return toLong(new BigDecimal(str)); // verifying we fail if the value is not representable as a long
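A runnable version of the exponent-splitting idea might look like this (the helper behavior is my own assumption: fractions and negative exponents are rejected rather than handled, and overflow checking uses Math.multiplyExact):

```java
// Assumed sketch of a JSON-long parser that splits off the exponent and keeps
// full 64-bit precision, unlike the double round-trip.
public final class ExponentLongParser {
  public static long parseJsonLong(String str) {
    int e = str.indexOf('e');
    if (e < 0) {
      e = str.indexOf('E');
    }
    if (e < 0) {
      return Long.parseLong(str); // plain integer: exact
    }
    long base = Long.parseLong(str, 0, e, 10);              // Java 9+ overload
    int exp = Integer.parseInt(str, e + 1, str.length(), 10);
    if (exp < 0) {
      throw new NumberFormatException("Negative exponent not supported: " + str);
    }
    long result = base;
    for (int i = 0; i < exp; i++) {
      result = Math.multiplyExact(result, 10L); // throws ArithmeticException on overflow
    }
    return result;
  }

  public static void main(String[] args) {
    // 2^53 + 1 survives, unlike (long) Double.parseDouble("9007199254740993e0")
    System.out.println(parseJsonLong("9007199254740993e0")); // 9007199254740993
  }
}
```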

bziobrowski (PR author), Nov 7, 2024:

You're right; I was looking only at the grammar productions, because the diagrams don't display properly for me.
What about the current behavior of parsing numbers with fractional parts?
Both 123.456 and 122.11e1 are accepted (by the code in the main branch).

Reviewer (Contributor):

You are right, my algorithms are not correct: to keep backward compatibility we should round the same way the double-to-long cast does, but without losing precision.

So 123.456 should be transformed into 123. I think it is acceptable to fail if we find a number that is too large or too small to be represented as a long.
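For reference, the (long) cast truncates toward zero, which is the rounding a backward-compatible parser would need to reproduce:

```java
// Double-to-long casts truncate toward zero; a precision-safe parser must
// round fractional input the same way to stay backward compatible.
public final class CastRounding {
  public static void main(String[] args) {
    System.out.println((long) Double.parseDouble("123.456"));  // 123
    System.out.println((long) Double.parseDouble("122.11e1")); // 1221
    System.out.println((long) Double.parseDouble("-7.9"));     // -7 (toward zero, not floor)
  }
}
```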

bziobrowski (PR author):

I added custom methods for long and 'jsonLong' parsing.

Comment on lines 78 to 81:

public static long parseLong(CharSequence cs, int start, int end) {
  if (cs == null) {
    throw NULL_EXCEPTION;
  }
Reviewer (Contributor):

I think this is over-engineering. We should throw new exceptions by default.

In most cases performance is not going to be a problem, given that the code either won't be in the hot path or an invalid value will abort the execution. In those cases throwing a new exception is more useful for diagnosing a possible bug or a problem in the dataset.

Instead I suggest having one public static long parseLongFast(CharSequence cs, int start, int end) throws OurCheckedNumberFormatException. Then optionally add a public static long parseLong(CharSequence cs, int start, int end) that catches OurCheckedNumberFormatException and re-throws it as a new NumberFormatException.

That way we have fast code that doesn't allocate, at the cost of dealing with the checked exception at the call site. In cases where performance is not so critical, or invalid values abort the execution anyway, we can call parseLong.

I think we can repeat that with all other methods in this class.
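The two-method split being suggested, combined with the pre-allocated stackless exception the thread later settles on, could be sketched like this (class and method names follow the comment; the bodies are illustrative, not the PR's code, and the fast path omits overflow checking):

```java
// Sketch: a fast path that signals failure via a pre-allocated, stackless
// checked exception, plus a wrapper that re-throws a standard
// NumberFormatException with context.
class OurCheckedNumberFormatException extends Exception {
  static final OurCheckedNumberFormatException INSTANCE = new OurCheckedNumberFormatException();
  private OurCheckedNumberFormatException() {
    super(null, null, false, false); // no message, no cause, no stack trace
  }
}

public final class NumberUtilsSketch {
  // Fast path: no allocation on failure, but callers must handle the exception.
  public static long parseLongFast(CharSequence cs, int start, int end)
      throws OurCheckedNumberFormatException {
    boolean negative = false;
    int i = start;
    if (i < end && cs.charAt(i) == '-') {
      negative = true;
      i++;
    }
    if (i == end) {
      throw OurCheckedNumberFormatException.INSTANCE;
    }
    long value = 0;
    for (; i < end; i++) {
      char c = cs.charAt(i);
      if (c < '0' || c > '9') {
        throw OurCheckedNumberFormatException.INSTANCE;
      }
      value = value * 10 + (c - '0'); // overflow check omitted in this sketch
    }
    return negative ? -value : value;
  }

  // Convenience wrapper for call sites where allocation doesn't matter.
  public static long parseLong(CharSequence cs, int start, int end) {
    try {
      return parseLongFast(cs, start, end);
    } catch (OurCheckedNumberFormatException e) {
      throw new NumberFormatException("Invalid long: " + cs.subSequence(start, end));
    }
  }
}
```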

bziobrowski (PR author), Nov 12, 2024:

The default exception thrown by Long.parseLong() doesn't carry useful context, only the value. A regular user won't follow the stack trace to understand what went wrong, so the really useful context (e.g. the name of the function parameter being parsed) must be provided by the call site, which makes the original exception's message and stack trace unnecessary. In addition, this approach makes it possible to ignore the exception (e.g. when it's okay to evaluate to null) without paying for exception instantiation. For other cases I'd recommend Long.parseLong().

Reviewer (Contributor):

> A regular user won't follow the stack trace to understand what went wrong so the really useful context (e.g. name of function parameter that is being parsed) must be provided by the call site (making the original exception's message and stack trace unnecessary)

> In addition to that the approach makes it possible to ignore the exception (e.g. when it's ok to evaluate it to null) without having to pay for exception instantiation.

Precisely, and that is why throwing a checked exception is better. Throwing a standard, unchecked exception without context opens the gates to bad code patterns: devs will forget about this exception and we will end up with exceptions that lack context. The same already happens in all the places where we do this trick.

I don't think throwing a stackless exception is a good idea, but I'm fine with it if you think it's better, as long as it is a non-standard checked exception.

bziobrowski (PR author):

I agree about the checked exception, but I don't feel that multiplying methods is necessary.
Throwing a stackless exception is not a generic solution; it's meant to help in specific cases.
I've seen plenty of places in the codebase that ignore caught exceptions (either completely or by just setting a flag), suggesting that an error code (which this stackless exception effectively is) is enough.
As for the hot path, I think we can safely assume that anything appearing in e.g. an SQL function method is going to be executed often, and is thus hot.

gortiz merged commit 013e80f into apache:master on Nov 12, 2024
21 checks passed
davecromberge pushed a commit to davecromberge/pinot that referenced this pull request Nov 22, 2024
4 participants