
[SPARK-11753][SQL][test-hadoop2.2] Make allowNonNumericNumbers option work #9759

Closed
wants to merge 12 commits into from

Conversation

viirya
Member

@viirya viirya commented Nov 17, 2015

What changes were proposed in this pull request?

Jackson supports the allowNonNumericNumbers option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF". The currently used Jackson version (2.5.3) doesn't fully support it. This patch upgrades the library and makes the two ignored tests in JsonParsingOptionsSuite pass.

How was this patch tested?

JsonParsingOptionsSuite.
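As background for the review discussion that follows: the JVM's own numeric parsing (which Scala's `"...".toDouble`/`"...".toFloat` delegate to via `Double.parseDouble`/`Float.parseFloat`) accepts exactly the case-sensitive tokens `NaN`, `Infinity`, and `-Infinity`, and nothing else. A quick dependency-free sketch in Java (class and helper names are just for illustration):

```java
public class NonNumericTokens {
    // Returns true iff the JVM's Double.parseDouble accepts the token.
    static boolean parses(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The exact, case-sensitive tokens are accepted...
        assert Double.isNaN(Double.parseDouble("NaN"));
        assert Double.parseDouble("Infinity") == Double.POSITIVE_INFINITY;
        assert Float.parseFloat("-Infinity") == Float.NEGATIVE_INFINITY;

        // ...but the short and lowercase variants are rejected outright,
        // which is why tokens like INF need explicit handling in the parser.
        assert !parses("INF");
        assert !parses("inf");
        assert !parses("nan");

        System.out.println("all checks passed");
    }
}
```

This is why Jackson's allowNonNumericNumbers feature alone is not enough: variants such as "INF" still need dedicated handling on top of plain JVM number parsing.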

@rxin
Contributor

rxin commented Nov 17, 2015

Can you also add nan, infinity, -infinity, inf, and -inf to the test case? And also turn it on by default.

@rxin
Contributor

rxin commented Nov 17, 2015

(Also update the documentation in DataFrameReader and readwrite.py to include this option.)

@viirya
Member Author

viirya commented Nov 17, 2015

ok.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #46056 has finished for PR 9759 at commit 2777677.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Nov 17, 2015

retest this please.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #46067 has finished for PR 9759 at commit 2777677.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #46093 has finished for PR 9759 at commit b2a835d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val rdd = sqlContext.sparkContext.parallelize(Seq(str))
val df = sqlContext.read.option("allowNonNumericNumbers", "true").json(rdd)
test("allowNonNumericNumbers on") {
val testCases: Seq[String] = Seq("""{"age": NaN}""", """{"age": Infinity}""",
Contributor


can we still read them if they are quoted?

Member Author


No. Unless we set JsonGenerator.Feature.QUOTE_NON_NUMERIC_NUMBERS to false, we can't read them back normally.

@SparkQA

SparkQA commented Nov 20, 2015

Test build #46407 has finished for PR 9759 at commit 186fa5e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -100,34 +101,27 @@ object JacksonParser {
parser.getFloatValue

case (VALUE_STRING, FloatType) =>
// Special case handling for NaN and Infinity.
Contributor


why are we removing the special handling for float types here?

Member Author


Yeah, I should revert that. BTW, did we actually test "inf" and "-inf" before? Because "inf".toFloat is not legal.

val testCases: Seq[String] = Seq("""{"age": NaN}""", """{"age": Infinity}""",
"""{"age": -Infinity}""", """{"age": "NaN"}""", """{"age": "Infinity"}""",
"""{"age": "-Infinity"}""")
val tests: Seq[Double => Boolean] = Seq(_.isNaN, _.isPosInfinity, _.isNegInfinity,
Member Author


Besides, I found that "Inf" and "-Inf" don't seem to work even when JsonGenerator.Feature.QUOTE_NON_NUMERIC_NUMBERS is enabled.

Member Author


We need to upgrade the Jackson library version in order to support "INF" and "-INF" (case-sensitive).

@SparkQA

SparkQA commented Nov 27, 2015

Test build #46812 has finished for PR 9759 at commit 6d90b24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 12, 2016

Test build #58469 has finished for PR 9759 at commit 6d90b24.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JSONOptions.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala
	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
@viirya
Member Author

viirya commented May 16, 2016

@rxin I have revisited this recently. This should be useful. Please check whether it looks good to you now. Thanks.

case VALUE_STRING => StringType
case VALUE_STRING =>
// If there is only one row, the following non-numeric numbers will be incorrectly
// recognized as StringType.
Member Author


E.g., the two tests in JsonParsingOptionsSuite.

@SparkQA

SparkQA commented May 16, 2016

Test build #58626 has finished for PR 9759 at commit c74d715.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2016

Test build #58627 has finished for PR 9759 at commit 6f668c3.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

lowerCaseValue.equals("-infinity") ||
lowerCaseValue.equals("inf") ||
lowerCaseValue.equals("-inf")) {
if (value.equals("NaN") ||
Member Author


"infinity".toDouble and "inf".toDouble are not legal. These non-numeric numbers are case-sensitive, both for Jackson and for Scala.

Contributor


I think we should also allow INF, -INF here, to be consistent with the legal inputs of allowNonNumericNumbers.

Member Author


ok. added.

@SparkQA

SparkQA commented May 16, 2016

Test build #58628 has finished for PR 9759 at commit 1cfd1dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 18, 2016

ping @rxin also cc @cloud-fan @yhuai

value.toFloat
} else if (value.equals("+INF")) {
Contributor

@cloud-fan cloud-fan May 22, 2016


Should we support INF (without the +)? It's weird that we don't need + for Infinity but do need it for INF.

Member Author


Although Jackson supports both Infinity and +Infinity, it only supports +INF (not bare INF). I am neutral about this, but being consistent with Jackson seems better?

Contributor


Thinking from an end user's point of view, we should be consistent with ourselves, not with some arbitrary rule from Jackson.

Member Author


ok. fixed.
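The behavior this thread converges on (exact, case-sensitive tokens, accepting both the bare and +-prefixed INF forms alongside the Infinity forms) can be sketched as follows. The method name and precise token set here are illustrative, not the exact Spark code:

```java
import java.util.OptionalDouble;

public class JsonDoubleFallback {
    // Sketch of the case-sensitive string fallback discussed above:
    // map the accepted non-numeric tokens to special doubles, otherwise
    // fall back to ordinary JVM numeric parsing.
    static OptionalDouble parseJsonDouble(String value) {
        switch (value) {
            case "NaN":
                return OptionalDouble.of(Double.NaN);
            case "Infinity":
            case "+Infinity":
            case "INF":
            case "+INF":
                return OptionalDouble.of(Double.POSITIVE_INFINITY);
            case "-Infinity":
            case "-INF":
                return OptionalDouble.of(Double.NEGATIVE_INFINITY);
            default:
                try {
                    return OptionalDouble.of(Double.parseDouble(value));
                } catch (NumberFormatException e) {
                    // Lowercase variants like "nan" or "inf" end up here.
                    return OptionalDouble.empty();
                }
        }
    }

    public static void main(String[] args) {
        assert Double.isNaN(parseJsonDouble("NaN").getAsDouble());
        assert parseJsonDouble("+INF").getAsDouble() == Double.POSITIVE_INFINITY;
        assert parseJsonDouble("-INF").getAsDouble() == Double.NEGATIVE_INFINITY;
        assert parseJsonDouble("1.5").getAsDouble() == 1.5;
        assert !parseJsonDouble("nan").isPresent();
        System.out.println("all checks passed");
    }
}
```

Note that the Infinity cases are redundant for plain `Double.parseDouble`, which already accepts them; the explicit cases matter for the INF variants, which the JVM parser rejects.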

@SparkQA

SparkQA commented May 23, 2016

Test build #59119 has finished for PR 9759 at commit af1e3a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

The rest LGTM

@viirya
Member Author

viirya commented May 24, 2016

Is this ready to merge now?

@cloud-fan
Contributor

thanks, merging to master and 2.0!

@asfgit asfgit closed this in c24b6b6 May 24, 2016
asfgit pushed a commit that referenced this pull request May 24, 2016
… work

## What changes were proposed in this pull request?

Jackson supports the `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF". The currently used Jackson version (2.5.3) doesn't fully support it. This patch upgrades the library and makes the two ignored tests in `JsonParsingOptionsSuite` pass.

## How was this patch tested?

`JsonParsingOptionsSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9759 from viirya/fix-json-nonnumric.

(cherry picked from commit c24b6b6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@zsxwing
Member

zsxwing commented May 31, 2016

@rxin, could we revert this one? I just hit a regression in jackson-module-scala 2.7.{1, 2, 3} with Option JSON serialization (FasterXML/jackson-module-scala#240) in my PR (#13335). We can re-merge this one once jackson-module-scala fixes the issue.

@srowen
Member

srowen commented May 31, 2016

Darn. Is there any workaround? Downgrade to 2.6.5?

@zsxwing
Member

zsxwing commented May 31, 2016

Darn. Is there any workaround? Downgrade to 2.6.5?

2.6.5 doesn't have the allowNonNumericNumbers feature.

Edited: That's why I need to revert the whole patch.

@srowen
Member

srowen commented May 31, 2016

What's the impact of the problem right now, then? Because it sounds like downgrading would simply cause another problem. It sounds like you have a workaround there?

@zsxwing
Member

zsxwing commented May 31, 2016

What's the impact of the problem right now, then? Because it sounds like downgrading would simply cause another problem. It sounds like you have a workaround there?

The problem is that if a class has an Option parameter, jackson-module-scala may not be able to generate the correct JSON automatically (unless you define a custom serializer yourself). This also impacts users' code.

Reverting this PR just drops an unreleased feature, which I think is not a big deal.

@rxin
Contributor

rxin commented May 31, 2016

Yea I'm OK with reverting this, if the other thing is difficult to work around.

@zsxwing
Member

zsxwing commented May 31, 2016

Sent #13417 to revert it.

@viirya
Member Author

viirya commented Dec 26, 2016

@srowen @rxin @zsxwing The Option JSON serialization issue (FasterXML/jackson-module-scala#240) looks fixed now. Do you think it is OK for me to try to upgrade Jackson now?

@srowen
Member

srowen commented Dec 27, 2016

@viirya to what version? If it's a maintenance release, that's usually fine.

@viirya
Member Author

viirya commented Dec 28, 2016

@srowen Unfortunately, this fix is only included since 2.8.1. Even the latest maintenance release, 2.7.8, doesn't contain it.

@limansky

Hi all. There are security issues in jackson-dataformat-xml prior to 2.7.4 and 2.8.0. Here are the links: FasterXML/jackson-dataformat-xml#199, FasterXML/jackson-dataformat-xml#190. Even though Spark itself doesn't use this module, this dependency forces Spark users to use an affected version in order to keep a consistent set of Jackson libraries.

ghost pushed a commit to dbtsai/spark that referenced this pull request May 13, 2017
…s in JSON

## What changes were proposed in this pull request?

This PR is based on apache#16199 and extracts the valid change from apache#9759 to resolve SPARK-18772.

This avoids an additional conversion attempt with `toFloat` and `toDouble`.

To see the behavior difference, compare the outputs below:

**Before**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NumberFormatException: For input string: "nan"
...
```

**After**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:44:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Cannot parse nan as DoubleType.
...
```

## How was this patch tested?

Unit tests added in `JsonSuite`.

Closes apache#16199

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Nathan Howell <nhowell@godaddy.com>

Closes apache#17956 from HyukjinKwon/SPARK-18772.
asfgit pushed a commit that referenced this pull request May 13, 2017
…s in JSON


(cherry picked from commit 3f98375)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017
…s in JSON

liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
…s in JSON

@viirya viirya deleted the fix-json-nonnumric branch December 27, 2023 18:19