
[SPARK-11753][SQL][test-hadoop2.2] Make allowNonNumericNumbers option work #9759

Closed
wants to merge 12 commits into from

Conversation

viirya
Member

@viirya viirya commented Nov 17, 2015

What changes were proposed in this pull request?

Jackson supports the allowNonNumericNumbers option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF". The currently used Jackson version (2.5.3) doesn't fully support it. This patch upgrades the library and makes the two ignored tests in JsonParsingOptionsSuite pass.

How was this patch tested?

JsonParsingOptionsSuite.
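As background for the review discussion that follows: the JVM's own numeric parsing (which Scala's `"...".toDouble`/`"...".toFloat` delegate to via `Double.parseDouble`/`Float.parseFloat`) accepts exactly the case-sensitive tokens `NaN`, `Infinity`, and `-Infinity`, and nothing else. A quick dependency-free sketch in Java (class and helper names are just for illustration):

```java
public class NonNumericTokens {
    // Returns true iff the JVM's Double.parseDouble accepts the token.
    static boolean parses(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The exact, case-sensitive tokens are accepted...
        assert Double.isNaN(Double.parseDouble("NaN"));
        assert Double.parseDouble("Infinity") == Double.POSITIVE_INFINITY;
        assert Float.parseFloat("-Infinity") == Float.NEGATIVE_INFINITY;

        // ...but the short and lowercase variants are rejected outright,
        // which is why tokens like INF need explicit handling in the parser.
        assert !parses("INF");
        assert !parses("inf");
        assert !parses("nan");

        System.out.println("all checks passed");
    }
}
```

This is why Jackson's allowNonNumericNumbers feature alone is not enough: variants such as "INF" still need dedicated handling on top of plain JVM number parsing.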

@rxin
Contributor

rxin commented Nov 17, 2015

Can you also add nan, infinity, -infinity, inf, and -inf to the test case? And also turn it on by default.

@rxin
Contributor

rxin commented Nov 17, 2015

(Also update the documentation in DataFrameReader and readwrite.py to include this option.)

@viirya
Member Author

viirya commented Nov 17, 2015

ok.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #46056 has finished for PR 9759 at commit 2777677.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Nov 17, 2015

retest this please.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #46067 has finished for PR 9759 at commit 2777677.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2015

Test build #46093 has finished for PR 9759 at commit b2a835d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val rdd = sqlContext.sparkContext.parallelize(Seq(str))
val df = sqlContext.read.option("allowNonNumericNumbers", "true").json(rdd)
test("allowNonNumericNumbers on") {
val testCases: Seq[String] = Seq("""{"age": NaN}""", """{"age": Infinity}""",
Contributor


can we still read them if they are quoted?

Member Author


No. Unless we set JsonGenerator.Feature.QUOTE_NON_NUMERIC_NUMBERS to false, we can't read them back normally.

@SparkQA

SparkQA commented Nov 20, 2015

Test build #46407 has finished for PR 9759 at commit 186fa5e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -100,34 +101,27 @@ object JacksonParser {
parser.getFloatValue

case (VALUE_STRING, FloatType) =>
// Special case handling for NaN and Infinity.
Contributor


why are we removing the special handling for float types here?

Member Author


Yeah, I should revert that. BTW, did we actually test "inf" and "-inf" before? Because "inf".toFloat is not legal.

val testCases: Seq[String] = Seq("""{"age": NaN}""", """{"age": Infinity}""",
"""{"age": -Infinity}""", """{"age": "NaN"}""", """{"age": "Infinity"}""",
"""{"age": "-Infinity"}""")
val tests: Seq[Double => Boolean] = Seq(_.isNaN, _.isPosInfinity, _.isNegInfinity,
Member Author


Besides, I found that "Inf" and "-Inf" don't seem to work even when JsonGenerator.Feature.QUOTE_NON_NUMERIC_NUMBERS is enabled.

Member Author


We need to upgrade the Jackson library version in order to support "INF" and "-INF" (case-sensitive).

@SparkQA

SparkQA commented Nov 27, 2015

Test build #46812 has finished for PR 9759 at commit 6d90b24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 12, 2016

Test build #58469 has finished for PR 9759 at commit 6d90b24.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JSONOptions.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala
	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala
	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
@viirya
Member Author

viirya commented May 16, 2016

@rxin I have revisited this recently. This should be useful. Please check whether it looks good to you now. Thanks.

case VALUE_STRING => StringType
case VALUE_STRING =>
// If there is only one row, the following non-numeric numbers will be incorrectly
// recognized as StringType.
Member Author


E.g., the two tests in JsonParsingOptionsSuite.

@SparkQA

SparkQA commented May 16, 2016

Test build #58626 has finished for PR 9759 at commit c74d715.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 16, 2016

Test build #58627 has finished for PR 9759 at commit 6f668c3.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

lowerCaseValue.equals("-infinity") ||
lowerCaseValue.equals("inf") ||
lowerCaseValue.equals("-inf")) {
if (value.equals("NaN") ||
Member Author


"infinity".toDouble and "inf".toDouble are not legal. These non-numeric numbers are case-sensitive, both for Jackson and for Scala.

Contributor


I think we should also allow INF, -INF here, to be consistent with the legal inputs of allowNonNumericNumbers.

Member Author


ok. added.

@SparkQA

SparkQA commented May 16, 2016

Test build #58628 has finished for PR 9759 at commit 1cfd1dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented May 18, 2016

ping @rxin also cc @cloud-fan @yhuai

value.toFloat
} else if (value.equals("+INF")) {
Contributor

@cloud-fan cloud-fan May 22, 2016


Should we support INF (without the +)? It's weird that we don't need + for Infinity but do need it for INF.

Member Author


Although Jackson supports both Infinity and +Infinity, it only supports +INF (not bare INF). I am neutral about this, but being consistent with Jackson seems better?

Contributor


Thinking from an end user's point of view, we should be consistent with ourselves, not with some arbitrary rule from Jackson.

Member Author


ok. fixed.
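The behavior this thread converges on (exact, case-sensitive tokens, accepting both the bare and +-prefixed INF forms alongside the Infinity forms) can be sketched as follows. The method name and precise token set here are illustrative, not the exact Spark code:

```java
import java.util.OptionalDouble;

public class JsonDoubleFallback {
    // Sketch of the case-sensitive string fallback discussed above:
    // map the accepted non-numeric tokens to special doubles, otherwise
    // fall back to ordinary JVM numeric parsing.
    static OptionalDouble parseJsonDouble(String value) {
        switch (value) {
            case "NaN":
                return OptionalDouble.of(Double.NaN);
            case "Infinity":
            case "+Infinity":
            case "INF":
            case "+INF":
                return OptionalDouble.of(Double.POSITIVE_INFINITY);
            case "-Infinity":
            case "-INF":
                return OptionalDouble.of(Double.NEGATIVE_INFINITY);
            default:
                try {
                    return OptionalDouble.of(Double.parseDouble(value));
                } catch (NumberFormatException e) {
                    // Lowercase variants like "nan" or "inf" end up here.
                    return OptionalDouble.empty();
                }
        }
    }

    public static void main(String[] args) {
        assert Double.isNaN(parseJsonDouble("NaN").getAsDouble());
        assert parseJsonDouble("+INF").getAsDouble() == Double.POSITIVE_INFINITY;
        assert parseJsonDouble("-INF").getAsDouble() == Double.NEGATIVE_INFINITY;
        assert parseJsonDouble("1.5").getAsDouble() == 1.5;
        assert !parseJsonDouble("nan").isPresent();
        System.out.println("all checks passed");
    }
}
```

Note that the Infinity cases are redundant for plain `Double.parseDouble`, which already accepts them; the explicit cases matter for the INF variants, which the JVM parser rejects.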

@SparkQA

SparkQA commented May 23, 2016

Test build #59119 has finished for PR 9759 at commit af1e3a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

The rest LGTM

@viirya
Member Author

viirya commented May 24, 2016

Is this ready to merge now?

@cloud-fan
Contributor

thanks, merging to master and 2.0!

@asfgit asfgit closed this in c24b6b6 May 24, 2016
asfgit pushed a commit that referenced this pull request May 24, 2016
… work

## What changes were proposed in this pull request?

Jackson supports the `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF". The currently used Jackson version (2.5.3) doesn't fully support it. This patch upgrades the library and makes the two ignored tests in `JsonParsingOptionsSuite` pass.

## How was this patch tested?

`JsonParsingOptionsSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9759 from viirya/fix-json-nonnumric.

(cherry picked from commit c24b6b6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@zsxwing
Member

zsxwing commented May 31, 2016

@rxin, could we revert this one? I just hit a regression in jackson-module-scala 2.7.{1, 2, 3} with Option JSON serialization (FasterXML/jackson-module-scala#240) in my PR (#13335). We can re-merge this one once jackson-module-scala fixes the issue.

@srowen
Member

srowen commented May 31, 2016

Darn. Is there any workaround? Downgrade to 2.6.5?

@zsxwing
Member

zsxwing commented May 31, 2016

Darn. Is there any workaround? Downgrade to 2.6.5?

2.6.5 doesn't have the allowNonNumericNumbers feature.

Edited: That's why I need to revert the whole patch.

@srowen
Member

srowen commented May 31, 2016

What's the impact of the problem right now, then? Because it sounds like downgrading would simply cause another problem. It sounds like you have a workaround there?

@zsxwing
Member

zsxwing commented May 31, 2016

What's the impact of the problem right now, then? Because it sounds like downgrading would simply cause another problem. It sounds like you have a workaround there?

The problem is that if a class has an Option parameter, jackson-module-scala may not be able to generate the correct JSON automatically (unless you define a custom serializer yourself). This also impacts users' code.

Reverting this PR just drops an unreleased feature, which I think is not a big deal.

@rxin
Contributor

rxin commented May 31, 2016

Yea I'm OK with reverting this, if the other thing is difficult to work around.

@zsxwing
Member

zsxwing commented May 31, 2016

Sent #13417 to revert it.

@viirya
Member Author

viirya commented Dec 26, 2016

@srowen @rxin @zsxwing The Option JSON serialization issue (FasterXML/jackson-module-scala#240) looks fixed now. Do you think it is OK for me to try to upgrade Jackson now?

@srowen
Member

srowen commented Dec 27, 2016

@viirya to what version? If it's a maintenance release, that's usually fine.

@viirya
Member Author

viirya commented Dec 28, 2016

@srowen Unfortunately, this fix is only included since 2.8.1. Even the latest maintenance release, 2.7.8, doesn't contain it.

@limansky

Hi all. There are security issues in jackson-dataformat-xml prior to 2.7.4 and 2.8.0. Here are the links: FasterXML/jackson-dataformat-xml#199, FasterXML/jackson-dataformat-xml#190. Even though Spark itself doesn't use this module, this dependency forces Spark users to use an affected version in order to keep a consistent set of Jackson libraries.

ghost pushed a commit to dbtsai/spark that referenced this pull request May 13, 2017
…s in JSON

## What changes were proposed in this pull request?

This PR is based on apache#16199 and extracts the valid change from apache#9759 to resolve SPARK-18772.

This avoids an additional conversion attempt with `toFloat` and `toDouble`.

To see the behavior difference, compare the outputs below:

**Before**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:30:41 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NumberFormatException: For input string: "nan"
...
```

**After**

```scala
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.schema(StructType(Seq(StructField("a", DoubleType)))).option("mode", "FAILFAST").json(Seq("""{"a": "nan"}""").toDS).show()
17/05/12 11:44:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Cannot parse nan as DoubleType.
...
```

## How was this patch tested?

Unit tests added in `JsonSuite`.

Closes apache#16199

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Nathan Howell <nhowell@godaddy.com>

Closes apache#17956 from HyukjinKwon/SPARK-18772.
asfgit pushed a commit that referenced this pull request May 13, 2017
…s in JSON


(cherry picked from commit 3f98375)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017
…s in JSON

liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
…s in JSON

@viirya viirya deleted the fix-json-nonnumric branch December 27, 2023 18:19