
[SPARK-11753][SQL][test-hadoop2.2] Make allowNonNumericNumbers option work #9759

Closed
wants to merge 12 commits into from
11 changes: 6 additions & 5 deletions dev/deps/spark-deps-hadoop-2.2
Original file line number Diff line number Diff line change
@@ -72,12 +72,13 @@ hk2-utils-2.4.0-b34.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
ivy-2.4.0.jar
jackson-annotations-2.5.3.jar
jackson-core-2.5.3.jar
jackson-annotations-2.7.3.jar
jackson-core-2.7.3.jar
jackson-core-asl-1.9.13.jar
jackson-databind-2.5.3.jar
jackson-databind-2.7.3.jar
jackson-mapper-asl-1.9.13.jar
jackson-module-scala_2.11-2.5.3.jar
jackson-module-paranamer-2.7.3.jar
jackson-module-scala_2.11-2.7.3.jar
janino-2.7.8.jar
javassist-3.18.1-GA.jar
javax.annotation-api-1.2.jar
@@ -127,7 +128,7 @@ objenesis-2.1.jar
opencsv-2.3.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.1.jar
paranamer-2.6.jar
paranamer-2.8.jar
parquet-column-1.7.0.jar
parquet-common-1.7.0.jar
parquet-encoding-1.7.0.jar
11 changes: 6 additions & 5 deletions dev/deps/spark-deps-hadoop-2.3
@@ -74,12 +74,13 @@ hk2-utils-2.4.0-b34.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
ivy-2.4.0.jar
jackson-annotations-2.5.3.jar
jackson-core-2.5.3.jar
jackson-annotations-2.7.3.jar
jackson-core-2.7.3.jar
jackson-core-asl-1.9.13.jar
jackson-databind-2.5.3.jar
jackson-databind-2.7.3.jar
jackson-mapper-asl-1.9.13.jar
jackson-module-scala_2.11-2.5.3.jar
jackson-module-paranamer-2.7.3.jar
jackson-module-scala_2.11-2.7.3.jar
janino-2.7.8.jar
java-xmlbuilder-1.0.jar
javassist-3.18.1-GA.jar
@@ -134,7 +135,7 @@ objenesis-2.1.jar
opencsv-2.3.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.1.jar
paranamer-2.6.jar
paranamer-2.8.jar
parquet-column-1.7.0.jar
parquet-common-1.7.0.jar
parquet-encoding-1.7.0.jar
11 changes: 6 additions & 5 deletions dev/deps/spark-deps-hadoop-2.4
@@ -74,12 +74,13 @@ hk2-utils-2.4.0-b34.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
ivy-2.4.0.jar
jackson-annotations-2.5.3.jar
jackson-core-2.5.3.jar
jackson-annotations-2.7.3.jar
jackson-core-2.7.3.jar
jackson-core-asl-1.9.13.jar
jackson-databind-2.5.3.jar
jackson-databind-2.7.3.jar
jackson-mapper-asl-1.9.13.jar
jackson-module-scala_2.11-2.5.3.jar
jackson-module-paranamer-2.7.3.jar
jackson-module-scala_2.11-2.7.3.jar
janino-2.7.8.jar
java-xmlbuilder-1.0.jar
javassist-3.18.1-GA.jar
@@ -134,7 +135,7 @@ objenesis-2.1.jar
opencsv-2.3.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.1.jar
paranamer-2.6.jar
paranamer-2.8.jar
parquet-column-1.7.0.jar
parquet-common-1.7.0.jar
parquet-encoding-1.7.0.jar
11 changes: 6 additions & 5 deletions dev/deps/spark-deps-hadoop-2.6
@@ -80,13 +80,14 @@ htrace-core-3.0.4.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
ivy-2.4.0.jar
jackson-annotations-2.5.3.jar
jackson-core-2.5.3.jar
jackson-annotations-2.7.3.jar
jackson-core-2.7.3.jar
jackson-core-asl-1.9.13.jar
jackson-databind-2.5.3.jar
jackson-databind-2.7.3.jar
jackson-jaxrs-1.9.13.jar
jackson-mapper-asl-1.9.13.jar
jackson-module-scala_2.11-2.5.3.jar
jackson-module-paranamer-2.7.3.jar
jackson-module-scala_2.11-2.7.3.jar
jackson-xc-1.9.13.jar
janino-2.7.8.jar
java-xmlbuilder-1.0.jar
@@ -142,7 +143,7 @@ objenesis-2.1.jar
opencsv-2.3.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.1.jar
paranamer-2.6.jar
paranamer-2.8.jar
parquet-column-1.7.0.jar
parquet-common-1.7.0.jar
parquet-encoding-1.7.0.jar
11 changes: 6 additions & 5 deletions dev/deps/spark-deps-hadoop-2.7
@@ -80,13 +80,14 @@ htrace-core-3.1.0-incubating.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
ivy-2.4.0.jar
jackson-annotations-2.5.3.jar
jackson-core-2.5.3.jar
jackson-annotations-2.7.3.jar
jackson-core-2.7.3.jar
jackson-core-asl-1.9.13.jar
jackson-databind-2.5.3.jar
jackson-databind-2.7.3.jar
jackson-jaxrs-1.9.13.jar
jackson-mapper-asl-1.9.13.jar
jackson-module-scala_2.11-2.5.3.jar
jackson-module-paranamer-2.7.3.jar
jackson-module-scala_2.11-2.7.3.jar
jackson-xc-1.9.13.jar
janino-2.7.8.jar
java-xmlbuilder-1.0.jar
@@ -143,7 +144,7 @@ objenesis-2.1.jar
opencsv-2.3.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.1.jar
paranamer-2.6.jar
paranamer-2.8.jar
parquet-column-1.7.0.jar
parquet-common-1.7.0.jar
parquet-encoding-1.7.0.jar
8 changes: 7 additions & 1 deletion pom.xml
@@ -161,7 +161,7 @@
<jline.version>${scala.version}</jline.version>
<jline.groupid>org.scala-lang</jline.groupid>
<codehaus.jackson.version>1.9.13</codehaus.jackson.version>
<fasterxml.jackson.version>2.5.3</fasterxml.jackson.version>
<fasterxml.jackson.version>2.7.3</fasterxml.jackson.version>
<snappy.version>1.1.2.4</snappy.version>
<netlib.java.version>1.1.2</netlib.java.version>
<calcite.version>1.2.0-incubating</calcite.version>
@@ -180,6 +180,7 @@
<antlr4.version>4.5.2-1</antlr4.version>
<jpam.version>1.1</jpam.version>
<selenium.version>2.52.0</selenium.version>
<paranamer.version>2.8</paranamer.version>

<test.java.home>${java.home}</test.java.home>
<test.exclude.tags></test.exclude.tags>
@@ -1821,6 +1822,11 @@
<artifactId>antlr4-runtime</artifactId>
<version>${antlr4.version}</version>
</dependency>
<dependency>
<groupId>com.thoughtworks.paranamer</groupId>
<artifactId>paranamer</artifactId>
<version>${paranamer.version}</version>
</dependency>
</dependencies>
</dependencyManagement>

3 changes: 3 additions & 0 deletions python/pyspark/sql/readwriter.py
@@ -193,6 +193,9 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
set, it uses the default value, ``true``.
:param allowNumericLeadingZero: allows leading zeros in numbers (e.g. 00012). If None is
set, it uses the default value, ``false``.
:param allowNonNumericNumbers: allows using non-numeric numbers such as "NaN", "Infinity",
"-Infinity", "INF", "-INF", which are converted to floating
point numbers. If None is set, it uses the default value, ``true``.
:param allowBackslashEscapingAnyCharacter: allows accepting quoting of all character
using backslash quoting mechanism. If None is
set, it uses the default value, ``false``.
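The `allowNonNumericNumbers` option described in this docstring has a close analogue in Python's standard-library `json` module, which can make the semantics concrete. This is an illustrative sketch of the analogy only, not Spark code:

```python
import json
import math

# Like Jackson with ALLOW_NON_NUMERIC_NUMBERS enabled (Spark's default for
# allowNonNumericNumbers), Python's json module accepts the bare tokens
# NaN, Infinity and -Infinity by default.
record = json.loads('{"age": NaN}')
assert math.isnan(record["age"])

# Supplying parse_constant is the analogue of allowNonNumericNumbers=false:
# the parser hands each non-numeric token to the callback, which rejects it.
def reject(token):
    raise ValueError("non-numeric number not allowed: " + token)

try:
    json.loads('{"age": Infinity}', parse_constant=reject)
except ValueError as exc:
    print(exc)  # non-numeric number not allowed: Infinity
```

Note that, unlike the Spark option, the stdlib parser recognizes only `NaN`, `Infinity`, and `-Infinity` (not `INF`/`-INF`), so the analogy covers the on/off behavior rather than the exact token set.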
@@ -293,6 +293,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* </li>
* <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers
* (e.g. 00012)</li>
* <li>`allowNonNumericNumbers` (default `true`): allows using non-numeric numbers such as "NaN",
* "Infinity", "-Infinity", "INF", "-INF", which are converted to floating point numbers.</li>
* <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all
* character using backslash quoting mechanism</li>
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
@@ -129,13 +129,15 @@ object JacksonParser extends Logging {
case (VALUE_STRING, FloatType) =>
// Special case handling for NaN and Infinity.
Contributor:
why are we removing the special handling for float types here?

Member Author:
yea, should revert it back. BTW, do we actually test "inf" and "-inf" before? Because "inf".toFloat is not legal.

      val value = parser.getText
-      val lowerCaseValue = value.toLowerCase()
-      if (lowerCaseValue.equals("nan") ||
-          lowerCaseValue.equals("infinity") ||
-          lowerCaseValue.equals("-infinity") ||
-          lowerCaseValue.equals("inf") ||
-          lowerCaseValue.equals("-inf")) {
+      if (value.equals("NaN") ||
+          value.equals("Infinity") ||
+          value.equals("+Infinity") ||
+          value.equals("-Infinity")) {
        value.toFloat
+      } else if (value.equals("+INF") || value.equals("INF")) {
+        Float.PositiveInfinity
+      } else if (value.equals("-INF")) {
+        Float.NegativeInfinity
      } else {
        throw new SparkSQLJsonProcessingException(s"Cannot parse $value as FloatType.")
}
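The branch structure above can be mirrored in a small Python sketch (a hypothetical helper for illustration, not Spark's API) that makes the case-sensitivity explicit — after this change, lowercase spellings such as "nan" or "inf" are rejected instead of parsed:

```python
import math

def parse_float_token(value):
    """Hypothetical mirror of the (VALUE_STRING, FloatType) branch above:
    tokens are matched case-sensitively, as in the patched Scala code."""
    if value in ("NaN", "Infinity", "+Infinity", "-Infinity"):
        # Python's float() happens to accept these exact spellings too,
        # much as Scala's value.toFloat does.
        return float(value)
    elif value in ("+INF", "INF"):
        return math.inf
    elif value == "-INF":
        return -math.inf
    else:
        raise ValueError("Cannot parse %s as FloatType." % value)

# Case matters: "NaN" parses, "nan" would raise ValueError.
assert math.isnan(parse_float_token("NaN"))
assert parse_float_token("-INF") == -math.inf
```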
@@ -146,13 +148,15 @@ object JacksonParser extends Logging {
case (VALUE_STRING, DoubleType) =>
// Special case handling for NaN and Infinity.
      val value = parser.getText
-      val lowerCaseValue = value.toLowerCase()
-      if (lowerCaseValue.equals("nan") ||
-          lowerCaseValue.equals("infinity") ||
-          lowerCaseValue.equals("-infinity") ||
-          lowerCaseValue.equals("inf") ||
-          lowerCaseValue.equals("-inf")) {
+      if (value.equals("NaN") ||
+          value.equals("Infinity") ||
+          value.equals("+Infinity") ||
+          value.equals("-Infinity")) {
        value.toDouble
+      } else if (value.equals("+INF") || value.equals("INF")) {
+        Double.PositiveInfinity
+      } else if (value.equals("-INF")) {
+        Double.NegativeInfinity
      } else {
        throw new SparkSQLJsonProcessingException(s"Cannot parse $value as DoubleType.")
      }

Member Author:
"infinity".toDouble, "inf".toDouble are not legal. These non-numeric numbers are case-sensitive, both for Jackson and Scala.

Contributor:
I think we should also allow INF, -INF here, to be consistent with the legal inputs of allowNonNumericNumbers.

Member Author:
ok. added.
@@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.datasources.json

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSQLContext
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

/**
* Test cases for various [[JSONOptions]].
@@ -93,23 +94,51 @@ class JsonParsingOptionsSuite extends QueryTest with SharedSQLContext {
assert(df.first().getLong(0) == 18)
}

-  // The following two tests are not really working - need to look into Jackson's
-  // JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS.
-  ignore("allowNonNumericNumbers off") {
-    val str = """{"age": NaN}"""
-    val rdd = spark.sparkContext.parallelize(Seq(str))
-    val df = spark.read.json(rdd)
-
-    assert(df.schema.head.name == "_corrupt_record")
+  test("allowNonNumericNumbers off") {
+    // non-quoted non-numeric numbers don't work if allowNonNumericNumbers is off.
+    var testCases: Seq[String] = Seq("""{"age": NaN}""", """{"age": Infinity}""",
+      """{"age": +Infinity}""", """{"age": -Infinity}""", """{"age": INF}""",
+      """{"age": +INF}""", """{"age": -INF}""")
+    testCases.foreach { str =>
+      val rdd = spark.sparkContext.parallelize(Seq(str))
+      val df = spark.read.option("allowNonNumericNumbers", "false").json(rdd)
+
+      assert(df.schema.head.name == "_corrupt_record")
+    }
+
+    // quoted non-numeric numbers should still work even allowNonNumericNumbers is off.
+    testCases = Seq("""{"age": "NaN"}""", """{"age": "Infinity"}""", """{"age": "+Infinity"}""",
+      """{"age": "-Infinity"}""", """{"age": "INF"}""", """{"age": "+INF"}""",
+      """{"age": "-INF"}""")
+    val tests: Seq[Double => Boolean] = Seq(_.isNaN, _.isPosInfinity, _.isPosInfinity,
+      _.isNegInfinity, _.isPosInfinity, _.isPosInfinity, _.isNegInfinity)
+    val schema = StructType(StructField("age", DoubleType, true) :: Nil)
+
+    testCases.zipWithIndex.foreach { case (str, idx) =>
+      val rdd = spark.sparkContext.parallelize(Seq(str))
+      val df = spark.read.option("allowNonNumericNumbers", "false").schema(schema).json(rdd)
+
+      assert(df.schema.head.name == "age")
+      assert(tests(idx)(df.first().getDouble(0)))
Contributor:
why is it double type? Shouldn't it be string if allowNonNumericNumbers is off?

Member Author:
from @rxin's comment that we want to support quoted non-numeric numbers when allowNonNumericNumbers is off.

Contributor (@cloud-fan, May 18, 2016):
This doesn't make sense to me. What if users really want to use "NaN" as a string?
cc @rxin

Member Author:
Then I shouldn't change InferSchema. The tests here also need to add a few doubles. I will update it.

Member Author:
In other words, it should be a number when the field is inferred as double/float type.

Member Author:
@cloud-fan I updated it.

Non-quoted non-numeric numbers are parsed as double when the corresponding field is double/float type. This behavior is the same as before this patch.

Contributor:
Did I say that anywhere?

Contributor:
Yea, I don't think that's what I meant there.
}
}
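The quoted-token path this test exercises can be sketched with Python's stdlib `json` module (an analogy for illustration, not Spark itself): under strict parsing, `"NaN"` in quotes is ordinary JSON — it parses to a string, which a `DoubleType` schema can subsequently cast:

```python
import json
import math

# A quoted token is plain JSON: strict parsers accept it as a string even
# when bare NaN/Infinity tokens are disallowed.
record = json.loads('{"age": "NaN"}')
assert isinstance(record["age"], str)

# The DoubleType schema cast in the test corresponds to a float() conversion.
assert math.isnan(float(record["age"]))
assert float(json.loads('{"age": "-Infinity"}')["age"]) == -math.inf
```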

-  ignore("allowNonNumericNumbers on") {
-    val str = """{"age": NaN}"""
-    val rdd = spark.sparkContext.parallelize(Seq(str))
-    val df = spark.read.option("allowNonNumericNumbers", "true").json(rdd)
-
-    assert(df.schema.head.name == "age")
-    assert(df.first().getDouble(0).isNaN)
+  test("allowNonNumericNumbers on") {
+    val testCases: Seq[String] = Seq("""{"age": NaN}""", """{"age": Infinity}""",
Contributor:
can we still read them if they are quoted?

Member Author:
No, so if we don't set JsonGenerator.Feature.QUOTE_NON_NUMERIC_NUMBERS to false, we can't read them normally.

"""{"age": +Infinity}""", """{"age": -Infinity}""", """{"age": +INF}""",
"""{"age": -INF}""", """{"age": "NaN"}""", """{"age": "Infinity"}""",
"""{"age": "-Infinity"}""")
val tests: Seq[Double => Boolean] = Seq(_.isNaN, _.isPosInfinity, _.isPosInfinity,
_.isNegInfinity, _.isPosInfinity, _.isNegInfinity, _.isNaN, _.isPosInfinity,
_.isNegInfinity, _.isPosInfinity, _.isNegInfinity)
val schema = StructType(StructField("age", DoubleType, true) :: Nil)
testCases.zipWithIndex.foreach { case (str, idx) =>
val rdd = spark.sparkContext.parallelize(Seq(str))
val df = spark.read.option("allowNonNumericNumbers", "true").schema(schema).json(rdd)

assert(df.schema.head.name == "age")
assert(tests(idx)(df.first().getDouble(0)))
}
}

test("allowBackslashEscapingAnyCharacter off") {