
[SPARK-24659][SQL] GenericArrayData.equals should respect element type differences #21643

Closed

@rednaxelafx (Contributor)

What changes were proposed in this pull request?

Fix GenericArrayData.equals so that it respects the actual types of the elements. For example, an instance that represents an array<int> and another that represents an array<long> should be considered incompatible, so equals should return false.

GenericArrayData doesn't keep any schema information by itself; rather, it relies on the Java objects referenced by its array field's elements to carry their own types. So the most straightforward way to respect those types is to call equals on the elements instead of using Scala's == operator, whose semantics are not always desirable:

new java.lang.Integer(123) == new java.lang.Long(123L) // true in Scala
new java.lang.Integer(123).equals(new java.lang.Long(123L)) // false in Scala
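
For boxed numeric operands typed as Any, Scala's == routes through runtime cooperative equality (scala.runtime.BoxesRunTime.equals), which compares numeric values across box types, while Java's equals first checks the class. A minimal REPL sketch, added here for illustration:

// Scala's == on values typed as Any goes through BoxesRunTime.equals,
// which treats 123 and 123L as equal numbers; Java's equals does not.
val i: Any = java.lang.Integer.valueOf(123)
val l: Any = java.lang.Long.valueOf(123L)
i == l        // true: cooperative numeric equality via BoxesRunTime
i.equals(l)   // false: Integer.equals requires another Integer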

How was this patch tested?

Added a unit test in ComplexDataSuite:

val array1 = new GenericArrayData(Array[Int](123))
val array2 = new GenericArrayData(Array[Long](123L))

assert(!array1.equals(array2))

@MaxGekk (Member) commented Jun 26, 2018

Can you also check the positive case, where two arrays have the same element type and the same elements?

Member commented:

Good catch! Can you test short and byte, too?


// GenericArrayData
scala> val arrayByte = new GenericArrayData(Array[Byte](123.toByte))
scala> val arrayShort = new GenericArrayData(Array[Short](123.toShort))
scala> val arrayInt = new GenericArrayData(Array[Int](123))
scala> val arrayLong = new GenericArrayData(Array[Long](123L))
scala> arrayByte.equals(arrayLong)
res8: Boolean = true

scala> arrayByte.equals(arrayInt)
res9: Boolean = true

scala> arrayShort.equals(arrayInt)
res10: Boolean = true

scala> arrayShort.equals(arrayLong)
res11: Boolean = true


// UnsafeArrayData
scala> val unsafeByte = ExpressionEncoder[Array[Byte]].resolveAndBind().toRow(Array[Byte](123.toByte)).getArray(0)
scala> val unsafeShort = ExpressionEncoder[Array[Short]].resolveAndBind().toRow(Array[Short](123.toShort)).getArray(0)
scala> val unsafeInt = ExpressionEncoder[Array[Int]].resolveAndBind().toRow(Array[Int](123)).getArray(0)
scala> val unsafeLong = ExpressionEncoder[Array[Long]].resolveAndBind().toRow(Array[Long](123L)).getArray(0)
scala> unsafeByte.equals(unsafeLong)
res12: Boolean = false

scala> unsafeByte.equals(unsafeInt)
res13: Boolean = false

scala> unsafeShort.equals(unsafeInt)
res14: Boolean = false

scala> unsafeShort.equals(unsafeLong)
res15: Boolean = false

Contributor commented:

Without schema information, Spark can never compare a generic array and an unsafe array anyway. This fix is not a real bug fix, but it makes GenericArrayData's equals semantics clearer, which is good to have and can be useful when writing tests.

@MaxGekk (Member) left a comment

Can these changes break users' apps if arrays of int and long (or other integer-like types) are no longer comparable?

@@ -122,7 +122,7 @@ class GenericArrayData(val array: Array[Any]) extends ArrayData {
           if (!o2.isInstanceOf[Double] || ! java.lang.Double.isNaN(o2.asInstanceOf[Double])) {
             return false
           }
-        case _ => if (o1 != o2) {
+        case _ => if (!o1.equals(o2)) {
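
For context, here is a condensed reconstruction of the per-element check around this hunk, expressed as a standalone predicate. This is an assumption for illustration, not the verbatim Spark source:

// Reconstructed sketch: how GenericArrayData.equals compares one pair of
// non-null elements after this change. Binary arrays get a content-wise
// comparison, NaNs compare equal to NaNs, and everything else falls through
// to the element's own equals (instead of Scala's ==).
def elementsEqual(o1: Any, o2: Any): Boolean = o1 match {
  case b1: Array[Byte] =>
    o2.isInstanceOf[Array[Byte]] &&
      java.util.Arrays.equals(b1, o2.asInstanceOf[Array[Byte]])
  case f1: Float if java.lang.Float.isNaN(f1) =>
    o2.isInstanceOf[Float] && java.lang.Float.isNaN(o2.asInstanceOf[Float])
  case d1: Double if java.lang.Double.isNaN(d1) =>
    o2.isInstanceOf[Double] && java.lang.Double.isNaN(o2.asInstanceOf[Double])
  case _ => o1.equals(o2)
}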

Member commented:

Is there any need to handle Array[Byte] separately above?

Contributor commented:

In Java, byte[] and other primitive arrays don't have a proper equals implementation; equals falls back to reference equality.

scala> Array(1) == Array(1)
res0: Boolean = false
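
For a content-wise comparison of primitive arrays one can instead use the standard library helpers (shown here for illustration; not part of this PR):

java.util.Arrays.equals(Array(1), Array(1))  // true: element-wise comparison
Array(1).sameElements(Array(1))              // true: Scala's element-wise check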

@SparkQA commented Jun 26, 2018

Test build #92335 has finished for PR 21643 at commit d91b44a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rednaxelafx (Contributor, Author) commented:

#21643 (review)
For all practical purposes, no, this change shouldn't break any use cases that we support; there could be use cases that we didn't intend to support that might break.

In Spark SQL, SQL/DataFrame/Dataset operations are all type-checked, and the analyzer will strictly reject operations that involve incompatible types.
Array types in Spark SQL are neither covariant nor contravariant, so, e.g., array<int> and array<long> are considered incompatible, and the analyzer would have rejected an equality comparison between them in the first place.

Beyond that, GenericArrayData is a Spark SQL internal type that shouldn't be used from the outside, and I don't think we ever intended to support direct external access to it.
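
A hypothetical spark-shell probe of that analyzer behavior (illustrative only; the column names and the exact error text are assumptions):

import spark.implicits._

val ints  = Seq(Array(1, 2)).toDF("a")    // column a: array<int>
val longs = Seq(Array(1L, 2L)).toDF("b")  // column b: array<bigint>

// Expected to fail analysis with a data type mismatch, per the argument above.
ints.crossJoin(longs).filter($"a" === $"b")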

@rednaxelafx (Contributor, Author) commented:

Thanks for your comments, @MaxGekk @maropu @cloud-fan. I've tweaked the unit test case to address them; please take another look.

arraysShouldEqual(0.toByte, 123.toByte, (-123).toByte) // Byte
arraysShouldEqual(0.toShort, 123.toShort, (-256).toShort) // Short
arraysShouldEqual(0, 123, -65536) // Int
arraysShouldEqual(0L, 123L, -65536L) // Long
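
For reference, a hypothetical reconstruction of what such an arraysShouldEqual helper could look like (the actual helper lives in ComplexDataSuite; this sketch is an assumption):

import scala.reflect.ClassTag

// Hypothetical positive-case helper: build two GenericArrayData instances
// from the same elements and expect them to compare equal.
def arraysShouldEqual[T: ClassTag](elements: T*): Unit = {
  val array1 = new GenericArrayData(Array[T](elements: _*))
  val array2 = new GenericArrayData(Array[T](elements: _*))
  assert(array1.equals(array2))
}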

Member commented:

It's not critical, but if you are checking corner cases, it probably makes sense to also pass values like Long.MinValue and Double.MaxValue.

Contributor (Author) commented:

That's a good one. I can do that (and NaNs/Infinity for the floating-point types, too).

@SparkQA commented Jun 26, 2018

Test build #92352 has finished for PR 21643 at commit 083673c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 26, 2018

Test build #92353 has finished for PR 21643 at commit a30de22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -104,4 +104,40 @@ class ComplexDataSuite extends SparkFunSuite {
     // The copied data should not be changed externally.
     assert(copied.getStruct(0, 1).getUTF8String(0).toString == "a")
   }
 
+  test("SPARK-24659: GenericArrayData.equals should respect element type differences") {
+    import scala.reflect.ClassTag

Member commented:

nit: you can move this import to the head of this file.

Contributor (Author) commented:

Thanks for your suggestion! I'm used to making one-off imports inside a function when the import is only used within that function, so that its scope is as narrow as possible without being intrusive.
Are there any Spark coding style guidelines that suggest otherwise? If so, I'll follow the guideline and always import at the beginning of the file.

@maropu (Member) commented Jun 27, 2018

LGTM except for one minor comment.

@viirya (Member) left a comment

LGTM

@cloud-fan (Contributor) commented:

thanks, merging to master!

@asfgit closed this in 1b9368f Jun 27, 2018