[SPARK-24659][SQL] GenericArrayData.equals should respect element type differences #21643
val array1 = new GenericArrayData(Array[Int](123))
val array2 = new GenericArrayData(Array[Long](123L))

assert(!array1.equals(array2))
Can you also check the positive case, where two arrays have the same element type and the same elements?
Good catch! Can you test short and byte, too?
// GenericArrayData
scala> val arrayByte = new GenericArrayData(Array[Byte](123.toByte))
scala> val arrayShort = new GenericArrayData(Array[Short](123.toShort))
scala> val arrayInt = new GenericArrayData(Array[Int](123))
scala> val arrayLong = new GenericArrayData(Array[Long](123L))
scala> arrayByte.equals(arrayLong)
res8: Boolean = true
scala> arrayByte.equals(arrayInt)
res9: Boolean = true
scala> arrayShort.equals(arrayInt)
res10: Boolean = true
scala> arrayShort.equals(arrayLong)
res11: Boolean = true
// UnsafeArrayData
scala> val unsafeByte = ExpressionEncoder[Array[Byte]].resolveAndBind().toRow(arrayByte).getArray(0)
scala> val unsafeShort = ExpressionEncoder[Array[Short]].resolveAndBind().toRow(arrayShort).getArray(0)
scala> val unsafeInt = ExpressionEncoder[Array[Int]].resolveAndBind().toRow(arrayInt).getArray(0)
scala> val unsafeLong = ExpressionEncoder[Array[Long]].resolveAndBind().toRow(arrayLong).getArray(0)
scala> unsafeByte.equals(unsafeLong)
res12: Boolean = false
scala> unsafeByte.equals(unsafeInt)
res13: Boolean = false
scala> unsafeShort.equals(unsafeInt)
res14: Boolean = false
scala> unsafeShort.equals(unsafeLong)
res15: Boolean = false
Without a schema, Spark can never compare a generic array with an unsafe array. This is not a real bug fix, but it makes the equals semantics of GenericArrayData clearer, which is good to have and might be useful when writing tests.
Can this change break users' apps if arrays of int and long (or other integer-like types) are no longer comparable?
@@ -122,7 +122,7 @@ class GenericArrayData(val array: Array[Any]) extends ArrayData {
             if (!o2.isInstanceOf[Double] || ! java.lang.Double.isNaN(o2.asInstanceOf[Double])) {
               return false
             }
-          case _ => if (o1 != o2) {
+          case _ => if (!o1.equals(o2)) {
Is there any need to handle Array[Byte] separately above?
In Java, byte[] and other primitive arrays don't have a proper equals implementation.
scala> Array(1) == Array(1)
res0: Boolean = false
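To make the point about primitive arrays concrete, here is a minimal sketch in plain Scala (no Spark involved). The object and method names are hypothetical and chosen only for illustration; java.util.Arrays.equals and sameElements are standard library calls.

```scala
import java.util.Arrays

// Hypothetical helper object, for illustration only.
object PrimitiveArrayEquality {
  // Arrays inherit equals from java.lang.Object, so this compares references.
  def referenceEquals(a: Array[Byte], b: Array[Byte]): Boolean = a.equals(b)

  // Element-wise comparison from the Java standard library.
  def contentEquals(a: Array[Byte], b: Array[Byte]): Boolean = Arrays.equals(a, b)

  // Element-wise comparison from the Scala standard library.
  def sameContent(a: Array[Byte], b: Array[Byte]): Boolean = a.sameElements(b)

  def main(args: Array[String]): Unit = {
    val a = Array[Byte](1, 2, 3)
    val b = Array[Byte](1, 2, 3)
    println(referenceEquals(a, b)) // false: two distinct array objects
    println(contentEquals(a, b))   // true
    println(sameContent(a, b))     // true
  }
}
```

This is why code that compares elements which are themselves byte arrays needs an explicit content comparison rather than relying on equals.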
Test build #92335 has finished for PR 21643 at commit
#21643 (review) In Spark SQL, SQL/DataFrame/Dataset operations are all type-checked, and the analyzer will strictly reject operations that involve incompatible types. In all other cases,
Thanks for your comments, @MaxGekk @maropu @cloud-fan. I've tweaked the unit test case to address them. Please check it out again.
arraysShouldEqual(0.toByte, 123.toByte, (-123).toByte) // Byte
arraysShouldEqual(0.toShort, 123.toShort, (-256).toShort) // Short
arraysShouldEqual(0, 123, -65536) // Int
arraysShouldEqual(0L, 123L, -65536L) // Long
It is not important, but if you are checking corner cases, it probably makes sense to pass values like Long.MinValue and Double.MaxValue.
That's a good one. I can do that (and NaNs/Infinity for floating point types too)
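For reference, a small sketch of how these floating-point corner cases behave on the JVM when values are compared through equals on their boxes. This is plain Scala, not Spark code, and the object and method names are hypothetical. It only illustrates java.lang.Double.equals, which is why NaN typically deserves its own test cases.

```scala
// Hypothetical helper object, for illustration only.
object FloatingPointCorners {
  // Values typed as Any are boxed, so equals dispatches to the box class
  // (for example java.lang.Double.equals).
  def boxedEquals(x: Any, y: Any): Boolean = x.equals(y)

  def main(args: Array[String]): Unit = {
    val nan = Double.NaN
    val posZero = 0.0
    val negZero = -0.0

    // Primitive comparison follows IEEE 754.
    println(nan == nan)                 // false
    println(posZero == negZero)         // true

    // java.lang.Double.equals compares bit patterns instead.
    println(boxedEquals(nan, nan))          // true
    println(boxedEquals(posZero, negZero))  // false
    println(boxedEquals(Double.PositiveInfinity, Double.PositiveInfinity)) // true
    println(boxedEquals(Long.MinValue, Long.MinValue))                     // true
  }
}
```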
Test build #92352 has finished for PR 21643 at commit
Test build #92353 has finished for PR 21643 at commit
@@ -104,4 +104,40 @@ class ComplexDataSuite extends SparkFunSuite {
     // The copied data should not be changed externally.
     assert(copied.getStruct(0, 1).getUTF8String(0).toString == "a")
   }
+
+  test("SPARK-24659: GenericArrayData.equals should respect element type differences") {
+    import scala.reflect.ClassTag
nit: you can move this import to the head of this file.
Thanks for your suggestion! I'm used to making one-off imports inside a function when an import is only used within that function, so that the scope is as narrow as possible without being disturbing.
Are there any Spark coding style guidelines that suggest otherwise? If so I'll follow the guideline and always import at the beginning of the file.
LGTM except for one minor comment.
LGTM
thanks, merging to master!
What changes were proposed in this pull request?
Fix GenericArrayData.equals so that it respects the actual types of the elements. For example, an instance that represents an array<int> and another instance that represents an array<long> should be considered incompatible, and thus equals should return false.
GenericArrayData doesn't keep any schema information by itself; it relies on the Java objects referenced by the elements of its array field to keep track of their own object types. So the most straightforward way to respect their types is to call equals on the elements instead of using Scala's == operator, which can have semantics that are not always desirable.
How was this patch tested?
Added a unit test in ComplexDataSuite.
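The difference between Scala's == operator and equals that the description refers to can be sketched in plain Scala as follows. This is an illustration of the language semantics only, not the code from the patch; the object and method names are hypothetical.

```scala
// Hypothetical helper object, for illustration only.
object BoxedNumberEquality {
  // On operands typed as Any, Scala's == compares boxed numbers by numeric
  // value, regardless of which box class holds them.
  def scalaEq(x: Any, y: Any): Boolean = x == y

  // equals on a Java box additionally requires the same box class,
  // so an Integer never equals a Long.
  def javaEquals(x: Any, y: Any): Boolean = x.equals(y)

  def main(args: Array[String]): Unit = {
    val i: Any = 123   // boxed as java.lang.Integer
    val l: Any = 123L  // boxed as java.lang.Long

    println(scalaEq(i, l))       // true
    println(javaEquals(i, l))    // false
    println(javaEquals(i, 123))  // true: same box class, same value
  }
}
```

A container that stores its elements as Array[Any] therefore treats an int 123 and a long 123L as equal under ==, but as different under equals, which matches the REPL results shown earlier in the thread.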