-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-8782] [SQL] Fix code generation for ORDER BY NULL #7179
Conversation
This should block on #7176 being merged. |
// Test GenerateOrdering for all common types. For each type, we construct random input rows that | ||
// contain two columns of that type, then for pairs of randomly-generated rows we check that | ||
// GenerateOrdering agrees with RowOrdering. | ||
(DataTypeTestUtils.atomicTypes ++ Set(NullType)).foreach { dataType => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is total overkill, but it's a neat example of how randomized data generation plus a list of types can be used for exploratory testing.
Merged build triggered. |
Merged build started. |
StructField("b", dataType, nullable = true) :: Nil) | ||
val toCatalyst = CatalystTypeConverters.createToCatalystConverter(rowType) | ||
// Sort ordering is not defined for NaN, so skip any random inputs that contain it: | ||
def isIncomparable(v: Any): Boolean = v match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While working on this, I discovered that RowOrdering
and GenerateOrdering
disagree for inputs containing NaN. This isn't a bug per-se, since many systems have undefined behavior when sorting on NaN. For this reason, I think that some databases prohibit NaN and Infinity from being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we might use sorting for clustering as part of a sort-based distinct operator, I wonder whether this has any bad implications for performing distinct on columns that contain NaN. Should we warn about this undefined behavior somewhere in our documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out that it's actually possible to crash the Sort
operator with "Comparison method violates its general contract!" errors if NaNs are present in the column being sorted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
damn
Test build #36352 has started for PR 7179 at commit |
// contain two columns of that type, then for pairs of randomly-generated rows we check that | ||
// GenerateOrdering agrees with RowOrdering. | ||
(DataTypeTestUtils.atomicTypes ++ Set(NullType)).foreach { dataType => | ||
test(s"GenerateOrdering with $dataType") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nesting of the loops here is slightly misleading, because we'll always report a passed test for types where we don't have a data generator. We at least test that we're able to generate code for the ordering even if we don't actually execute that code. Maybe this is an okay trade-off, but it's a concern to watch out for.
Test build #36352 has finished for PR 7179 at commit
|
Merged build finished. Test PASSed. |
@@ -185,6 +185,7 @@ class CodeGenContext { | |||
// use c1 - c2 may overflow | |||
case dt: DataType if isPrimitiveType(dt) => s"($c1 > $c2 ? 1 : $c1 < $c2 ? -1 : 0)" | |||
case BinaryType => s"org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary($c1, $c2)" | |||
case NullType => "0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @davies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Can you take the testing stuff out of this pr and merge this first? |
Merged build triggered. |
Merged build started. |
21325e2
to
6ef49a6
Compare
@rxin, done. |
LGTM/ |
Merged build triggered. |
Merged build started. |
Test build #36441 has started for PR 7179 at commit |
Merged build finished. Test FAILed. |
Test build #36441 has finished for PR 7179 at commit
|
Merged build finished. Test FAILed. |
Jenkins, retest this please. |
Merged build triggered. |
Merged build started. |
Test build #36445 has started for PR 7179 at commit |
Test build #36445 has finished for PR 7179 at commit
|
Merged build finished. Test PASSed. |
Thanks - merged. |
This fixes code generation for queries containing
ORDER BY NULL
. Previously, the generated code would fail to compile.