
Fix Spark UT issues in RapidsDataFrameAggregateSuite #10943

Merged

2 commits merged on Jun 8, 2024

Changes from all commits
@@ -19,12 +19,67 @@
spark-rapids-shim-json-lines ***/
package org.apache.spark.sql.rapids.suites

import org.apache.spark.sql.{DataFrameAggregateSuite, Row}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.rapids.utils.RapidsSQLTestsTrait
import org.apache.spark.sql.types._

class RapidsDataFrameAggregateSuite extends DataFrameAggregateSuite with RapidsSQLTestsTrait {
  import testImplicits._

  testRapids("collect functions") {
    val df = Seq((1, 2), (2, 2), (3, 4)).toDF("a", "b")
    checkAnswer(
      df.select(sort_array(collect_list($"a")), sort_array(collect_list($"b"))),
      Seq(Row(Seq(1, 2, 3), Seq(2, 2, 4)))
    )
    checkAnswer(
      df.select(sort_array(collect_set($"a")), sort_array(collect_set($"b"))),
      Seq(Row(Seq(1, 2, 3), Seq(2, 4)))
    )

    checkDataset(
      df.select(sort_array(collect_set($"a")).as("aSet")).as[Set[Int]],
      Set(1, 2, 3))
    checkDataset(
      df.select(sort_array(collect_set($"b")).as("bSet")).as[Set[Int]],
      Set(2, 4))
    checkDataset(
      df.select(sort_array(collect_set($"a")), sort_array(collect_set($"b")))
        .as[(Set[Int], Set[Int])], Seq(Set(1, 2, 3) -> Set(2, 4)): _*)
  }

  testRapids("collect functions structs") {
    val df = Seq((1, 2, 2), (2, 2, 2), (3, 4, 1))
      .toDF("a", "x", "y")
      .select($"a", struct($"x", $"y").as("b"))
    checkAnswer(
      df.select(sort_array(collect_list($"a")), sort_array(collect_list($"b"))),
      Seq(Row(Seq(1, 2, 3), Seq(Row(2, 2), Row(2, 2), Row(4, 1))))
    )
    checkAnswer(
      df.select(sort_array(collect_set($"a")), sort_array(collect_set($"b"))),
      Seq(Row(Seq(1, 2, 3), Seq(Row(2, 2), Row(4, 1))))
    )
  }

  testRapids("SPARK-17641: collect functions should not collect null values") {
    val df = Seq(("1", 2), (null, 2), ("1", 4)).toDF("a", "b")
    checkAnswer(
      df.select(sort_array(collect_list($"a")), sort_array(collect_list($"b"))),
      Seq(Row(Seq("1", "1"), Seq(2, 2, 4)))
    )
    checkAnswer(
      df.select(sort_array(collect_set($"a")), sort_array(collect_set($"b"))),
      Seq(Row(Seq("1"), Seq(2, 4)))
    )
  }

  testRapids("collect functions should be able to cast to array type with no null values") {
    val df = Seq(1, 2).toDF("a")
    checkAnswer(df.select(sort_array(collect_list("a")) cast ArrayType(IntegerType, false)),
      Seq(Row(Seq(1, 2))))
    checkAnswer(df.select(sort_array(collect_set("a")) cast ArrayType(FloatType, false)),
      Seq(Row(Seq(1.0, 2.0))))
  }
}
@@ -83,6 +83,7 @@ abstract class BackendTestSettings {
// or a description like "This simply can't work on GPU".
// It should never be "unknown" or "need investigation"
case class KNOWN_ISSUE(reason: String) extends ExcludeReason
case class ADJUST_UT(reason: String) extends ExcludeReason
Contributor:
What's the purpose of this vs. KNOWN_ISSUE? Are we intending to fix these? If so, we should file a tracking issue and use KNOWN_ISSUE with the issue URL as the description. If we're not intending to fix these, why not use KNOWN_ISSUE with the same description?

thirtiseven (Collaborator, author), Jun 5, 2024:
It marks test cases where the Spark UT case itself has an issue (something is wrong or does not work for the plugin), but we still want to test what it was meant to test by adjusting the case. Maybe a better name would be INVALID_CASE or SPARK_UT_ISSUE?

cc @binmahone

Collaborator:

I agree KNOWN_ISSUE is good enough. Why not file an issue for tolerating non-determinism by sort and reference it?

binmahone (Collaborator), Jun 6, 2024:

ADJUST_UT in our context means "the Spark test case will work for RAPIDS; there is no bug in RAPIDS, but the test case itself needs some modification". For example, a Spark test case might look like:

test("testcase1") {
  val x = spark.sql("select sum(x), y from testdata group by y").collect()
  // assertions hard-code a specific row order
  assert(x(0) == Row(100, "x0"))
  assert(x(1) == Row(200, "x1"))
}

Notice the operations and assertions are hard-coded in the test case.
We know that RAPIDS may return results in a different order, so the test case will fail for us.
From the framework's perspective, we have no way to ask it to sort before asserting results.
However, the test case is still meaningful, and we should enable it to increase test coverage.

This is where ADJUST_UT can help. we can adjust the above test case to a new one (and at the same time exclude the old test case with reason ADJUST_UT):

test("NEW testcase1") {
  val x = spark.sql("select sum(x), y from testdata group by y")
  // sort x to make the result deterministic
  ...
  // make assertions based on the deterministic result
  ...
}

By doing this, the test case testcase1 is considered "solved"; there will be NO follow-up issue, so it's not a known issue.

Based on our experience with Gluten, this type of case is very common, so I think it's necessary to add the ADJUST_UT enum.

What do you think? @jlowe @gerashegalov
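To make the pattern concrete, here is a Spark-free sketch of the normalization step such an adjusted test would perform. The object, case class, and sample values are hypothetical illustrations, not code from this PR:

```scala
// Hypothetical sketch of the ADJUST_UT pattern: normalize a result whose row
// order is engine-dependent before asserting on it. QueryRow and the sample
// data below stand in for the Spark query output in the comment above.
object AdjustUtSketch {
  final case class QueryRow(sumX: Long, y: String)

  // Sort by the group key `y` so the assertion no longer depends on row order.
  def deterministic(rows: Seq[QueryRow]): Seq[QueryRow] = rows.sortBy(_.y)

  def main(args: Array[String]): Unit = {
    // The same logical result, produced in two different orders by two engines.
    val cpuResult = Seq(QueryRow(100L, "x0"), QueryRow(200L, "x1"))
    val gpuResult = Seq(QueryRow(200L, "x1"), QueryRow(100L, "x0"))
    assert(deterministic(cpuResult) == deterministic(gpuResult))
    println("results match after normalization")
  }
}
```

The same idea underlies the sort_array calls in the suite above: sorting both sides removes the only legitimate source of divergence before comparing.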

Collaborator:

I am ok with this, I like the ADJUST_UT label. We can always go back and look at all the tests we adjusted (audit them).

Collaborator:

> I am ok with this, I like the ADJUST_UT label. We can always go back and look at all the tests we adjusted (audit them).

thx Alessandro

Collaborator:

I see, this makes sense to me. We can go with this for this PR.

I wonder: if we define test("testcase1") in our code, does it override the test in the base class? If so, we could just do the override and not need a special ADJUST_UT exclude tag.

Currently the framework does not have this feature (everything has to be explicit now). But I agree with you that this is a good idea. @HaoYang670 please raise a framework feature request if you also see potential value in this.

Another idea for collect_list-style issues with SQL: we could probably register our own UDAF as collect_list, which would either be a simple delegate to the real collect_list, or collect_list followed by a sort.

Yeah, this would be a test case where ADJUST_UT is ultimately NOT needed. However, the extra sort will be a performance hit. In my view, we can tolerate minor result differences from Vanilla Spark, as long as both are correct answers under the ANSI SQL standard. Do you think our team can reach consensus on this? @gerashegalov


gerashegalov (Collaborator), Jun 7, 2024:

> However, the extra sort will be a performance hit. In my view, we can tolerate minor result differences from Vanilla Spark, as long as both are correct answers under the ANSI SQL standard.

What I mean is overriding collect_list only in the test code, injecting the sort only in specially tagged tests where we know that the order variance is permissible. We don't need to pay an unnecessary performance penalty in production code. I am bringing this up as an idea to discuss for follow-up work. This PR is fine.
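Outside of Spark, that follow-up idea might look roughly like this. Everything here (the object, the function name, the tagging flag) is a hypothetical sketch of the delegate-or-sort behavior, not an implementation from the plugin:

```scala
// Hypothetical sketch: a test-only collect wrapper that returns the raw result
// untouched, or sorts it when the test is tagged as order-insensitive.
object TestCollectSketch {
  // Delegate to the raw result, or sort it when order variance is permissible.
  def collectForTest[T](values: Seq[T], orderInsensitive: Boolean)
                       (implicit ord: Ordering[T]): Seq[T] =
    if (orderInsensitive) values.sorted else values

  def main(args: Array[String]): Unit = {
    val gpuOrder = Seq(3, 1, 2) // rows as a GPU plan might emit them
    // Tagged test: compare against a sorted expectation.
    assert(collectForTest(gpuOrder, orderInsensitive = true) == Seq(1, 2, 3))
    // Untagged test: the raw order is preserved and asserted as-is.
    assert(collectForTest(gpuOrder, orderInsensitive = false) == Seq(3, 1, 2))
    println("ok")
  }
}
```

The design choice is that only tagged tests pay the sort cost, so production code paths and order-sensitive tests are unaffected.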

Collaborator:

hi @gerashegalov, I can roughly imagine what you're suggesting. Still, it would be good if you could bring it up as a formal proposal (for further discussion). For now, I'm not quite sure to what extent your idea can solve the inconsistency problem.

case class WONT_FIX_ISSUE(reason: String) extends ExcludeReason


@@ -35,11 +35,11 @@ class RapidsTestSettings extends BackendTestSettings {
   .exclude("casting to fixed-precision decimals", KNOWN_ISSUE("https://github.com/NVIDIA/spark-rapids/issues/10771"))
   .exclude("SPARK-32828: cast from a derived user-defined type to a base type", KNOWN_ISSUE("https://github.com/NVIDIA/spark-rapids/issues/10771"))
 enableSuite[RapidsDataFrameAggregateSuite]
-  .exclude("collect functions", KNOWN_ISSUE("https://github.com/NVIDIA/spark-rapids/issues/10772"))
-  .exclude("collect functions structs", KNOWN_ISSUE("https://github.com/NVIDIA/spark-rapids/issues/10772"))
-  .exclude("collect functions should be able to cast to array type with no null values", KNOWN_ISSUE("https://github.com/NVIDIA/spark-rapids/issues/10772"))
-  .exclude("SPARK-17641: collect functions should not collect null values", KNOWN_ISSUE("https://github.com/NVIDIA/spark-rapids/issues/10772"))
-  .exclude("SPARK-19471: AggregationIterator does not initialize the generated result projection before using it", KNOWN_ISSUE("https://github.com/NVIDIA/spark-rapids/issues/10772"))
+  .exclude("collect functions", ADJUST_UT("order of elements in the array is non-deterministic in collect"))
+  .exclude("collect functions structs", ADJUST_UT("order of elements in the array is non-deterministic in collect"))
+  .exclude("collect functions should be able to cast to array type with no null values", ADJUST_UT("order of elements in the array is non-deterministic in collect"))
+  .exclude("SPARK-17641: collect functions should not collect null values", ADJUST_UT("order of elements in the array is non-deterministic in collect"))
+  .exclude("SPARK-19471: AggregationIterator does not initialize the generated result projection before using it", WONT_FIX_ISSUE("Codegen related UT, not applicable for GPU"))
   .exclude("SPARK-24788: RelationalGroupedDataset.toString with unresolved exprs should not fail", KNOWN_ISSUE("https://github.com/NVIDIA/spark-rapids/issues/10801"))
 enableSuite[RapidsJsonExpressionsSuite]
   .exclude("from_json - invalid data", KNOWN_ISSUE("https://github.com/NVIDIA/spark-rapids/issues/10849"))