Add support for nested types to collect_set(...) on the GPU [databricks] #6079
Conversation
Signed-off-by: Navin Kumar <navink@nvidia.com>
Signed-off-by: Navin Kumar <navink@nvidia.com>
Signed-off-by: Navin Kumar <navink@nvidia.com>
…ArrayType] Signed-off-by: Navin Kumar <navink@nvidia.com>
…collect set Signed-off-by: Navin Kumar <navink@nvidia.com>
Signed-off-by: Navin Kumar <navink@nvidia.com>
Maybe the title should be changed to "...support nested types..." instead, because now we can call collect_set(...) on nested types.
Adding some more tests and will change the title.
Signed-off-by: Navin Kumar <navink@nvidia.com>
build
…ollect_set on nested array types Signed-off-by: Navin Kumar <navink@nvidia.com>
Signed-off-by: Navin Kumar <navink@nvidia.com>
build
What is the behavior if there are NaNs?
build
We may get results inconsistent with the CPU results from Spark 3.1.3, as in this: #5958 (comment)
It doesn't consistently handle NaN equality when it comes to arrays that contain NaN values: for the same input, the CPU and the GPU produce different outputs.
Usually when we use non-nested float and double columns, NaN values are considered unequal, but when collecting sets of nested array types, the CPU treats NaNs as equal.
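A minimal sketch of the query shape where the mismatch shows up, with made-up data (the actual outputs from the discussion are not reproduced here):

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Two rows whose struct values differ only by containing NaN in the array.
df = spark.createDataFrame(
    [(1, ([float('nan')],)), (1, ([float('nan')],))],
    "key INT, s STRUCT<arr: ARRAY<FLOAT>>")

# The CPU treats the two NaN-bearing structs as equal and keeps a single
# set entry; the GPU build under review did not.
df.groupBy("key").agg(f.collect_set("s").alias("s_set")).show(truncate=False)
```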
If NaNs are not supported, then we need to document it and have the operators marked as incompat or hasNans.
```diff
@@ -949,6 +949,10 @@ def gen_scalars_for_sql(data_gen, count, seed=0, force_no_nulls=False):
 # all of the basic types in a single struct
 all_basic_struct_gen = StructGen([['child'+str(ind), sub_gen] for ind, sub_gen in enumerate(all_basic_gens)])
 
+all_basic_struct_gen_no_nan = StructGen([['child'+str(ind), sub_gen] for ind, sub_gen in enumerate(all_basic_gens_no_nan)])
+
+array_struct_gen = StructGen([['child'+str(ind), sub_gen] for ind, sub_gen in enumerate(single_level_array_gens_no_nan)])
```
Can we have no_nan in the name of this too? That way it is more clear what you are getting.
Is it best to use incompat or hasNans at this point?
I think hasNans would probably be best. It would give us the most fine-grained control over falling back to the CPU, so we can do it only in situations that we know are problematic. But then we need to shift the documentation from incompat to one of the PartialSupport notes or some other documentation.
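For reference, hasNans is a session-level plugin config; a minimal sketch of toggling it (the session setup is illustrative):

```python
# Assert to the plugin that the data contains no NaNs, so operators that
# cannot match CPU NaN-equality semantics can stay on the GPU.
spark.conf.set("spark.rapids.sql.hasNans", "false")
```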
…oat and incompat flag to remaining test Signed-off-by: Navin Kumar <navink@nvidia.com>
build
My only concern now is that we have marked collect_set as incompatible for all data types, even the ones where it would not be a problem. I am fine with this the way it is, because we have incompat operators turned on by default.
It looks like a number of tests are failing now because collect_set is marked as incompat. So either we update all of those tests or we have to go back and rethink how we are turning it off by default.
I can update the failing tests, but I did want to run this by you again: the incompatibility with NaNs is purely in the nested type support (when using a Struct[Array(float)], Struct[Array(double)], etc. sort of type). When it runs on pure float or double columns, NaNs work fine. With the hasNans configuration, as far as I understand, the assumption is that none of the input has NaNs in these columns, whether nested or not, so maybe this kind of defeats the purpose of the existing NaN tests (though I guess they would still work in spite of the config). With incompat, it feels like too broad a stroke. I am leaning towards moving away from incompat to hasNans and working on partial support notes, but I wanted to run this thought process by you.
I like the idea of using hasNans. We can even update the check so it will only fall back to the CPU for nested types that have float/double in them and hasNans is true. We can document that too.
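A minimal sketch of that check, with assumed names (the plugin has its own type-signature helpers; this is illustrative, not the merged code):

```scala
import org.apache.spark.sql.types._

// Fall back to the CPU only when the aggregated type nests a float/double
// somewhere inside it and spark.rapids.sql.hasNans is true.
def containsFloatingPoint(dt: DataType): Boolean = dt match {
  case FloatType | DoubleType => true
  case ArrayType(elementType, _) => containsFloatingPoint(elementType)
  case StructType(fields) => fields.exists(f => containsFloatingPoint(f.dataType))
  case MapType(keyType, valueType, _) =>
    containsFloatingPoint(keyType) || containsFloatingPoint(valueType)
  case _ => false
}
```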
…imate float and incompat flag to remaining test" This reverts commit 2dfe8ba. Signed-off-by: Navin Kumar <navink@nvidia.com>
…tors Signed-off-by: Navin Kumar <navink@nvidia.com>
Signed-off-by: Navin Kumar <navink@nvidia.com>
Signed-off-by: Navin Kumar <navink@nvidia.com>
build
```scala
private def isNestedArrayType(dt: DataType): Boolean = {
  dt match {
    case StructType(fields) => fields.exists(_.dataType.isInstanceOf[ArrayType])
    // ... remaining cases elided in this excerpt
  }
}
```
What if I have a struct of a struct with an array in it? I think this needs to be a recursive call.
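A minimal sketch of the recursive version being asked for (an assumed shape, not necessarily the code that was ultimately merged):

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

// Recurse so that an array nested at any depth (e.g. a struct of a struct
// of an array) is detected, not just an array one level down.
private def isNestedArrayType(dt: DataType): Boolean = dt match {
  case StructType(fields) =>
    fields.exists { f =>
      f.dataType.isInstanceOf[ArrayType] || isNestedArrayType(f.dataType)
    }
  case ArrayType(elementType, _) => isNestedArrayType(elementType)
  case _ => false
}
```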
…rrays Signed-off-by: Navin Kumar <navink@nvidia.com>
build
There is currently an optimization in Spark 3.3 that affects the integration test I'm using for nested types: SortArray gets wrapped around the CollectSet in the HashAggregate. I explicitly separate these out in the test so that SortArray runs on the CPU (since it supports these complex types and the GPU does not).
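A minimal sketch of that test shape, assuming the spark-rapids integration-test helpers (assert_gpu_and_cpu_are_equal_collect, gen_df, allow_non_gpu); the generator choices and names are illustrative, not the PR's exact test:

```python
import pyspark.sql.functions as f
from asserts import assert_gpu_and_cpu_are_equal_collect
from data_gen import gen_df, int_gen, float_gen, ArrayGen, StructGen
from marks import allow_non_gpu, ignore_order

@ignore_order(local=True)
@allow_non_gpu('SortArray')  # sort_array falls back to the CPU for nested types
def test_collect_set_struct_of_array():
    struct_gen = StructGen([['child0', ArrayGen(float_gen)]], nullable=False)
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: gen_df(spark, [('a', int_gen), ('b', struct_gen)], length=512)
            .groupby('a')
            # collect_set runs on the GPU; sort_array only makes the output
            # ordering deterministic for the CPU/GPU comparison.
            .agg(f.sort_array(f.collect_set('b')).alias('collected')))
```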
…le Spark 3.3.0+ optimization Signed-off-by: Navin Kumar <navink@nvidia.com>
build
…ricks Signed-off-by: Navin Kumar <navink@nvidia.com>
build
Fixes #5508
This adds support for Struct[Array] types in GpuCollectSet, building on the support added in cuDF in rapidsai/cudf#11228. A couple of caveats:
It does not support Map types as input, since CPU Spark does not support map type data either at the moment (see https://github.com/apache/spark/blob/58e07e0f4cca1e3a6387a7e0c57faeb6c5ec9ef5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180).
It does not yet support NaNs in Struct[Array(Double)] or Struct[Array(Float)] types.
To test the output, sort_array is forced to run on the CPU because we don't have an implementation of it for nested types that runs on the GPU.
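For illustration, a hypothetical query this PR enables on the GPU (the column names and data are made up; it assumes the plugin is enabled and the NaN caveat above is respected):

```python
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Struct values that contain arrays; duplicates should collapse to a
# single set entry per distinct struct.
df = spark.createDataFrame(
    [(1, ([1.0, 2.0],)), (1, ([1.0, 2.0],)), (1, ([3.0],))],
    "key INT, s STRUCT<arr: ARRAY<DOUBLE>>")

df.groupBy("key").agg(f.collect_set("s").alias("distinct_structs")).show(truncate=False)
```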