
Add support for nested types to collect_set(...) on the GPU [databricks] #6079

Merged: 17 commits, Aug 5, 2022

Conversation

NVnavkumar (Collaborator) commented Jul 25, 2022

Fixes #5508

This adds support for Struct[Array] types in GpuCollectSet, building on the support added to cuDF in rapidsai/cudf#11228. A couple of caveats:

  1. It does not support Map types as input, since Spark on the CPU does not currently support map-typed data either. (See https://github.com/apache/spark/blob/58e07e0f4cca1e3a6387a7e0c57faeb6c5ec9ef5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180)

  2. It does not yet support NaNs in struct[Array(Double)] or struct[Array(Float)] types.

  3. To test the output, sort_array is forced to run on the CPU, because the GPU implementation does not support nested types.

@NVnavkumar NVnavkumar self-assigned this Jul 25, 2022
@jlowe jlowe added this to the July 22 - Aug 5 milestone Jul 25, 2022
ttnghia (Collaborator) commented Jul 25, 2022

Maybe the title should be changed to "...support nested types..." instead, because we can now call collect_set on many kinds of nested data: structs of arrays, arrays of structs, arrays of arrays, etc.

NVnavkumar (Collaborator, Author) commented:

> Maybe the title should be changed to "...support nested types...." instead, because now we can call collect_set on many nested data like structs of arrays, arrays of structs, arrays of arrays etc.

Adding some more tests and will change the title.

@NVnavkumar NVnavkumar marked this pull request as ready for review July 25, 2022 22:32
NVnavkumar (Collaborator, Author) commented:

build

@sameerz sameerz added the feature request New feature or request label Jul 26, 2022
@NVnavkumar NVnavkumar changed the title Add support for Struct[Array] types to collect_set(...) on the GPU Add support for nested types to collect_set(...) on the GPU Jul 27, 2022
NVnavkumar (Collaborator, Author) commented:

build

ttnghia (Collaborator) commented Jul 28, 2022

build

abellina (Collaborator) commented:

> It currently does not yet support NaNs in struct[Array(Double)] or struct[Array(Float)] types at the moment.

What is the behavior if there are NaNs?

ttnghia (Collaborator) commented Jul 28, 2022

build

ttnghia (Collaborator) commented Jul 29, 2022

> What is the behavior if there are NaNs?

We may get results inconsistent with the CPU results from Spark 3.1.3, as in #5958 (comment).

NVnavkumar (Collaborator, Author) commented:

> It currently does not yet support NaNs in struct[Array(Double)] or struct[Array(Float)] types at the moment.

> What is the behavior if there are NaNs?

It doesn't consistently handle NaN equality for arrays that contain NaN values. For example, the CPU will output:

Row(sort_array(collect_set, true)=[Row(child0=[]), Row(child0=[nan, nan, nan, nan])])

And the GPU will output:

Row(sort_array(collect_set, true)=[Row(child0=[]), Row(child0=[nan, nan, nan, nan]), Row(child0=[nan, nan, nan, nan])])

Usually, for non-nested float and double columns, NaN values are considered unequal; but when collecting sets of nested arrays, the CPU treats NaNs as equal to each other.
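The mismatch above can be sketched with a small, self-contained Python model (illustrative only; `rows_equal` and `collect_set_like` are hypothetical names, not plugin code). Toggling whether NaN compares equal to NaN reproduces the 2-row CPU result versus the 3-row GPU result:

```python
import math

def rows_equal(a, b, nan_equal):
    # Structural equality over nested lists of floats; optionally NaN == NaN.
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(
            rows_equal(x, y, nan_equal) for x, y in zip(a, b))
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return nan_equal  # CPU-like nested behavior: True; GPU-like: False
        return a == b
    return a == b

def collect_set_like(rows, nan_equal):
    # Toy collect_set: keep each row not already present under rows_equal.
    out = []
    for r in rows:
        if not any(rows_equal(r, s, nan_equal) for s in out):
            out.append(r)
    return out

nan = float("nan")
rows = [[], [nan, nan, nan, nan], [nan, nan, nan, nan]]
cpu_like = collect_set_like(rows, nan_equal=True)   # NaN arrays merge: 2 rows
gpu_like = collect_set_like(rows, nan_equal=False)  # duplicate survives: 3 rows
```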

revans2 (Collaborator) left a comment:

If NaNs are not supported, then we need to document it and have the operators marked as incompat or hasNaNs.

@@ -949,6 +949,10 @@ def gen_scalars_for_sql(data_gen, count, seed=0, force_no_nulls=False):
# all of the basic types in a single struct
all_basic_struct_gen = StructGen([['child'+str(ind), sub_gen] for ind, sub_gen in enumerate(all_basic_gens)])

all_basic_struct_gen_no_nan = StructGen([['child'+str(ind), sub_gen] for ind, sub_gen in enumerate(all_basic_gens_no_nan)])

array_struct_gen = StructGen([['child'+str(ind), sub_gen] for ind, sub_gen in enumerate(single_level_array_gens_no_nan)])
Review comment (Collaborator):

Can we have no_nan in the name of this too? That way it is more clear what you are getting.

Reply (Collaborator, Author):

Is it best to use incompat or hasNans at this point?

Reply (Collaborator):

I think hasNans would probably be best. It would give us the most fine-grained control over falling back to the CPU, so we can do it only in situations that we know are problematic. But then we need to shift the documentation from incompat to one of the PartialSupport notes or to some other documentation.

NVnavkumar (Collaborator, Author) commented:

build

revans2 previously approved these changes Aug 2, 2022
revans2 (Collaborator) left a comment:

My only concern now is that we have marked collect_set as incompatible for all data types, even the ones where it would not be a problem. I am fine with this the way it is, because we have incompat operators turned on by default.

revans2 (Collaborator) commented Aug 2, 2022

It looks like a number of tests are failing now because collect_set is marked as incompat. So either we update all of those tests or we have to go back and rethink how we are turning it off by default.

NVnavkumar (Collaborator, Author) commented:

> It looks like a number of tests are failing now because collect_set is marked as incompat. So either we update all of those tests or we have to go back and rethink how we are turning it off by default.

I can update the failing tests, but I did want to run this by you again: the incompatibility with NaNs is purely in the nested type support (when using a struct[Array(float)], struct[Array(double)], etc. sort of type). On plain float or double columns, NaNs work fine. With the hasNans configuration, as far as I understand, the assumption is that none of the input columns contain NaNs, whether nested or not, so this may defeat the purpose of the existing NaN tests (though I suppose they would still work in spite of the config). With incompat, it feels like too broad a stroke here.

I am leaning towards moving this away from incompat to hasNans and working on partial support notes, but I wanted to run this thought process by you.

revans2 (Collaborator) commented Aug 2, 2022

I like the idea of using hasNans. We can even update the check so it will only fall back to the CPU for nested types that have float/double in them and hasNans is true. We can document that too.
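The rule described above could look roughly like the following Python sketch. The type classes and function names here are hypothetical stand-ins (the real check lives in the plugin's Scala code); the point is just the shape of the decision: fall back only when NaNs may be present and the type is nested with float/double inside.

```python
# Hypothetical stand-ins for Spark SQL data types (illustrative names only).
class FloatType: pass
class DoubleType: pass
class ArrayType:
    def __init__(self, element_type):
        self.element_type = element_type
class StructType:
    def __init__(self, field_types):
        self.field_types = field_types

def has_floats(dt):
    # True if the type contains float/double at any nesting level.
    if isinstance(dt, (FloatType, DoubleType)):
        return True
    if isinstance(dt, ArrayType):
        return has_floats(dt.element_type)
    if isinstance(dt, StructType):
        return any(has_floats(f) for f in dt.field_types)
    return False

def should_fall_back(dt, has_nans):
    # Fall back to the CPU only when NaNs may be present AND the type is
    # nested AND it contains float/double somewhere inside it.
    nested = isinstance(dt, (ArrayType, StructType))
    return has_nans and nested and has_floats(dt)
```

Under this rule, plain float/double columns and nested types without floating point stay on the GPU even when hasNans is true.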

@revans2 revans2 closed this Aug 2, 2022
@revans2 revans2 reopened this Aug 2, 2022
NVnavkumar (Collaborator, Author) commented:

build


private def isNestedArrayType(dt: DataType): Boolean = {
  dt match {
    case StructType(fields) => fields.exists(_.dataType.isInstanceOf[ArrayType])
Review comment (Collaborator):

What if I have a struct of a struct with an array in it? I think this needs to be a recursive call.
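A recursive check of the kind the comment asks for might look like this Python sketch (hypothetical stand-in type classes; the actual plugin code is Scala). Recursing through struct, array, and map children means a struct of a struct with an array inside is detected too:

```python
# Hypothetical stand-ins for Spark SQL DataType classes, just to sketch
# the recursion; names are illustrative only.
class DoubleType: pass
class ArrayType:
    def __init__(self, element_type):
        self.element_type = element_type
class MapType:
    def __init__(self, key_type, value_type):
        self.key_type, self.value_type = key_type, value_type
class StructType:
    def __init__(self, field_types):
        self.field_types = field_types

def contains_array(dt):
    # Recursively check whether a type contains an ArrayType anywhere,
    # so Struct(Struct(Array(...))) is also detected.
    if isinstance(dt, ArrayType):
        return True
    if isinstance(dt, StructType):
        return any(contains_array(f) for f in dt.field_types)
    if isinstance(dt, MapType):
        return contains_array(dt.key_type) or contains_array(dt.value_type)
    return False
```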

NVnavkumar (Collaborator, Author) commented:

build

NVnavkumar (Collaborator, Author) commented:

There is an optimization in Spark 3.3 that affects the integration test I'm using for nested types: SortArray gets wrapped around the CollectSet in the HashAggregate. I explicitly separate these in the test so that SortArray runs on the CPU (since it supports these complex types and the GPU does not).

@NVnavkumar NVnavkumar changed the title Add support for nested types to collect_set(...) on the GPU Add support for nested types to collect_set(...) on the GPU [databricks] Aug 5, 2022
NVnavkumar (Collaborator, Author) commented:

build

NVnavkumar (Collaborator, Author) commented:

build

Labels
feature request New feature or request
Successfully merging this pull request may close these issues.

[FEA] collect_set on struct[Array]