
Add support for nested types to collect_set(...) on the GPU [databricks] #6079

Merged: 17 commits, Aug 5, 2022

Conversation

NVnavkumar (Collaborator) commented Jul 25, 2022

Fixes #5508

This adds support for Struct[Array] types in GpuCollectSet, building on the support added to cuDF in rapidsai/cudf#11228. A couple of caveats:

  1. It does not support Map types as input, since Spark on the CPU does not currently support map-typed data either. (See https://github.com/apache/spark/blob/58e07e0f4cca1e3a6387a7e0c57faeb6c5ec9ef5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L180)

  2. It does not yet support NaNs in struct[Array(Double)] or struct[Array(Float)] types.

  3. To test the output, sort_array is forced to run on the CPU, because the GPU implementation does not support nested types.

@NVnavkumar NVnavkumar self-assigned this Jul 25, 2022
@jlowe jlowe added this to the July 22 - Aug 5 milestone Jul 25, 2022
ttnghia (Collaborator) commented Jul 25, 2022

Maybe the title should be changed to "...support nested types..." instead, because we can now call collect_set on many kinds of nested data: structs of arrays, arrays of structs, arrays of arrays, etc.

NVnavkumar (Collaborator, Author) commented:

> Maybe the title should be changed to "...support nested types...." instead, because now we can call collect_set on many nested data like structs of arrays, arrays of structs, arrays of arrays etc.

Adding some more tests and will change the title.

@NVnavkumar NVnavkumar marked this pull request as ready for review July 25, 2022 22:32
NVnavkumar (Collaborator, Author) commented:

build

@sameerz sameerz added the feature request New feature or request label Jul 26, 2022
@NVnavkumar NVnavkumar changed the title Add support for Struct[Array] types to collect_set(...) on the GPU Add support for nested types to collect_set(...) on the GPU Jul 27, 2022
NVnavkumar (Collaborator, Author) commented:

build

ttnghia (Collaborator) commented Jul 28, 2022

build

abellina (Collaborator) commented:

> It currently does not yet support NaNs in struct[Array(Double)] or struct[Array(Float)] types at the moment.

What is the behavior if there are NaNs?

ttnghia (Collaborator) commented Jul 28, 2022

build

ttnghia (Collaborator) commented Jul 29, 2022

> What is the behavior if there are NaNs?

We may get results inconsistent with the CPU results from Spark 3.1.3, as in #5958 (comment).

NVnavkumar (Collaborator, Author) commented:

> It currently does not yet support NaNs in struct[Array(Double)] or struct[Array(Float)] types at the moment.

> What is the behavior if there are NaNs?

It doesn't consistently handle NaN equality for arrays that contain NaN values. For example, the CPU will output:

Row(sort_array(collect_set, true)=[Row(child0=[]), Row(child0=[nan, nan, nan, nan])])

And the GPU will output:

Row(sort_array(collect_set, true)=[Row(child0=[]), Row(child0=[nan, nan, nan, nan]), Row(child0=[nan, nan, nan, nan])])

Usually, for non-nested float and double columns, NaN values are considered unequal; but when collecting sets of nested arrays, the CPU treats NaNs as equal to each other.
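The mismatch above can be sketched with a small, self-contained Python model (illustrative only; `rows_equal` and `collect_set_like` are hypothetical names, not plugin code). Toggling whether NaN compares equal to NaN reproduces the 2-row CPU result versus the 3-row GPU result:

```python
import math

def rows_equal(a, b, nan_equal):
    # Structural equality over nested lists of floats; optionally NaN == NaN.
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(
            rows_equal(x, y, nan_equal) for x, y in zip(a, b))
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return nan_equal  # CPU-like nested behavior: True; GPU-like: False
        return a == b
    return a == b

def collect_set_like(rows, nan_equal):
    # Toy collect_set: keep each row not already present under rows_equal.
    out = []
    for r in rows:
        if not any(rows_equal(r, s, nan_equal) for s in out):
            out.append(r)
    return out

nan = float("nan")
rows = [[], [nan, nan, nan, nan], [nan, nan, nan, nan]]
cpu_like = collect_set_like(rows, nan_equal=True)   # NaN arrays merge: 2 rows
gpu_like = collect_set_like(rows, nan_equal=False)  # duplicate survives: 3 rows
```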

revans2 (Collaborator) left a comment:

If NaNs are not supported, then we need to document it and have the operators marked as incompat or hasNaNs.

@@ -949,6 +949,10 @@ def gen_scalars_for_sql(data_gen, count, seed=0, force_no_nulls=False):
# all of the basic types in a single struct
all_basic_struct_gen = StructGen([['child'+str(ind), sub_gen] for ind, sub_gen in enumerate(all_basic_gens)])

all_basic_struct_gen_no_nan = StructGen([['child'+str(ind), sub_gen] for ind, sub_gen in enumerate(all_basic_gens_no_nan)])

array_struct_gen = StructGen([['child'+str(ind), sub_gen] for ind, sub_gen in enumerate(single_level_array_gens_no_nan)])
Review comment (Collaborator):

Can we have no_nan in the name of this too? That way it is more clear what you are getting.

Reply (Collaborator, Author):

Is it best to use incompat or hasNans at this point?

Reply (Collaborator):

I think hasNans would probably be best. It would give us the most fine-grained control over falling back to the CPU, so we can do it only in situations that we know are problematic. But then we need to shift the documentation from incompat to one of the PartialSupport notes or to some other documentation.

NVnavkumar (Collaborator, Author) commented:

build

revans2 previously approved these changes Aug 2, 2022
revans2 (Collaborator) left a comment:

My only concern now is that we have marked collect_set as incompatible for all data types, even the ones where it would not be a problem. I am fine with this the way it is, because we have incompat operators turned on by default.

revans2 (Collaborator) commented Aug 2, 2022

It looks like a number of tests are failing now because collect_set is marked as incompat. So either we update all of those tests or we have to go back and rethink how we are turning it off by default.

NVnavkumar (Collaborator, Author) commented:

> It looks like a number of tests are failing now because collect_set is marked as incompat. So either we update all of those tests or we have to go back and rethink how we are turning it off by default.

I can update the failing tests, but I did want to run this by you again: the incompatibility with NaNs is purely in the nested type support (when using a struct[Array(float)], struct[Array(double)], etc. sort of type). On plain float or double columns, NaNs work fine. With the hasNans configuration, as far as I understand, the assumption is that none of the input columns contain NaNs, whether nested or not, so this may defeat the purpose of the existing NaN tests (though I suppose they would still work in spite of the config). With incompat, it feels like too broad a stroke here.

I am leaning towards moving this away from incompat to hasNans and working on partial support notes, but I wanted to run this thought process by you.

revans2 (Collaborator) commented Aug 2, 2022

I like the idea of using hasNans. We can even update the check so it will only fall back to the CPU for nested types that have float/double in them and hasNans is true. We can document that too.
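The rule described above could look roughly like the following Python sketch. The type classes and function names here are hypothetical stand-ins (the real check lives in the plugin's Scala code); the point is just the shape of the decision: fall back only when NaNs may be present and the type is nested with float/double inside.

```python
# Hypothetical stand-ins for Spark SQL data types (illustrative names only).
class FloatType: pass
class DoubleType: pass
class ArrayType:
    def __init__(self, element_type):
        self.element_type = element_type
class StructType:
    def __init__(self, field_types):
        self.field_types = field_types

def has_floats(dt):
    # True if the type contains float/double at any nesting level.
    if isinstance(dt, (FloatType, DoubleType)):
        return True
    if isinstance(dt, ArrayType):
        return has_floats(dt.element_type)
    if isinstance(dt, StructType):
        return any(has_floats(f) for f in dt.field_types)
    return False

def should_fall_back(dt, has_nans):
    # Fall back to the CPU only when NaNs may be present AND the type is
    # nested AND it contains float/double somewhere inside it.
    nested = isinstance(dt, (ArrayType, StructType))
    return has_nans and nested and has_floats(dt)
```

Under this rule, plain float/double columns and nested types without floating point stay on the GPU even when hasNans is true.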

@revans2 revans2 closed this Aug 2, 2022
@revans2 revans2 reopened this Aug 2, 2022
NVnavkumar (Collaborator, Author) commented:

build


private def isNestedArrayType(dt: DataType): Boolean = {
  dt match {
    case StructType(fields) => fields.exists(_.dataType.isInstanceOf[ArrayType])
Review comment (Collaborator):

What if I have a struct of a struct with an array in it? I think this needs to be a recursive call.
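A recursive check of the kind the comment asks for might look like this Python sketch (hypothetical stand-in type classes; the actual plugin code is Scala). Recursing through struct, array, and map children means a struct of a struct with an array inside is detected too:

```python
# Hypothetical stand-ins for Spark SQL DataType classes, just to sketch
# the recursion; names are illustrative only.
class DoubleType: pass
class ArrayType:
    def __init__(self, element_type):
        self.element_type = element_type
class MapType:
    def __init__(self, key_type, value_type):
        self.key_type, self.value_type = key_type, value_type
class StructType:
    def __init__(self, field_types):
        self.field_types = field_types

def contains_array(dt):
    # Recursively check whether a type contains an ArrayType anywhere,
    # so Struct(Struct(Array(...))) is also detected.
    if isinstance(dt, ArrayType):
        return True
    if isinstance(dt, StructType):
        return any(contains_array(f) for f in dt.field_types)
    if isinstance(dt, MapType):
        return contains_array(dt.key_type) or contains_array(dt.value_type)
    return False
```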

NVnavkumar (Collaborator, Author) commented:

build

NVnavkumar (Collaborator, Author) commented:

There is an optimization in Spark 3.3 that affects the integration test I'm using for nested types: SortArray gets wrapped around the CollectSet in the HashAggregate. I explicitly separate these in the test so that SortArray runs on the CPU (since it supports these complex types and the GPU does not).

@NVnavkumar NVnavkumar changed the title Add support for nested types to collect_set(...) on the GPU Add support for nested types to collect_set(...) on the GPU [databricks] Aug 5, 2022
NVnavkumar (Collaborator, Author) commented:

build

NVnavkumar (Collaborator, Author) commented:

build

Labels
feature request New feature or request
Successfully merging this pull request may close these issues.

[FEA] collect_set on struct[Array]