
[SPARK-43295][PS] Support string type columns for DataFrameGroupBy.sum #42798

Closed
itholic wants to merge 9 commits into master from SPARK-43295

Conversation

itholic (Contributor) commented Sep 4, 2023

What changes were proposed in this pull request?

This PR proposes to support string type columns for DataFrameGroupBy.sum.

Why are the changes needed?

To match the behavior of the latest pandas.

Does this PR introduce any user-facing change?

Yes, from now on DataFrameGroupBy.sum follows the behavior of the latest pandas, as below:

Test DataFrame

>>> psdf
   A    B  C      D
0  1  3.1  a   True
1  2  4.1  b  False
2  1  4.1  b  False
3  2  3.1  a   True

Before

>>> psdf.groupby("A").sum().sort_index()
     B  D
A
1  7.2  1
2  7.2  1

After

>>> psdf.groupby("A").sum().sort_index()
     B   C  D
A
1  7.2  ab  1
2  7.2  ba  1

How was this patch tested?

Updated the existing UTs to support string type columns.

Was this patch authored or co-authored using generative AI tooling?

No.

if sfun.__name__ == "sum" and isinstance(
    psdf._internal.spark_type_for(label), StringType
):
    output_scol = F.concat_ws("", F.collect_list(input_scol))
itholic (Contributor, Author):

We should use a combination of concat_ws and collect_list instead of sum to match the pandas behavior for string summation, as below:

>>> import pyspark.sql.functions as sf
>>> sdf.show()
+---+
|  A|
+---+
|  a|
|  b|
|  c|
+---+

# Using `sum` over a string type column returns `NULL`, which does not match pandas.
>>> sdf.select(sf.sum(sdf.A)).show()
+------+
|sum(A)|
+------+
|  NULL|
+------+

# Using a combination of `concat_ws` and `collect_list` matches the pandas behavior
>>> sdf.select(sf.concat_ws("", sf.collect_list(sdf.A))).show()
+----------------------------+
|concat_ws(, collect_list(A))|
+----------------------------+
|                         abc|
+----------------------------+

zhengruifeng (Contributor) commented:

@itholic I suspect the behavior is not deterministic: it depends on the internal order of collect_list.

To make it deterministic, I think we need to collect_list both the value and the index, and sort by the indices before concat_ws.
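
A minimal, self-contained sketch of this suggestion (the Spark session setup and the `idx` column name are illustrative, not the pandas-on-Spark internals):

# Collect (index, value) structs, then sort them, so the element order no
# longer depends on the order in which collect_list saw the rows.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(4, "a"), (3, "b"), (2, "c")], ["idx", "A"])

sdf.select(
    F.array_sort(F.collect_list(F.struct("idx", "A"))).alias("sorted_pairs")
).show(truncate=False)
# +------------------------+
# |sorted_pairs            |
# +------------------------+
# |[{2, c}, {3, b}, {4, a}]|
# +------------------------+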

itholic (Contributor, Author) commented Sep 5, 2023

@zhengruifeng I think the problem is that pandas computes the concatenation without sorting, so the result can be different when the index is not sorted, as below:

Problem

Pandas

>>> pdf
   A  B
4  a  1
3  b  2
2  c  3
>>> pdf.sum()
A    abc
B      6
dtype: object

Pandas API on Spark

>>> psdf
   A  B
4  a  1
3  b  2
2  c  3
>>> psdf.sum()
A    cba  # we internally sorted the index, so the result is different from Pandas
B      6
dtype: object

Solution

I think for now we can pick one of the three options below:

  1. Document a warning note as below:
    The result for string type columns is non-deterministic, since the implementation depends on the `collect_list` API from PySpark, which is non-deterministic as well.

  2. collect_list both the value and the index, sort by the indices before concat_ws as you suggested, and document a warning note as below:
    The result for string type columns can differ from pandas when the index is not sorted, because we always sort by index before computing; the underlying `collect_list` API from PySpark is otherwise non-deterministic.

  3. Keep string type columns unsupported as before, and add a note explaining why, as below:
    String type columns are not supported for now, because they might yield non-deterministic results, unlike in pandas.
    

WDYT? Also cc @HyukjinKwon, @ueshin, @xinrong-meng. What strategy should we take for this situation? I believe the same rules should apply to similar cases that already exist or may arise in the future.

@@ -910,7 +910,7 @@ def sum(self, numeric_only: Optional[bool] = True, min_count: int = 0) -> FrameL

Member:

I think you gotta fix the log above too since now we support strings too?

itholic (Contributor, Author):

Yeah, we should update it. Thanks for catching this!

if sfun.__name__ == "sum" and isinstance(
    psdf._internal.spark_type_for(label), StringType
):
    output_scol = F.concat_ws("", F.collect_list(input_scol))
Member:

Can we sort by natural order? We have the compute.ordered_head config. We could sort by natural order and then perform collect_list.

itholic (Contributor, Author):

I think maybe this is a different case from head?

In head, we do sdf = sdf.orderBy(NATURAL_ORDER_COLUMN_NAME) and then compute sdf.limit(n), so we can keep the order because DataFrame.limit doesn't shuffle the data.

But in this case, the data is shuffled again when computing collect_list, even after sorting the DataFrame by natural order in advance, so I think the order would not be guaranteed.

Please let me know if I missed something?
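
A small self-contained sketch of the contrast being described (the names here are illustrative, not the actual pandas-on-Spark internals):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(6).withColumn("order", F.monotonically_increasing_id())

# head-style: orderBy followed by limit keeps the order, because limit
# does not shuffle the data.
head3 = sdf.orderBy("order").limit(3)

# aggregation-style: groupBy repartitions by the grouping key, so a prior
# orderBy gives no guarantee about the element order collect_list observes.
agg = sdf.groupBy(F.lit(0)).agg(F.collect_list("id").alias("ids"))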

Contributor:

Since string columns are computed together with numerical ones, I think we have to compute the strings' sum as an aggregation:

F.concat_ws("", F.array_sort(
    F.collect_list(F.struct(NATURAL_ORDER_COLUMN_NAME, input_scol))
))

For struct types, array_sort sorts elements by the first field, then the second field, IIRC.

itholic (Contributor, Author):

Yeah, then maybe we should extract only the string column from the nested structs to pass as an argument to concat_ws?

F.concat_ws(
    "",
    F.array_sort(
        F.collect_list(F.struct(NATURAL_ORDER_COLUMN_NAME, input_scol))
    ).getField(input_scol_name),
)

itholic (Contributor, Author):

Just adjusted the comments. Thanks!

Contributor:

Yes, we should extract the string col.

"",
F.array_sort(
F.collect_list(F.struct(NATURAL_ORDER_COLUMN_NAME, input_scol))
).getField(input_scol_name),
Contributor:

I think you will need F.transform to extract the strings.

Otherwise, you can use F.reduce to directly concatenate the strings from the structs [<long, string>].

itholic (Contributor, Author):

Sounds good. Updated the code with transform. Thanks!
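
For reference, a minimal end-to-end sketch of the final shape (`__natural_order__` stands in for the internal NATURAL_ORDER_COLUMN_NAME; this is an illustration, not the exact patch):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [(1, "a"), (2, "b"), (1, "b"), (2, "a")], ["A", "C"]
).withColumn("__natural_order__", F.monotonically_increasing_id())

sdf.groupBy("A").agg(
    F.concat_ws(
        "",
        F.transform(
            # sort the structs by the natural-order field, then project
            # out only the string values for concatenation
            F.array_sort(F.collect_list(F.struct("__natural_order__", "C"))),
            lambda s: s["C"],
        ),
    ).alias("sum(C)")
).orderBy("A").show()
# +---+------+
# |  A|sum(C)|
# +---+------+
# |  1|    ab|
# |  2|    ba|
# +---+------+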

zhengruifeng (Contributor) commented:

merged to master

itholic deleted the SPARK-43295 branch November 20, 2023 01:36