[SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas #31429

Ngone51 · 2021-02-02T03:01:37Z

What changes were proposed in this pull request?

Resolve duplicate attributes for FlatMapCoGroupsInPandas.

Why are the changes needed?

When performing self-join on top of FlatMapCoGroupsInPandas, analysis can fail because of conflicting attributes. For example,

df = spark.createDataFrame([(1, 1)], ("column", "value"))
row = df.groupby("ColUmn").cogroup(
    df.groupby("COLUMN")
).applyInPandas(lambda r, l: r + l, "column long, value long")
row.join(row).show()

error:

...
Conflicting attributes: column#163321L,value#163322L
;;
’Join Inner
:- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L]
:  :- Project [ColUmn#163312L, column#163312L, value#163313L]
:  :  +- LogicalRDD [column#163312L, value#163313L], false
:  +- Project [COLUMN#163312L, column#163312L, value#163313L]
:     +- LogicalRDD [column#163312L, value#163313L], false
+- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L]
   :- Project [ColUmn#163312L, column#163312L, value#163313L]
   :  +- LogicalRDD [column#163312L, value#163313L], false
   +- Project [COLUMN#163312L, column#163312L, value#163313L]
      +- LogicalRDD [column#163312L, value#163313L], false
...

Does this PR introduce any user-facing change?

yes, the query like the above example won't fail.

How was this patch tested?

Adde unit tests.

Ngone51 · 2021-02-02T03:03:32Z

cc @d80tb7 @cloud-fan @HyukjinKwon Please take a look, thanks!

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

SparkQA · 2021-02-02T03:57:14Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39336/

SparkQA · 2021-02-02T04:15:35Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39336/

python/pyspark/sql/tests/test_pandas_map.py

HyukjinKwon

Looks making sense

address suggestion Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>

SparkQA · 2021-02-02T06:56:16Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39346/

SparkQA · 2021-02-02T07:24:17Z

Test build #134753 has finished for PR 31429 at commit 0ab2b7a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-02-02T07:25:41Z

Merged to master, branch-3.1 and branch-3.0.

…Pandas/MapInPandas ### What changes were proposed in this pull request? Resolve duplicate attributes for `FlatMapCoGroupsInPandas`. ### Why are the changes needed? When performing self-join on top of `FlatMapCoGroupsInPandas`, analysis can fail because of conflicting attributes. For example, ```scala df = spark.createDataFrame([(1, 1)], ("column", "value")) row = df.groupby("ColUmn").cogroup( df.groupby("COLUMN") ).applyInPandas(lambda r, l: r + l, "column long, value long") row.join(row).show() ``` error: ```scala ... Conflicting attributes: column#163321L,value#163322L ;; ’Join Inner :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L] : :- Project [ColUmn#163312L, column#163312L, value#163313L] : : +- LogicalRDD [column#163312L, value#163313L], false : +- Project [COLUMN#163312L, column#163312L, value#163313L] : +- LogicalRDD [column#163312L, value#163313L], false +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L] :- Project [ColUmn#163312L, column#163312L, value#163313L] : +- LogicalRDD [column#163312L, value#163313L], false +- Project [COLUMN#163312L, column#163312L, value#163313L] +- LogicalRDD [column#163312L, value#163313L], false ... ``` ### Does this PR introduce _any_ user-facing change? yes, the query like the above example won't fail. ### How was this patch tested? Adde unit tests. Closes #31429 from Ngone51/fix-conflcting-attrs-of-FlatMapCoGroupsInPandas. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: wuyi <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit e9362c2) Signed-off-by: HyukjinKwon <gurwls223@apache.org>

Ngone51 · 2021-02-02T07:26:47Z

thanks all!

…Pandas/MapInPandas ### What changes were proposed in this pull request? Resolve duplicate attributes for `FlatMapCoGroupsInPandas`. ### Why are the changes needed? When performing self-join on top of `FlatMapCoGroupsInPandas`, analysis can fail because of conflicting attributes. For example, ```scala df = spark.createDataFrame([(1, 1)], ("column", "value")) row = df.groupby("ColUmn").cogroup( df.groupby("COLUMN") ).applyInPandas(lambda r, l: r + l, "column long, value long") row.join(row).show() ``` error: ```scala ... Conflicting attributes: column#163321L,value#163322L ;; ’Join Inner :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L] : :- Project [ColUmn#163312L, column#163312L, value#163313L] : : +- LogicalRDD [column#163312L, value#163313L], false : +- Project [COLUMN#163312L, column#163312L, value#163313L] : +- LogicalRDD [column#163312L, value#163313L], false +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L] :- Project [ColUmn#163312L, column#163312L, value#163313L] : +- LogicalRDD [column#163312L, value#163313L], false +- Project [COLUMN#163312L, column#163312L, value#163313L] +- LogicalRDD [column#163312L, value#163313L], false ... ``` ### Does this PR introduce _any_ user-facing change? yes, the query like the above example won't fail. ### How was this patch tested? Adde unit tests. Closes #31429 from Ngone51/fix-conflcting-attrs-of-FlatMapCoGroupsInPandas. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: wuyi <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit e9362c2) Signed-off-by: HyukjinKwon <gurwls223@apache.org>

SparkQA · 2021-02-02T08:26:44Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39346/

d80tb7

bit late to this but lgtm- thanks for fixing!

SparkQA · 2021-02-02T09:40:59Z

Test build #134759 has finished for PR 31429 at commit 11a6e24.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-02-02T10:19:48Z

Test build #134761 has finished for PR 31429 at commit 5758c77.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…Pandas/MapInPandas ### What changes were proposed in this pull request? Resolve duplicate attributes for `FlatMapCoGroupsInPandas`. ### Why are the changes needed? When performing self-join on top of `FlatMapCoGroupsInPandas`, analysis can fail because of conflicting attributes. For example, ```scala df = spark.createDataFrame([(1, 1)], ("column", "value")) row = df.groupby("ColUmn").cogroup( df.groupby("COLUMN") ).applyInPandas(lambda r, l: r + l, "column long, value long") row.join(row).show() ``` error: ```scala ... Conflicting attributes: column#163321L,value#163322L ;; ’Join Inner :- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L] : :- Project [ColUmn#163312L, column#163312L, value#163313L] : : +- LogicalRDD [column#163312L, value#163313L], false : +- Project [COLUMN#163312L, column#163312L, value#163313L] : +- LogicalRDD [column#163312L, value#163313L], false +- FlatMapCoGroupsInPandas [ColUmn#163312L], [COLUMN#163312L], <lambda>(column#163312L, value#163313L, column#163312L, value#163313L), [column#163321L, value#163322L] :- Project [ColUmn#163312L, column#163312L, value#163313L] : +- LogicalRDD [column#163312L, value#163313L], false +- Project [COLUMN#163312L, column#163312L, value#163313L] +- LogicalRDD [column#163312L, value#163313L], false ... ``` ### Does this PR introduce _any_ user-facing change? yes, the query like the above example won't fail. ### How was this patch tested? Adde unit tests. Closes apache#31429 from Ngone51/fix-conflcting-attrs-of-FlatMapCoGroupsInPandas. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: wuyi <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

### What changes were proposed in this pull request? Fix self-join after `applyInArrow`, the same issue of `applyInPandas` was fixed in #31429 ### Why are the changes needed? bug fix before: ``` In [1]: import pyarrow as pa In [2]: df = spark.createDataFrame([(1, 1)], ("k", "v")) In [3]: def arrow_func(key, table): ...: return pa.Table.from_pydict({"x": [2], "y": [2]}) ...: In [4]: df2 = df.groupby("k").applyInArrow(arrow_func, schema="x long, y long") In [5]: df2.show() 24/12/04 17:47:43 WARN CheckAllocator: More than one DefaultAllocationManager on classpath. Choosing first found +---+---+ | x| y| +---+---+ | 2| 2| +---+---+ In [6]: df2.join(df2) ... Failure when resolving conflicting references in Join: 'Join Inner :- FlatMapGroupsInArrow [k#0L], arrow_func(k#0L, v#1L)#2, [x#3L, y#4L] : +- Project [k#0L, k#0L, v#1L] : +- LogicalRDD [k#0L, v#1L], false +- FlatMapGroupsInArrow [k#12L], arrow_func(k#12L, v#13L)#2, [x#3L, y#4L] +- Project [k#12L, k#12L, v#13L] +- LogicalRDD [k#12L, v#13L], false Conflicting attributes: "x", "y". SQLSTATE: XX000 at org.apache.spark.SparkException$.internalError(SparkException.scala:92) at org.apache.spark.SparkException$.internalError(SparkException.scala:79) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:798) ``` after: ``` In [6]: df2.join(df2) Out[6]: DataFrame[x: bigint, y: bigint, x: bigint, y: bigint] In [7]: df2.join(df2).show() +---+---+---+---+ | x| y| x| y| +---+---+---+---+ | 2| 2| 2| 2| +---+---+---+---+ ``` ### Does this PR introduce _any_ user-facing change? bug fix ### How was this patch tested? added tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #49056 from zhengruifeng/fix_arrow_join. Authored-by: Ruifeng Zheng <ruifengz@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

fix

0ab2b7a

github-actions bot added CORE PYTHON SQL labels Feb 2, 2021

ueshin reviewed Feb 2, 2021

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala Show resolved Hide resolved

fix MapInPandas

11a6e24

HyukjinKwon reviewed Feb 2, 2021

View reviewed changes

python/pyspark/sql/tests/test_pandas_map.py Outdated Show resolved Hide resolved

HyukjinKwon approved these changes Feb 2, 2021

View reviewed changes

Update python/pyspark/sql/tests/test_pandas_map.py

5758c77

address suggestion Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>

Ngone51 changed the title ~~[SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas~~ [SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas Feb 2, 2021

HyukjinKwon closed this in e9362c2 Feb 2, 2021

Ngone51 deleted the fix-conflcting-attrs-of-FlatMapCoGroupsInPandas branch February 2, 2021 07:26

d80tb7 reviewed Feb 2, 2021

View reviewed changes

zhengruifeng mentioned this pull request Dec 4, 2024

[SPARK-50489][SQL][PYTHON] Fix self-join after applyInArrow #49056

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas #31429

[SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas #31429

Ngone51 commented Feb 2, 2021 •

edited by HyukjinKwon

Loading

Ngone51 commented Feb 2, 2021

SparkQA commented Feb 2, 2021

SparkQA commented Feb 2, 2021

HyukjinKwon left a comment

SparkQA commented Feb 2, 2021

SparkQA commented Feb 2, 2021

HyukjinKwon commented Feb 2, 2021

Ngone51 commented Feb 2, 2021

SparkQA commented Feb 2, 2021

d80tb7 left a comment

SparkQA commented Feb 2, 2021

SparkQA commented Feb 2, 2021

[SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas #31429

[SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas #31429

Conversation

Ngone51 commented Feb 2, 2021 • edited by HyukjinKwon Loading

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Ngone51 commented Feb 2, 2021

SparkQA commented Feb 2, 2021

SparkQA commented Feb 2, 2021

HyukjinKwon left a comment

Choose a reason for hiding this comment

SparkQA commented Feb 2, 2021

SparkQA commented Feb 2, 2021

HyukjinKwon commented Feb 2, 2021

Ngone51 commented Feb 2, 2021

SparkQA commented Feb 2, 2021

d80tb7 left a comment

Choose a reason for hiding this comment

SparkQA commented Feb 2, 2021

SparkQA commented Feb 2, 2021

Ngone51 commented Feb 2, 2021 •

edited by HyukjinKwon

Loading