[SPARK-50489][SQL][PYTHON] Fix self-join after `applyInArrow` #49056

zhengruifeng · 2024-12-04T09:57:59Z

What changes were proposed in this pull request?

Fix self-join after `applyInArrow`. The same issue for `applyInPandas` was fixed in #31429.

Why are the changes needed?

bug fix

before:

In [1]: import pyarrow as pa

In [2]: df = spark.createDataFrame([(1, 1)], ("k", "v"))

In [3]: def arrow_func(key, table):
   ...:     return pa.Table.from_pydict({"x": [2], "y": [2]})
   ...:

In [4]: df2 = df.groupby("k").applyInArrow(arrow_func, schema="x long, y long")

In [5]: df2.show()
24/12/04 17:47:43 WARN CheckAllocator: More than one DefaultAllocationManager on classpath. Choosing first found
+---+---+
|  x|  y|
+---+---+
|  2|  2|
+---+---+


In [6]: df2.join(df2)
...
Failure when resolving conflicting references in Join:
'Join Inner
:- FlatMapGroupsInArrow [k#0L], arrow_func(k#0L, v#1L)#2, [x#3L, y#4L]
:  +- Project [k#0L, k#0L, v#1L]
:     +- LogicalRDD [k#0L, v#1L], false
+- FlatMapGroupsInArrow [k#12L], arrow_func(k#12L, v#13L)#2, [x#3L, y#4L]
   +- Project [k#12L, k#12L, v#13L]
      +- LogicalRDD [k#12L, v#13L], false

Conflicting attributes: "x", "y". SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:79)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:798)

after:

In [6]: df2.join(df2)
Out[6]: DataFrame[x: bigint, y: bigint, x: bigint, y: bigint]

In [7]: df2.join(df2).show()
+---+---+---+---+
|  x|  y|  x|  y|
+---+---+---+---+
|  2|  2|  2|  2|
+---+---+---+---+
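
A rough way to read the failure above: both sides of the self-join expose the same output attributes (`x#3L`, `y#4L`), so the analyzer cannot tell them apart. The sketch below is a toy Python model of that conflict and of resolving it by giving one side fresh expression IDs. It is not Spark's implementation; `Attr`, `Plan`, `conflicting`, and `dedup_right` are hypothetical names introduced only for illustration, and the assumption that the fix regenerates IDs on one side is inferred from the trace rather than stated in this PR.

```python
from dataclasses import dataclass, replace
from itertools import count

# Toy stand-ins for Catalyst concepts; none of these names are Spark APIs.
_next_id = count(100)  # source of fresh (hypothetical) expression IDs


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int


@dataclass(frozen=True)
class Plan:
    label: str
    output: tuple  # tuple of Attr


def conflicting(left, right):
    """Names of right-side attributes whose IDs also appear on the left."""
    left_ids = {a.expr_id for a in left.output}
    return [a.name for a in right.output if a.expr_id in left_ids]


def dedup_right(left, right):
    """Assign fresh IDs to every right-side attribute that clashes with the left."""
    left_ids = {a.expr_id for a in left.output}
    fresh = tuple(
        replace(a, expr_id=next(_next_id)) if a.expr_id in left_ids else a
        for a in right.output
    )
    return replace(right, output=fresh)


# Mirrors the output [x#3L, y#4L] shown in the failing plan.
df2 = Plan("FlatMapGroupsInArrow", (Attr("x", 3), Attr("y", 4)))

print(conflicting(df2, df2))    # both attributes clash in a self-join
fixed = dedup_right(df2, df2)
print(conflicting(df2, fixed))  # no clash once the right side has new IDs
```

In this model the column names are unchanged after deduplication, which matches the `DataFrame[x: bigint, y: bigint, x: bigint, y: bigint]` result shown for the fixed build.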

Does this PR introduce any user-facing change?

bug fix

How was this patch tested?

added tests

Was this patch authored or co-authored using generative AI tooling?

no

HyukjinKwon · 2024-12-05T00:41:02Z

Merged to master.

…eRelations#collectConflictPlans`

### What changes were proposed in this pull request?
Add applyInArrow in `DeduplicateRelations#collectConflictPlans`

### Why are the changes needed?
In #49056, I forgot to add `applyInArrow` in `DeduplicateRelations#collectConflictPlans`

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
tests added in #49056

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #49069 from zhengruifeng/apply_in_arrow_rule.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

Zand100 · 2024-12-13T21:28:54Z

Hi @zhengruifeng, does this pull request fix a bug introduced in #41347? We maintain a fork of Spark, and we're wondering if we need to cherry-pick this bug fix now. We don't have #41347 in our fork. (If we don't need to cherry-pick this bug fix, we'll get all these commits when we upgrade.) Thank you!

fix

fcb92b5

github-actions bot added SQL PYTHON labels Dec 4, 2024

zhengruifeng requested review from HyukjinKwon and Ngone51 December 4, 2024 10:00

fix lint

8aa3772

HyukjinKwon approved these changes Dec 5, 2024

HyukjinKwon closed this in 7278bc7 Dec 5, 2024

zhengruifeng deleted the fix_arrow_join branch December 5, 2024 00:52

zhengruifeng mentioned this pull request Dec 5, 2024

[SPARK-50489][SQL][PYTHON][FOLLOW-UP] Add applyInArrow in DeduplicateRelations#collectConflictPlans #49069

Closed
