[SPARK-50489][SQL][PYTHON] Fix self-join after applyInArrow #49056

Closed
wants to merge 2 commits into from


zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Dec 4, 2024

What changes were proposed in this pull request?

Fix self-join after `applyInArrow`. The same issue for `applyInPandas` was fixed in #31429.

Why are the changes needed?

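Bug fix: joining the result of `applyInArrow` with itself fails analysis with conflicting attribute references.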

before:

In [1]: import pyarrow as pa

In [2]: df = spark.createDataFrame([(1, 1)], ("k", "v"))

In [3]: def arrow_func(key, table):
   ...:     return pa.Table.from_pydict({"x": [2], "y": [2]})
   ...:

In [4]: df2 = df.groupby("k").applyInArrow(arrow_func, schema="x long, y long")

In [5]: df2.show()
24/12/04 17:47:43 WARN CheckAllocator: More than one DefaultAllocationManager on classpath. Choosing first found
+---+---+
|  x|  y|
+---+---+
|  2|  2|
+---+---+


In [6]: df2.join(df2)
...
Failure when resolving conflicting references in Join:
'Join Inner
:- FlatMapGroupsInArrow [k#0L], arrow_func(k#0L, v#1L)#2, [x#3L, y#4L]
:  +- Project [k#0L, k#0L, v#1L]
:     +- LogicalRDD [k#0L, v#1L], false
+- FlatMapGroupsInArrow [k#12L], arrow_func(k#12L, v#13L)#2, [x#3L, y#4L]
   +- Project [k#12L, k#12L, v#13L]
      +- LogicalRDD [k#12L, v#13L], false

Conflicting attributes: "x", "y". SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:79)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:798)

after:

In [6]: df2.join(df2)
Out[6]: DataFrame[x: bigint, y: bigint, x: bigint, y: bigint]

In [7]: df2.join(df2).show()
+---+---+---+---+
|  x|  y|  x|  y|
+---+---+---+---+
|  2|  2|  2|  2|
+---+---+---+---+
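
To illustrate why the "before" plan fails, here is a toy model in plain Python. It is a sketch under stated assumptions, not Spark's Catalyst code: `Attr`, `Node`, `conflicting`, and `dedup_join` are hypothetical names invented for this example. It only shows the general idea that both sides of the self-join expose the same output attribute IDs (`x#3L`, `y#4L` in the plan above), and that giving one side fresh IDs removes the conflict.

```python
from dataclasses import dataclass
from itertools import count

# Toy stand-in for an expression-ID generator (hypothetical, not Spark's API).
_next_id = count(100)


@dataclass(frozen=True)
class Attr:
    """An output attribute: a column name plus a unique expression ID."""
    name: str
    expr_id: int

    def __str__(self):
        return f"{self.name}#{self.expr_id}"


@dataclass(frozen=True)
class Node:
    """A plan node described only by its label and output attributes."""
    label: str
    output: tuple

    def new_instance(self):
        # Re-mint every output attribute with a fresh ID, keeping the names.
        fresh = tuple(Attr(a.name, next(_next_id)) for a in self.output)
        return Node(self.label, fresh)


def conflicting(left, right):
    """Return the right-side attributes whose IDs also appear on the left."""
    left_ids = {a.expr_id for a in left.output}
    return [a for a in right.output if a.expr_id in left_ids]


def dedup_join(left, right):
    """Give the right side fresh IDs when it collides with the left side."""
    if conflicting(left, right):
        right = right.new_instance()
    return left, right


# Same node on both sides, as in df2.join(df2).
df2 = Node("FlatMapGroupsInArrow", (Attr("x", 3), Attr("y", 4)))
print([str(a) for a in conflicting(df2, df2)])  # ['x#3', 'y#4']

left, right = dedup_join(df2, df2)
print(conflicting(left, right))  # []
```

In this model the column names on the right side are unchanged and only the IDs differ, which matches the "after" output where both `x` and `y` appear twice.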

Does this PR introduce any user-facing change?

Yes, a bug fix: the self-join shown above no longer fails.

How was this patch tested?

added tests

Was this patch authored or co-authored using generative AI tooling?

no

@HyukjinKwon
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the fix_arrow_join branch December 5, 2024 00:52
HyukjinKwon pushed a commit that referenced this pull request Dec 6, 2024
…eRelations#collectConflictPlans`

### What changes were proposed in this pull request?
Add applyInArrow in `DeduplicateRelations#collectConflictPlans`

### Why are the changes needed?
In #49056, I forgot to add `applyInArrow` in `DeduplicateRelations#collectConflictPlans`

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
tests added in #49056

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #49069 from zhengruifeng/apply_in_arrow_rule.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@Zand100

Zand100 commented Dec 13, 2024

Hi @zhengruifeng, does this pull request fix a bug introduced in #41347? We maintain a fork of Spark, and we're wondering whether we need to cherry-pick this bug fix now. We don't have #41347 in our fork. (If we don't need to cherry-pick it, we'll get all these commits when we upgrade.) Thank you!
