
[WIP] Memory profiler prototype #7

Closed
wants to merge 9 commits

Conversation

@xinrong-meng
No description provided.

@xinrong-meng (Author)

  File "/Users/xinrong.meng/spark/python/lib/pyspark.zip/pyspark/worker.py", line 735, in profiling_func
    accumulator.add({result_id: (codemap_dict, None)})
  File "/Users/xinrong.meng/spark/python/lib/pyspark.zip/pyspark/accumulators.py", line 173, in add
    self._value = self.accum_param.addInPlace(self._value, term)
  File "/Users/xinrong.meng/spark/python/lib/pyspark.zip/pyspark/sql/profiler.py", line 48, in addInPlace
    PStatsParam.addInPlace(orig_perf, perf),
  File "/Users/xinrong.meng/spark/python/lib/pyspark.zip/pyspark/profiler.py", line 263, in addInPlace
    value1.add(value2)
AttributeError: 'dict' object has no attribute 'add'
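
For context on the failure, here is a minimal sketch (not PySpark source; `add_in_place` and its dispatch are hypothetical) of how the accumulator merge could distinguish the memory-profiler dict from a `pstats.Stats` perf profile, so the dict never reaches `.add()`:

```python
# Hypothetical sketch, not the PR's code: the traceback suggests a memory-profile
# result (a plain dict of code maps) is flowing into the pstats merge path.
# Dispatching on the value type avoids calling .add() on a dict.
import pstats

def add_in_place(value1, value2):
    """Merge two accumulated profile values of the same kind."""
    if value1 is None:
        return value2
    if value2 is None:
        return value1
    if isinstance(value1, pstats.Stats):
        value1.add(value2)  # perf profiles: pstats.Stats supports in-place merging
        return value1
    if isinstance(value1, dict):
        # Memory profiles: merge code-map dicts key by key (keep first entry per key).
        for key, codemap in value2.items():
            value1.setdefault(key, codemap)
        return value1
    raise TypeError(f"cannot merge profile values of type {type(value1)!r}")
```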

Review comment on python/pyspark/worker.py (outdated, resolved).
@xinrong-meng (Author)

```
23/12/21 13:44:12 ERROR Executor: Exception in task 5.0 in stage 0.0 (TID 5)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/miniconda3/envs/dev/lib/python3.9/inspect.py", line 1006, in getsourcelines
    lines, lnum = findsource(object)
  File "/opt/miniconda3/envs/dev/lib/python3.9/inspect.py", line 835, in findsource
    raise OSError('could not get source code')
OSError: could not get source code
```
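
This `OSError` is reproducible outside Spark whenever `inspect` cannot find a function's backing source file, which is the case for functions compiled from strings or defined in a bare REPL (an assumption about what the worker is profiling here):

```python
# Minimal standalone reproduction: a function created via exec() has no source
# file behind its code object, so inspect.findsource() raises OSError.
import inspect

namespace = {}
exec("def udf(x):\n    return x + 1\n", namespace)

try:
    inspect.getsourcelines(namespace["udf"])
except OSError as exc:
    print(exc)  # -> could not get source code
```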

xinrong-meng reopened this on Jan 2, 2024
@xinrong-meng (Author)

(screenshot attached)

ueshin pushed a commit that referenced this pull request on Feb 23, 2024: make `ResolveRelations` handle plan id properly

### What changes were proposed in this pull request?
Make `ResolveRelations` handle plan id properly

### Why are the changes needed?
Bug fix for Spark Connect; it does not affect classic Spark SQL.

Before this PR:
```
from pyspark.sql import functions as sf

spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")

df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")

join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index==df3.id)

join2.schema
```

fails with
```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
```

This is because the existing plan caching in `ResolveRelations` does not work with Spark Connect:

```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===

Before:
'[#12]Join LeftOuter, '`==`('index, 'id)
:- '[#9]UnresolvedRelation [test_table_1], [], false
+- '[#11]Project ['index, 'value_2]
   +- '[#10]Join Inner, '`==`('id, 'index)
      :- '[#7]UnresolvedRelation [test_table_1], [], false
      +- '[#8]UnresolvedRelation [test_table_2], [], false

After:
'[#12]Join LeftOuter, '`==`('index, 'id)
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
:  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
+- '[#11]Project ['index, 'value_2]
   +- '[#10]Join Inner, '`==`('id, 'index)
      :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
      :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
      +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
         +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false

Can not resolve 'id with plan 7
```

`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one:
```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
   +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
```
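
To illustrate the keying mistake in plain Python (the analyzer itself is Scala; `resolve_relation` and its cache are hypothetical models, not Spark source): a cache keyed by table name alone hands every later lookup the first resolved subtree, dropping the per-DataFrame plan id that Spark Connect needs for column resolution.

```python
# Hypothetical model of the stale-cache behavior described above.
resolved_cache = {}

def resolve_relation(table_name, plan_id):
    key = table_name  # buggy: ignores plan_id; a fix could key on (table_name, plan_id)
    if key not in resolved_cache:
        resolved_cache[key] = f"'[#{plan_id}]SubqueryAlias spark_catalog.default.{table_name}"
    return resolved_cache[key]

print(resolve_relation("test_table_1", 9))  # resolved and cached with plan id 9
print(resolve_relation("test_table_1", 7))  # still tagged #9, so plan 7's 'id cannot resolve
```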

### Does this PR introduce _any_ user-facing change?
Yes, bug fix.

### How was this patch tested?
Added a unit test (sketched below).
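
A sketch of the kind of unit test this could refer to (the test name is hypothetical; assumes an active Spark Connect session bound to `spark` and rights to create the two tables), reusing the reproduction from above:

```python
from pyspark.sql import functions as sf

def test_self_join_through_cached_relation():
    # Set up the two tables from the reproduction above.
    spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
    spark.range(10).withColumnRenamed("id", "index").withColumn(
        "value_2", sf.lit(2)
    ).write.saveAsTable("test_table_2")

    df1 = spark.read.table("test_table_1")
    df2 = spark.read.table("test_table_2")
    df3 = spark.read.table("test_table_1")

    join1 = df1.join(df2, on=df1.id == df2.index).select(df2.index, df2.value_2)
    join2 = df3.join(join1, how="left", on=join1.index == df3.id)

    # Before the fix, accessing the schema raised CANNOT_RESOLVE_DATAFRAME_COLUMN.
    assert {"id", "index", "value_2"} <= {f.name for f in join2.schema.fields}
```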

### Was this patch authored or co-authored using generative AI tooling?
ci

Closes apache#45214 from zhengruifeng/connect_fix_read_join.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>