
[WIP] Memory profiler prototype #7

Closed
wants to merge 9 commits

Conversation

@xinrong-meng
No description provided.

@xinrong-meng (Author)

  File "/Users/xinrong.meng/spark/python/lib/pyspark.zip/pyspark/worker.py", line 735, in profiling_func
    accumulator.add({result_id: (codemap_dict, None)})
  File "/Users/xinrong.meng/spark/python/lib/pyspark.zip/pyspark/accumulators.py", line 173, in add
    self._value = self.accum_param.addInPlace(self._value, term)
  File "/Users/xinrong.meng/spark/python/lib/pyspark.zip/pyspark/sql/profiler.py", line 48, in addInPlace
    PStatsParam.addInPlace(orig_perf, perf),
  File "/Users/xinrong.meng/spark/python/lib/pyspark.zip/pyspark/profiler.py", line 263, in addInPlace
    value1.add(value2)
AttributeError: 'dict' object has no attribute 'add'
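
For context on the failure, here is a minimal sketch (not PySpark source; `add_in_place` and its dispatch are hypothetical) of how the accumulator merge could distinguish the memory-profiler dict from a `pstats.Stats` perf profile, so the dict never reaches `.add()`:

```python
# Hypothetical sketch, not the PR's code: the traceback suggests a memory-profile
# result (a plain dict of code maps) is flowing into the pstats merge path.
# Dispatching on the value type avoids calling .add() on a dict.
import pstats

def add_in_place(value1, value2):
    """Merge two accumulated profile values of the same kind."""
    if value1 is None:
        return value2
    if value2 is None:
        return value1
    if isinstance(value1, pstats.Stats):
        value1.add(value2)  # perf profiles: pstats.Stats supports in-place merging
        return value1
    if isinstance(value1, dict):
        # Memory profiles: merge code-map dicts key by key (keep first entry per key).
        for key, codemap in value2.items():
            value1.setdefault(key, codemap)
        return value1
    raise TypeError(f"cannot merge profile values of type {type(value1)!r}")
```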

Review comment on python/pyspark/worker.py (outdated, resolved).
@xinrong-meng (Author)

```
23/12/21 13:44:12 ERROR Executor: Exception in task 5.0 in stage 0.0 (TID 5)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/miniconda3/envs/dev/lib/python3.9/inspect.py", line 1006, in getsourcelines
    lines, lnum = findsource(object)
  File "/opt/miniconda3/envs/dev/lib/python3.9/inspect.py", line 835, in findsource
    raise OSError('could not get source code')
OSError: could not get source code
```
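
This `OSError` is reproducible outside Spark whenever `inspect` cannot find a function's backing source file, which is the case for functions compiled from strings or defined in a bare REPL (an assumption about what the worker is profiling here):

```python
# Minimal standalone reproduction: a function created via exec() has no source
# file behind its code object, so inspect.findsource() raises OSError.
import inspect

namespace = {}
exec("def udf(x):\n    return x + 1\n", namespace)

try:
    inspect.getsourcelines(namespace["udf"])
except OSError as exc:
    print(exc)  # -> could not get source code
```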

xinrong-meng reopened this on Jan 2, 2024
@xinrong-meng (Author)

(screenshot attached)

ueshin pushed a commit that referenced this pull request on Feb 23, 2024: make `ResolveRelations` handle plan id properly

### What changes were proposed in this pull request?
Make `ResolveRelations` handle plan id properly

### Why are the changes needed?
Bug fix for Spark Connect; it does not affect classic Spark SQL.

Before this PR:
```
from pyspark.sql import functions as sf

spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")

df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")

join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index==df3.id)

join2.schema
```

fails with
```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
```

This is because the existing plan caching in `ResolveRelations` does not work with Spark Connect:

```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===

Before:
'[#12]Join LeftOuter, '`==`('index, 'id)
:- '[#9]UnresolvedRelation [test_table_1], [], false
+- '[#11]Project ['index, 'value_2]
   +- '[#10]Join Inner, '`==`('id, 'index)
      :- '[#7]UnresolvedRelation [test_table_1], [], false
      +- '[#8]UnresolvedRelation [test_table_2], [], false

After:
'[#12]Join LeftOuter, '`==`('index, 'id)
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
:  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
+- '[#11]Project ['index, 'value_2]
   +- '[#10]Join Inner, '`==`('id, 'index)
      :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
      :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
      +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
         +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false

Can not resolve 'id with plan 7
```

`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one:
```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
   +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
```
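
To illustrate the keying mistake in plain Python (the analyzer itself is Scala; `resolve_relation` and its cache are hypothetical models, not Spark source): a cache keyed by table name alone hands every later lookup the first resolved subtree, dropping the per-DataFrame plan id that Spark Connect needs for column resolution.

```python
# Hypothetical model of the stale-cache behavior described above.
resolved_cache = {}

def resolve_relation(table_name, plan_id):
    key = table_name  # buggy: ignores plan_id; a fix could key on (table_name, plan_id)
    if key not in resolved_cache:
        resolved_cache[key] = f"'[#{plan_id}]SubqueryAlias spark_catalog.default.{table_name}"
    return resolved_cache[key]

print(resolve_relation("test_table_1", 9))  # resolved and cached with plan id 9
print(resolve_relation("test_table_1", 7))  # still tagged #9, so plan 7's 'id cannot resolve
```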

### Does this PR introduce _any_ user-facing change?
Yes, bug fix.

### How was this patch tested?
Added a unit test (sketched below).
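
A sketch of the kind of unit test this could refer to (the test name is hypothetical; assumes an active Spark Connect session bound to `spark` and rights to create the two tables), reusing the reproduction from above:

```python
from pyspark.sql import functions as sf

def test_self_join_through_cached_relation():
    # Set up the two tables from the reproduction above.
    spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
    spark.range(10).withColumnRenamed("id", "index").withColumn(
        "value_2", sf.lit(2)
    ).write.saveAsTable("test_table_2")

    df1 = spark.read.table("test_table_1")
    df2 = spark.read.table("test_table_2")
    df3 = spark.read.table("test_table_1")

    join1 = df1.join(df2, on=df1.id == df2.index).select(df2.index, df2.value_2)
    join2 = df3.join(join1, how="left", on=join1.index == df3.id)

    # Before the fix, accessing the schema raised CANNOT_RESOLVE_DATAFRAME_COLUMN.
    assert {"id", "index", "value_2"} <= {f.name for f in join2.schema.fields}
```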

### Was this patch authored or co-authored using generative AI tooling?
ci

Closes apache#45214 from zhengruifeng/connect_fix_read_join.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>