
[SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests #46012

Closed

Conversation

xi-db
Contributor

@xi-db xi-db commented Apr 11, 2024

What changes were proposed in this pull request?

While building a DataFrame step by step, each transformation produces a new DataFrame whose schema is lazily computed on access. However, if user code frequently accesses the schemas of these new DataFrames using methods such as df.columns, it results in a large number of Analyze requests to the server. Each request reanalyzes the entire plan, leading to poor performance, especially when constructing highly complex plans.

Now, by introducing a plan cache in SparkConnectPlanner, we aim to reduce the overhead of repeated analysis during this process. Caching the resolved logical plan of a subtree saves significant computation whenever that subtree reappears in a later plan.

A minimal example of the problem:

import pyspark.sql.functions as F
df = spark.range(10)
for i in range(200):
  if str(i) not in df.columns: # <-- The df.columns call causes a new Analyze request in every iteration
    df = df.withColumn(str(i), F.col("id") + i)
df.show() 

With this patch, the performance of the above code improved from ~110s to ~5s.
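To see why caching helps so much here, consider a plain-Python sketch of memoizing analysis results by plan id. This is only an illustration of the idea, not the server's actual Scala implementation; `Plan`, `analyze`, and `analyze_count` are hypothetical stand-ins:

```python
class Plan:
    """A toy relation tree; each node gets a unique plan id, like Spark Connect relations."""
    _next_id = 0

    def __init__(self, child=None):
        Plan._next_id += 1
        self.plan_id = Plan._next_id
        self.child = child


analyze_count = 0


def analyze(plan, cache):
    """Resolve a plan bottom-up, reusing cached results for already-analyzed subtrees."""
    global analyze_count
    if plan.plan_id in cache:
        return cache[plan.plan_id]
    child_schema = analyze(plan.child, cache) if plan.child else []
    analyze_count += 1                      # one unit of real analysis work
    schema = child_schema + [f"col_{plan.plan_id}"]
    cache[plan.plan_id] = schema
    return schema


# Build a 200-deep chain, asking for the schema at every step,
# mirroring the df.columns loop above.
cache = {}
df = Plan()
for _ in range(200):
    analyze(df, cache)                      # e.g. df.columns
    df = Plan(child=df)                     # e.g. df.withColumn(...)
```

With the cache, each new node is analyzed exactly once (about 200 units of work for this loop); without it, every schema access would reanalyze the whole chain, roughly 20,000 units.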

Why are the changes needed?

The performance improvement is substantial in cases like the one above.

Does this PR introduce any user-facing change?

Yes, a static conf spark.connect.session.planCache.maxSize and a dynamic conf spark.connect.session.planCache.enabled are added.

  • spark.connect.session.planCache.maxSize: Sets the maximum number of cached resolved logical plans in the Spark Connect session. Setting it to a value less than or equal to zero disables the plan cache.
  • spark.connect.session.planCache.enabled: When true, the cache of resolved logical plans is enabled if spark.connect.session.planCache.maxSize is greater than zero. When false, the cache is disabled even if spark.connect.session.planCache.maxSize is greater than zero. The caching is best-effort and not guaranteed.
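The interaction between the two confs can be summarized with a tiny predicate. This is a hypothetical helper for illustration, not Spark's actual configuration machinery:

```python
def plan_cache_active(enabled: bool, max_size: int) -> bool:
    """Whether the plan cache is effectively in use for a session.

    enabled  -- dynamic conf spark.connect.session.planCache.enabled
    max_size -- static conf spark.connect.session.planCache.maxSize
    """
    # Both knobs must agree: the dynamic flag must be true AND the static
    # size must be positive; either one alone is enough to disable caching.
    return enabled and max_size > 0
```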

How was this patch tested?

Some new tests are added in SparkConnectSessionHolderSuite.scala.

Was this patch authored or co-authored using generative AI tooling?

No.

@beliefer
Contributor

I submitted #43473 half a year ago.
It seems it did not receive enough attention.

@xi-db xi-db changed the title [WIP][SPARK-47818] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests [SPARK-47818] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests Apr 12, 2024
@xi-db xi-db changed the title [SPARK-47818] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests [SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests Apr 12, 2024
@zhengruifeng
Copy link
Contributor

cc @HyukjinKwon and @ueshin

@xi-db xi-db changed the title [SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests [WIP][SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests Apr 12, 2024
* @param transform Function to transform the relation into a logical plan.
* @return The logical plan.
*/
private[connect] def usePlanCache(rel: proto.Relation, cachePlan: Boolean)(
Contributor

@zhengruifeng zhengruifeng Apr 12, 2024


I think we may want to exclude some kinds of plans:
1, LOCAL_RELATION, its size might be large;
2, CACHED_REMOTE_RELATION, it is already cached in dataFrameCache in the SessionHolder;
3, CATALOG, it may be a command like CREATE_TABLE/DROP_TEMP_VIEW , should not be skipped.

Contributor


@zhengruifeng Does point 3 extend to all DDL/DML commands?

Contributor


1, LOCAL_RELATION, its size might be large;
2, CACHED_REMOTE_RELATION, it is already cached in dataFrameCache in the SessionHolder;

We use a low default value for CONNECT_SESSION_PLAN_CACHE_SIZE, 5 entries to be specific, to prevent the cache from becoming too large. We haven't been too surgical about which kinds of plans we cache; we keep it a simple wrapper that helps common scenarios which generate repeated Analyze/Execute calls.
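The bounded, best-effort behavior described above can be sketched as a small LRU cache in Python. This is illustrative only; the server-side cache is implemented in Scala, and `BoundedPlanCache` is a hypothetical name:

```python
from collections import OrderedDict


class BoundedPlanCache:
    """Minimal LRU cache: at most max_size resolved plans are kept."""

    def __init__(self, max_size=5):       # 5 mirrors the default mentioned above
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, plan_id):
        if plan_id in self._entries:
            self._entries.move_to_end(plan_id)   # mark as most recently used
            return self._entries[plan_id]
        return None                              # best-effort: a miss is fine

    def put(self, plan_id, resolved_plan):
        if self.max_size <= 0:
            return                               # caching disabled
        self._entries[plan_id] = resolved_plan
        self._entries.move_to_end(plan_id)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)    # evict least recently used
```

With a capacity of 5, inserting a sixth plan silently evicts the least recently used one, so memory stays bounded regardless of how many plans a session analyzes.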

Contributor Author


3, CATALOG, it may be a command like CREATE_TABLE/DROP_TEMP_VIEW, should not be skipped.

Commands should be fine. Each command will get a new plan_id, so a new command won't hit the cache. Besides, we only cache plans with plan_id set.
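The plan_id argument can be illustrated with a small sketch (the `relation` helper is hypothetical; in reality the Spark Connect client assigns plan ids to proto relations): because a repeated command gets a fresh id, it can never hit a stale cache entry.

```python
import itertools

_ids = itertools.count(1)


def relation(op):
    # The client assigns every new relation (including commands) a fresh
    # plan_id, so two identical commands never share a cache key.
    return {"plan_id": next(_ids), "op": op}


cache = {}
first = relation("CREATE TABLE t AS SELECT 1")
cache[first["plan_id"]] = "executed"

second = relation("CREATE TABLE t AS SELECT 1")
assert second["plan_id"] not in cache   # fresh id, so no stale cache hit
```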

Contributor


Yeah, I forgot about the new plan_id. Then it's fine.

@github-actions github-actions bot added the CORE label Apr 15, 2024
@xi-db xi-db changed the title [WIP][SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests [SPARK-47818][CONNECT] Introduce plan cache in SparkConnectPlanner to improve performance of Analyze requests Apr 16, 2024
@HyukjinKwon
Member

Merged to master.

hvanhovell pushed a commit that referenced this pull request Apr 26, 2024
…tPlanner to improve performance of Analyze requests

### What changes were proposed in this pull request?

In [the previous PR](#46012), we cache plans of AnalyzePlan requests. We're also enabling it for ExecutePlan in this PR.

### Why are the changes needed?

Some operations like spark.sql() issue ExecutePlan requests. By caching them, we can also improve performance if subsequent plans to be analyzed include the plan returned by ExecutePlan as a subtree.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46098 from xi-db/SPARK-47818-plan-cache-followup.

Authored-by: Xi Lyu <xi.lyu@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
hvanhovell pushed a commit that referenced this pull request May 17, 2024
…tPlanner to improve performance of Analyze requests

### What changes were proposed in this pull request?

In [this previous PR](#46012), we introduced two new confs for the plan cache: a static conf `spark.connect.session.planCache.maxSize` and a dynamic conf `spark.connect.session.planCache.enabled`. The plan cache is enabled by default with size 5. In this PR, we are marking both confs as internal because we don't expect users to deal with them.

### Why are the changes needed?

These two confs are not expected to be used under normal circumstances, and we don't need to document them on the Spark Configuration reference page.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46638 from xi-db/SPARK-47818-plan-cache-followup2.

Authored-by: Xi Lyu <xi.lyu@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>