
[SPARK-45449][SQL] Cache Invalidation Issue with JDBC Table #43258

Closed
wants to merge 6 commits

Conversation

lyy-pineapple
Contributor

@lyy-pineapple lyy-pineapple commented Oct 7, 2023

What changes were proposed in this pull request?

Add an equals method to JDBCOptions that considers two instances equal if their JDBCOptions.parameters are the same.
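The proposed fix can be sketched in plain, Spark-free Scala. `Options` and `Table` below are stand-ins for `JDBCOptions` and `JDBCTable` (the real change lives in `JDBCOptions` itself); equality is defined purely on the `parameters` map, as the PR describes:

```scala
// Stand-in for JDBCOptions: equality based solely on the parameters map.
class Options(val parameters: Map[String, String]) {
  override def equals(other: Any): Boolean = other match {
    case o: Options => parameters == o.parameters
    case _          => false
  }
  // equals and hashCode must agree, or cache lookups via hash maps still miss.
  override def hashCode(): Int = parameters.hashCode()
}

// Stand-in for JDBCTable: a case class whose equality recurses into its fields.
case class Table(name: String, options: Options)

val a = Table("t", new Options(Map("url" -> "jdbc:h2:mem:db")))
val b = Table("t", new Options(Map("url" -> "jdbc:h2:mem:db")))
// With equals defined on Options, two structurally identical tables now
// compare equal, so the cached plan is recognized and reused.
```

Overriding `hashCode` alongside `equals` matters here: cached plans are often looked up through hash-based structures, and mismatched implementations would still produce cache misses.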

Why are the changes needed?

We have identified a cache invalidation issue when caching JDBC tables in Spark SQL. The cached table is unexpectedly invalidated when queried, leading to a re-read from the JDBC table instead of retrieving data from the cache.
Example SQL:

```
CACHE TABLE cache_t SELECT * FROM mysql.test.test1;
SELECT * FROM cache_t;
```

Expected Behavior:
The expectation is that querying the cached table (cache_t) should retrieve the result from the cache without re-evaluating the execution plan.

Actual Behavior:
However, the cache is invalidated, and the content is re-read from the JDBC table.

Root Cause:
The issue lies in the CacheData class, where the comparison involves JDBCTable. The JDBCTable is a case class:

```
case class JDBCTable(ident: Identifier, schema: StructType, jdbcOptions: JDBCOptions)
```
Case-class equality compares each component with that component's own `equals`. `JDBCOptions` is not a case class and does not override `equals`, so the comparison falls back to reference (pointer) equality: two logically identical options objects compare unequal, and the cache entry is needlessly invalidated.
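The root cause can be reproduced in a few lines of plain Scala, with no Spark dependencies. `RawOptions` below mimics `JDBCOptions` before this fix (no `equals` override), and `JdbcTable` mimics the `JDBCTable` case class:

```scala
// Mimics JDBCOptions before the fix: no equals override, so equality
// defaults to Object.equals, i.e. reference identity.
class RawOptions(val parameters: Map[String, String])

// Mimics JDBCTable: case-class equality recurses into each component.
case class JdbcTable(name: String, options: RawOptions)

val t1 = JdbcTable("t", new RawOptions(Map("url" -> "jdbc:h2:mem:db")))
val t2 = JdbcTable("t", new RawOptions(Map("url" -> "jdbc:h2:mem:db")))
// The two tables are structurally identical, but `options` compares by
// pointer, so t1 == t2 is false — which is what invalidates the cache.
```

This is exactly the failure mode in the cache lookup: each query builds a fresh options instance from the same parameters, and the fresh instance never equals the cached one.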

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Oct 7, 2023
Contributor

@beliefer beliefer left a comment


cc @cloud-fan @MaxGekk
I have verified this manually, and there seems to be no problem.

@lyy-pineapple
Contributor Author

Thank you for your review. All tests have passed. Do you have any further feedback or suggestions? @beliefer

@beliefer
Contributor

beliefer commented Oct 8, 2023

> Thank you for your review. All tests have passed. Do you have any further feedback or suggestions? @beliefer

I'm OK. Please wait for the other owner's review.

@lyy-pineapple
Contributor Author

Could you review this PR when you get a chance? @cloud-fan @MaxGekk @sigmod

```
sql("CACHE TABLE t1 SELECT id, name FROM h2.test.cache_t")
val plan = sql("select * from t1").queryExecution.sparkPlan
assert(plan.isInstanceOf[InMemoryTableScanExec])
sql("UNCACHE TABLE IF EXISTS t1")
```
Contributor


nit: we don't need this `UNCACHE TABLE`, as `withTable` will drop the table at the end.
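The helper the reviewer mentions follows the loan pattern: cleanup runs in a `finally` block, so it happens even if the test body throws. A minimal stand-alone sketch (the real `withTable` lives in Spark's `SQLTestUtils`; `createTable`/`dropTable` here are hypothetical stand-ins for the catalog operations):

```scala
import scala.collection.mutable

// Toy "catalog" standing in for the session catalog.
val tables = mutable.Set.empty[String]
def createTable(name: String): Unit = tables += name
def dropTable(name: String): Unit = tables -= name

// Loan pattern: guarantee the named tables are dropped after the body,
// whether it returns normally or throws.
def withTable[A](names: String*)(body: => A): A =
  try body
  finally names.foreach(dropTable)

withTable("t1") {
  createTable("t1")
  assert(tables("t1")) // table exists inside the block
}
// t1 was dropped by the finally clause — no explicit cleanup needed.
```

This is why the trailing `UNCACHE TABLE IF EXISTS t1` in the test snippet above is redundant.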

Contributor Author


Thanks, I have removed it.

@cloud-fan
Contributor

This is a good catch! I just have one question though: JDBCOptions is also used in JDBC v1, why does v1 not have this cache issue?

@lyy-pineapple
Contributor Author

> This is a good catch! I just have one question though: JDBCOptions is also used in JDBC v1, why does v1 not have this cache issue?

Debugging revealed that in v1, the `LogicalRelation` is constructed through `makeCopy`, which reuses the existing `JDBCRelation` instance, so reference equality still holds. Someday this could become an issue for v1 as well.

@cloud-fan
Contributor

thanks, merging to master/3.5!

@cloud-fan cloud-fan closed this in d073f2d Oct 10, 2023
cloud-fan pushed a commit that referenced this pull request Oct 10, 2023
Closes #43258 from lyy-pineapple/spark-git-cache.

Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d073f2d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>