[SPARK-45449][SQL] Cache Invalidation Issue with JDBC Table #43258
Conversation
cc @cloud-fan @MaxGekk
I have manually verified it, and there seems to be no problem.
Thank you for your review. All tests have passed. Do you have any further feedback or suggestions? @beliefer
I'm OK. Please wait for the other owner's review.
Could you review this PR when you get a chance @cloud-fan @MaxGekk @sigmod
```
sql("CACHE TABLE t1 SELECT id, name FROM h2.test.cache_t")
val plan = sql("select * from t1").queryExecution.sparkPlan
assert(plan.isInstanceOf[InMemoryTableScanExec])
sql("UNCACHE TABLE IF EXISTS t1")
```
nit: we don't need this `UNCACHE TABLE`, as `withTable` will drop the table at the end.
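The `withTable` pattern the reviewer refers to can be sketched as a simple loan helper: the `finally` block always drops the named tables, so an explicit cleanup statement at the end of the test body is redundant. This is an illustrative stand-in, not Spark's actual `SQLTestUtils.withTable` (which issues `DROP TABLE IF EXISTS` through the session); here the stub just records the drops.

```scala
import scala.collection.mutable.ArrayBuffer

// Records which tables were dropped, standing in for spark.sql("DROP TABLE ...").
val dropped = ArrayBuffer.empty[String]

// Run the test body, then unconditionally drop the tables, even if the body throws.
def withTable(tableNames: String*)(f: => Unit): Unit =
  try f
  finally tableNames.foreach(name => dropped += s"DROP TABLE IF EXISTS $name")

withTable("t1") {
  // test body: cache the table, query it, assert on the plan...
}
assert(dropped == Seq("DROP TABLE IF EXISTS t1"))
```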
Thanks, I have removed it.
This is a good catch! I just have one question though:
Debugging revealed that when using v1, the construction of `LogicalRelation` through `makeCopy` reuses the `JDBCRelation`. Perhaps, someday, this could pose a potential issue for v1 as well.
Thanks, merging to master/3.5!
### What changes were proposed in this pull request?
Add an `equals` method to `JDBCOptions` that considers two instances equal if their `JDBCOptions.parameters` are the same.

### Why are the changes needed?
We have identified a cache invalidation issue when caching JDBC tables in Spark SQL. The cached table is unexpectedly invalidated when queried, leading to a re-read from the JDBC table instead of retrieving data from the cache.

Example SQL:
```
CACHE TABLE cache_t SELECT * FROM mysql.test.test1;
SELECT * FROM cache_t;
```

Expected behavior: querying the cached table (`cache_t`) should retrieve the result from the cache without re-evaluating the execution plan.

Actual behavior: the cache is invalidated, and the content is re-read from the JDBC table.

Root cause: the issue lies in the `CachedData` class, where the comparison involves `JDBCTable`. `JDBCTable` is a case class:
```
case class JDBCTable(ident: Identifier, schema: StructType, jdbcOptions: JDBCOptions)
```
The comparison of non-case-class components, such as `jdbcOptions`, falls back to pointer comparison. This leads to unnecessary cache invalidation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43258 from lyy-pineapple/spark-git-cache.

Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d073f2d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
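The root cause described above can be reproduced in isolation: a Scala case class synthesizes `equals` over its fields, but a field whose type is a plain class falls back to `AnyRef` reference equality, so two structurally identical instances never compare equal. The class names below (`Options`, `Table`, `FixedOptions`, `FixedTable`) are illustrative stand-ins, not Spark's actual `JDBCOptions`/`JDBCTable`.

```scala
// A plain class: inherits reference equality from AnyRef.
class Options(val parameters: Map[String, String])

// A case class: generated equals delegates to each field's equals.
case class Table(name: String, options: Options)

val a = Table("t", new Options(Map("url" -> "jdbc:h2:mem:db")))
val b = Table("t", new Options(Map("url" -> "jdbc:h2:mem:db")))
// Pointer comparison on `options` breaks structural equality,
// which is what invalidated the cache lookup in this PR.
assert(a != b)

// The fix mirrors this PR: define equals/hashCode over `parameters`.
class FixedOptions(val parameters: Map[String, String]) {
  override def equals(other: Any): Boolean = other match {
    case o: FixedOptions => parameters == o.parameters
    case _               => false
  }
  override def hashCode(): Int = parameters.hashCode()
}

case class FixedTable(name: String, options: FixedOptions)

val c = FixedTable("t", new FixedOptions(Map("url" -> "jdbc:h2:mem:db")))
val d = FixedTable("t", new FixedOptions(Map("url" -> "jdbc:h2:mem:db")))
// Structurally equal instances now compare equal, so a cache lookup can match.
assert(c == d)
```

Overriding `hashCode` alongside `equals` matters here: cached plans are looked up via hash-based collections, and the two must stay consistent.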