
[SPARK-45449][SQL] Cache Invalidation Issue with JDBC Table #43258

Closed
wants to merge 6 commits

Conversation

lyy-pineapple
Contributor

@lyy-pineapple lyy-pineapple commented Oct 7, 2023

What changes were proposed in this pull request?

Add an equals method to JDBCOptions that considers two instances equal if their JDBCOptions.parameters are the same.
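The proposed fix can be sketched in plain, Spark-free Scala. `Options` and `Table` below are stand-ins for `JDBCOptions` and `JDBCTable` (the real change lives in `JDBCOptions` itself); equality is defined purely on the `parameters` map, as the PR describes:

```scala
// Stand-in for JDBCOptions: equality based solely on the parameters map.
class Options(val parameters: Map[String, String]) {
  override def equals(other: Any): Boolean = other match {
    case o: Options => parameters == o.parameters
    case _          => false
  }
  // equals and hashCode must agree, or cache lookups via hash maps still miss.
  override def hashCode(): Int = parameters.hashCode()
}

// Stand-in for JDBCTable: a case class whose equality recurses into its fields.
case class Table(name: String, options: Options)

val a = Table("t", new Options(Map("url" -> "jdbc:h2:mem:db")))
val b = Table("t", new Options(Map("url" -> "jdbc:h2:mem:db")))
// With equals defined on Options, two structurally identical tables now
// compare equal, so the cached plan is recognized and reused.
```

Overriding `hashCode` alongside `equals` matters here: cached plans are often looked up through hash-based structures, and mismatched implementations would still produce cache misses.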

Why are the changes needed?

We have identified a cache invalidation issue when caching JDBC tables in Spark SQL. The cached table is unexpectedly invalidated when queried, leading to a re-read from the JDBC table instead of retrieving data from the cache.
Example SQL:

```
CACHE TABLE cache_t SELECT * FROM mysql.test.test1;
SELECT * FROM cache_t;
```

Expected Behavior:
The expectation is that querying the cached table (cache_t) should retrieve the result from the cache without re-evaluating the execution plan.

Actual Behavior:
However, the cache is invalidated, and the content is re-read from the JDBC table.

Root Cause:
The issue lies in the CacheData class, where the comparison involves JDBCTable. The JDBCTable is a case class:

```
case class JDBCTable(ident: Identifier, schema: StructType, jdbcOptions: JDBCOptions)
```
Case-class equality compares each component with that component's own `equals`. `JDBCOptions` is not a case class and does not override `equals`, so the comparison falls back to reference (pointer) equality: two logically identical options objects compare unequal, and the cache entry is needlessly invalidated.
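The root cause can be reproduced in a few lines of plain Scala, with no Spark dependencies. `RawOptions` below mimics `JDBCOptions` before this fix (no `equals` override), and `JdbcTable` mimics the `JDBCTable` case class:

```scala
// Mimics JDBCOptions before the fix: no equals override, so equality
// defaults to Object.equals, i.e. reference identity.
class RawOptions(val parameters: Map[String, String])

// Mimics JDBCTable: case-class equality recurses into each component.
case class JdbcTable(name: String, options: RawOptions)

val t1 = JdbcTable("t", new RawOptions(Map("url" -> "jdbc:h2:mem:db")))
val t2 = JdbcTable("t", new RawOptions(Map("url" -> "jdbc:h2:mem:db")))
// The two tables are structurally identical, but `options` compares by
// pointer, so t1 == t2 is false — which is what invalidates the cache.
```

This is exactly the failure mode in the cache lookup: each query builds a fresh options instance from the same parameters, and the fresh instance never equals the cached one.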

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Oct 7, 2023
Contributor

@beliefer beliefer left a comment


cc @cloud-fan @MaxGekk
I have verified this manually, and there seems to be no problem.

@lyy-pineapple
Contributor Author

Thank you for your review. All tests have passed. Do you have any further feedback or suggestions? @beliefer

@beliefer
Contributor

beliefer commented Oct 8, 2023

> Thank you for your review. All tests have passed. Do you have any further feedback or suggestions? @beliefer

I'm OK. Please wait for the other owner's review.

@lyy-pineapple
Contributor Author

Could you review this PR when you get a chance? @cloud-fan @MaxGekk @sigmod

```
sql("CACHE TABLE t1 SELECT id, name FROM h2.test.cache_t")
val plan = sql("select * from t1").queryExecution.sparkPlan
assert(plan.isInstanceOf[InMemoryTableScanExec])
sql("UNCACHE TABLE IF EXISTS t1")
```
Contributor


nit: we don't need this `UNCACHE TABLE`, as `withTable` will drop the table at the end.
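The helper the reviewer mentions follows the loan pattern: cleanup runs in a `finally` block, so it happens even if the test body throws. A minimal stand-alone sketch (the real `withTable` lives in Spark's `SQLTestUtils`; `createTable`/`dropTable` here are hypothetical stand-ins for the catalog operations):

```scala
import scala.collection.mutable

// Toy "catalog" standing in for the session catalog.
val tables = mutable.Set.empty[String]
def createTable(name: String): Unit = tables += name
def dropTable(name: String): Unit = tables -= name

// Loan pattern: guarantee the named tables are dropped after the body,
// whether it returns normally or throws.
def withTable[A](names: String*)(body: => A): A =
  try body
  finally names.foreach(dropTable)

withTable("t1") {
  createTable("t1")
  assert(tables("t1")) // table exists inside the block
}
// t1 was dropped by the finally clause — no explicit cleanup needed.
```

This is why the trailing `UNCACHE TABLE IF EXISTS t1` in the test snippet above is redundant.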

Contributor Author


Thanks, I have removed it.

@cloud-fan
Contributor

This is a good catch! I just have one question though: JDBCOptions is also used in JDBC v1, why does v1 not have this cache issue?

@lyy-pineapple
Contributor Author

> This is a good catch! I just have one question though: JDBCOptions is also used in JDBC v1, why does v1 not have this cache issue?

Debugging revealed that in v1, the `LogicalRelation` is constructed through `makeCopy`, which reuses the existing `JDBCRelation` instance, so reference equality still holds. Someday this could become an issue for v1 as well.

@cloud-fan
Contributor

thanks, merging to master/3.5!

@cloud-fan cloud-fan closed this in d073f2d Oct 10, 2023
cloud-fan pushed a commit that referenced this pull request Oct 10, 2023
Closes #43258 from lyy-pineapple/spark-git-cache.

Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d073f2d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>