Flush transaction log cache in Delta flush_metadata_cache procedure #16466

Merged
merged 5 commits into from
Mar 22, 2023

Conversation

ebyhr
Member

@ebyhr ebyhr commented Mar 9, 2023

Description

Relates to #13737

Release notes

(x) Release notes are required, with the following suggested text:

# Delta Lake
* Flush internal caches of table snapshots and active data files with `flush_metadata_cache` procedure. ({issue}`16466`)
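As a rough illustration of what flushing "table snapshots and active data files" could look like, here is a minimal sketch. The class `TransactionLogCacheSketch` and its methods are hypothetical and are not Trino's actual `TransactionLogAccess` API. Per-table invalidation is an assumption added for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a transaction log cache; NOT Trino's TransactionLogAccess.
public class TransactionLogCacheSketch
{
    // Table identifier (for example "schema.table") -> cached snapshot version
    private final Map<String, Long> snapshotVersions = new ConcurrentHashMap<>();
    // Table identifier -> cached list of active data file paths
    private final Map<String, List<String>> activeFiles = new ConcurrentHashMap<>();

    public void cacheTable(String table, long version, List<String> files)
    {
        snapshotVersions.put(table, version);
        activeFiles.put(table, List.copyOf(files));
    }

    public boolean isCached(String table)
    {
        return snapshotVersions.containsKey(table) || activeFiles.containsKey(table);
    }

    // Drop every cached snapshot and active file listing
    public void flushAll()
    {
        snapshotVersions.clear();
        activeFiles.clear();
    }

    // Drop cached entries for a single table (assumed behavior, for illustration)
    public void invalidate(String table)
    {
        snapshotVersions.remove(table);
        activeFiles.remove(table);
    }

    public static void main(String[] args)
    {
        TransactionLogCacheSketch cache = new TransactionLogCacheSketch();
        cache.cacheTable("sales.orders", 3L, List.of("part-0001.parquet"));
        cache.cacheTable("sales.items", 7L, List.of("part-0002.parquet"));

        cache.invalidate("sales.orders");
        System.out.println("orders cached: " + cache.isCached("sales.orders"));
        System.out.println("items cached: " + cache.isCached("sales.items"));

        cache.flushAll();
        System.out.println("items cached after flushAll: " + cache.isCached("sales.items"));
    }
}
```

Both maps are cleared together so that a cached file listing never outlives the snapshot it was derived from.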

@cla-bot cla-bot bot added the cla-signed label Mar 9, 2023
@ebyhr ebyhr changed the title from "Ebi/delta flush cache" to "Flush transaction log cache in Delta flush_metadata_cache procedure" Mar 9, 2023
@ebyhr ebyhr force-pushed the ebi/delta-flush-cache branch from b5015ef to aa7008d Compare March 9, 2023 11:05
@github-actions github-actions bot added delta-lake Delta Lake connector hive Hive connector tests:hive labels Mar 9, 2023
@@ -56,7 +68,9 @@ protected void setup(Binder binder)
        newExporter(binder).export(HiveMetastoreFactory.class)
                .as(generator -> generator.generatedNameOf(CachingHiveMetastore.class));

        newSetBinder(binder, Procedure.class).addBinding().toProvider(FlushHiveMetastoreCacheProcedure.class).in(Scopes.SINGLETON);
        if (installFlushMetadataCacheProcedure) {
Contributor

Do we want this procedure installed for Iceberg?
To be honest, I didn't know that it already existed in Iceberg.

Contributor

I'd rather have the method always available in the connector (register it in HiveProcedureModule) and throw an exception (as your implementation does for Delta) when the caching metastore is not in use.

Member Author

@ebyhr ebyhr Mar 10, 2023

Added another commit to avoid installing the procedure in Iceberg.

> I'd rather have the method always available in the connector

What's the rationale or motivation for this? My opinion is basically the opposite: there is no need to install a procedure when it's unusable.

Contributor

> What's the rationale or motivation of this?

I was thinking more from a user perspective: a user may be confused that the procedure is not available for the Hive/Delta connector (e.g., while using the AWS Glue metastore) even though it is advertised as available in the documentation.
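The "always available, but fails when caching is off" option discussed in this thread could be sketched roughly as follows. All names here (`AlwaysAvailableFlushSketch`, `MetadataCache`, `FlushProcedure`) and the error message are hypothetical illustrations, not the classes or messages used in the PR.

```java
import java.util.Optional;

public class AlwaysAvailableFlushSketch
{
    // Hypothetical cache abstraction; stands in for a caching metastore
    interface MetadataCache
    {
        void flush();
    }

    // The procedure is always registered; the cache is only present when caching is enabled
    static class FlushProcedure
    {
        private final Optional<MetadataCache> cache;

        FlushProcedure(Optional<MetadataCache> cache)
        {
            this.cache = cache;
        }

        void call()
        {
            // Fail with an explicit message instead of hiding the procedure (message text is assumed)
            MetadataCache present = cache.orElseThrow(
                    () -> new IllegalStateException("Cannot flush, metastore cache is not enabled"));
            present.flush();
        }
    }

    public static void main(String[] args)
    {
        boolean[] flushed = {false};
        MetadataCache cache = () -> flushed[0] = true;

        new FlushProcedure(Optional.of(cache)).call();
        System.out.println("flushed: " + flushed[0]);

        try {
            new FlushProcedure(Optional.empty()).call();
        }
        catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With this shape the user who calls the procedure gets an explicit error describing why it cannot run, whereas an unregistered procedure would only produce a generic "procedure not found" failure.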

@ebyhr ebyhr force-pushed the ebi/delta-flush-cache branch from aa7008d to 8e50366 Compare March 10, 2023 00:41
@github-actions github-actions bot added the iceberg Iceberg connector label Mar 10, 2023
@ebyhr ebyhr force-pushed the ebi/delta-flush-cache branch 2 times, most recently from 0e7d6ac to 266aec9 Compare March 13, 2023 23:57
@ebyhr ebyhr self-assigned this Mar 14, 2023
@ebyhr ebyhr force-pushed the ebi/delta-flush-cache branch from 266aec9 to 5a09dfb Compare March 14, 2023 10:20
@ebyhr ebyhr requested a review from findepi March 17, 2023 04:42
@@ -42,6 +42,18 @@
public class DecoratedHiveMetastoreModule
        extends AbstractConfigurationAwareModule
{
    private final boolean installFlushMetadataCacheProcedure;
Member

The flush procedure should be installed whenever caching is supported.

What about changing the parameter name and logic?
For example, we could skip binding CachingHiveMetastoreConfig for Iceberg (and maybe some others).

Member Author

@ebyhr ebyhr Mar 20, 2023

> the flush procedure should be installed whenever caching is supported

This makes it hard to exclude Hive's flush_metadata_cache from the Delta Lake connector. I initially tried to exclude the procedure in DeltaLakeConnector#getProcedures, but the Procedure class doesn't contain catalog or connector information (and MethodHandle isn't well suited to comparing objects).
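The flag-based alternative, deciding at module construction time whether the procedure is registered at all, might look like the sketch below. Only the field name `installFlushMetadataCacheProcedure` comes from the diff. The class `ConditionalProcedureModuleSketch` and the plain `Set<String>` registry are hypothetical simplifications of what the real module does with Guice bindings.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical, Guice-free simplification of a module that conditionally registers a procedure
public class ConditionalProcedureModuleSketch
{
    private final boolean installFlushMetadataCacheProcedure;

    public ConditionalProcedureModuleSketch(boolean installFlushMetadataCacheProcedure)
    {
        this.installFlushMetadataCacheProcedure = installFlushMetadataCacheProcedure;
    }

    // Returns the names of the procedures this module would expose
    public Set<String> procedureNames()
    {
        Set<String> procedures = new LinkedHashSet<>();
        if (installFlushMetadataCacheProcedure) {
            procedures.add("flush_metadata_cache");
        }
        return procedures;
    }

    public static void main(String[] args)
    {
        // A connector that wants the shared procedure passes true
        System.out.println(new ConditionalProcedureModuleSketch(true).procedureNames());
        // A connector that supplies its own procedure, or cannot use it, passes false
        System.out.println(new ConditionalProcedureModuleSketch(false).procedureNames());
    }
}
```

Deciding at registration time sidesteps the problem described above: nothing has to inspect a `Procedure` object afterwards to work out which connector it belongs to.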

            TransactionLogAccess transactionLogAccess)
    {
        this.metastoreFactory = requireNonNull(metastoreFactory, "metastoreFactory is null");
        this.cachingHiveMetastore = requireNonNull(cachingHiveMetastore, "cachingHiveMetastore is null");
Member

would it make sense to try to delegate to Hive's flush procedure?

Member Author

I tried to delegate to the connector at first, but the code was a little redundant. Let me merge as-is. I will take another look later.

import static io.trino.tests.product.utils.QueryExecutors.onTrino;
import static java.lang.String.format;

public class TestDeltaLakeActiveFilesCache
Member

Do we need a product test?
Maybe file:-based tests are enough?

Member Author

FileHiveMetastore has an additional cache, listTablesCache (#13115). Using it here would require either a refactoring to disable that cache (the duration is currently hard-coded) or waiting until the internal cache expires.

@ebyhr ebyhr force-pushed the ebi/delta-flush-cache branch from 7892030 to 118e67d Compare March 20, 2023 03:00
Member Author

ebyhr commented Mar 20, 2023

Rebased on master to resolve conflicts.

@ebyhr ebyhr force-pushed the ebi/delta-flush-cache branch 3 times, most recently from c6ae7f8 to 5bb3280 Compare March 22, 2023 05:07
ebyhr and others added 2 commits March 22, 2023 15:17
The procedure was unusable because Iceberg connector always
disables caching metastore.
Co-Authored-By: Marius Grama <findinpath@gmail.com>
@ebyhr ebyhr force-pushed the ebi/delta-flush-cache branch from 5bb3280 to 3f88633 Compare March 22, 2023 06:17
@ebyhr ebyhr merged commit 1e5774d into trinodb:master Mar 22, 2023
@ebyhr ebyhr deleted the ebi/delta-flush-cache branch March 22, 2023 12:27
@ebyhr ebyhr mentioned this pull request Mar 22, 2023
@github-actions github-actions bot added this to the 411 milestone Mar 22, 2023
Labels
cla-signed delta-lake Delta Lake connector docs hive Hive connector iceberg Iceberg connector

4 participants