Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Register VACUUM operations in the delta log #24331

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mdesmet
Copy link
Contributor

@mdesmet mdesmet commented Dec 2, 2024

Description

Fixes #24293

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(*) Release notes are required, with the following suggested text:

## Section
* Added VACUUM START AND VACUUM END logging for Delta Lake `vacuum` procedure.

@cla-bot cla-bot bot added the cla-signed label Dec 2, 2024
@github-actions github-actions bot added the delta-lake Delta Lake connector label Dec 2, 2024
@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch 2 times, most recently from aa39226 to 2008d86 Compare December 2, 2024 14:26
@findinpath
Copy link
Contributor

In the description "Solves" -> "Fixes".

return vacuumLoggingEnabled;
}

@Config("delta.vacuum.logging.enabled")
Copy link
Contributor

@findinpath findinpath Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a specific configuration for this ?

I've landed on delta-io/delta@3ac120f
Maybe add this commit to description of the PR for reference.

BTW Trino does support concurrent writes on all the file systems. (Databricks has not out of the box support for AWS S3, while Trino does, which means there's a potential to cause eventually table corruption in a concurrent setting).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ebyhr : Pleae comment. I think the reasoning is that in Spark this is configurable and disabled by default.

Copy link
Member

@ebyhr ebyhr Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested adding this property for compatibility with Spark and also a kill-switch.

The current implementation still has possibility of conflicts. For instance, the logging may fail if there is a concurrent operation because we don't have a retry code in vacuum logic.

return vacuumLoggingEnabled;
}

@Config("delta.vacuum.logging.enabled")
Copy link
Member

@ebyhr ebyhr Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested adding this property for compatibility with Spark and also a kill-switch.

The current implementation still has possibility of conflicts. For instance, the logging may fail if there is a concurrent operation because we don't have a retry code in vacuum logic.

@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch from 2008d86 to 1ae966f Compare December 3, 2024 07:07
@mdesmet mdesmet marked this pull request as ready for review December 3, 2024 13:41
@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch from 1ae966f to 868ae1a Compare December 3, 2024 13:45
@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch 3 times, most recently from 722a27f to 44324e9 Compare December 4, 2024 08:47
@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch 3 times, most recently from c6fae4a to 6c315fa Compare December 5, 2024 10:03
@mdesmet mdesmet requested review from findinpath and ebyhr December 5, 2024 10:05
@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch from 6c315fa to 2b0f6d5 Compare December 5, 2024 10:07
@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch from 2b0f6d5 to 0806741 Compare December 6, 2024 06:46
@mdesmet
Copy link
Contributor Author

mdesmet commented Dec 6, 2024

There is funny failure. Why the sizeOfDataToDelete is different when running in CI vs running locally. Shouldn't it be expected to be deteriministic?

-- failure 1 --
[Rows for query [SELECT version, operation, MAP_FILTER(operation_parameters, (k, v) -> k <> 'queryId'), operation_metrics FROM "test_vacuum4b6nq4t53e$history"]] 
Expecting actual:
  (0, CREATE TABLE AS SELECT, {}, {}), (1, MERGE, {}, {}), (2, VACUUM START, {}, {sizeOfDataToDelete=0, numFilesToDelete=0}), (3, VACUUM END, {status=COMPLETED}, {numDeletedFiles=0, numVacuumedDirectories=1}), (4, VACUUM START, {}, {sizeOfDataToDelete=5012, numFilesToDelete=5}), (5, VACUUM END, {status=COMPLETED}, {numDeletedFiles=5, numVacuumedDirectories=1})
to contain exactly in any order:
  [(0, CREATE TABLE AS SELECT, {}, {}),
    (1, MERGE, {}, {}),
    (2, VACUUM START, {}, {sizeOfDataToDelete=0, numFilesToDelete=0}),
    (3, VACUUM END, {status=COMPLETED}, {numDeletedFiles=0, numVacuumedDirectories=1}),
    (4, VACUUM START, {}, {sizeOfDataToDelete=5025, numFilesToDelete=5}),
    (5, VACUUM END, {status=COMPLETED}, {numDeletedFiles=5, numVacuumedDirectories=1})]
elements not found:
  (4, VACUUM START, {}, {sizeOfDataToDelete=5025, numFilesToDelete=5})
and elements not expected:
  (4, VACUUM START, {}, {sizeOfDataToDelete=5012, numFilesToDelete=5})
at QueryAssertions$ResultAssert.lambda$matches$0(QueryAssertions$ResultAssert.java:740)

@pajaks pajaks self-requested a review December 6, 2024 08:28
@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch 2 times, most recently from 0dee91e to 31bc519 Compare December 9, 2024 08:48
@mdesmet
Copy link
Contributor Author

mdesmet commented Dec 9, 2024

As discussed with @ebyhr offline, made use of isGreaterThanOrEqualTo.

There is funny failure. Why the sizeOfDataToDelete is different when running in CI vs running locally. Shouldn't it be expected to be deteriministic?

@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch from 31bc519 to f4cf893 Compare December 9, 2024 10:14
@ebyhr
Copy link
Member

ebyhr commented Dec 10, 2024

Could you confirm CI failures?

@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch from f4cf893 to e65ea00 Compare December 10, 2024 04:24
@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch 3 times, most recently from ce141ba to e94c13a Compare December 15, 2024 03:35
@@ -1746,6 +1750,34 @@ public void testVacuum()
assertThat(getActiveFiles(tableName)).isEqualTo(updatedFiles);
// old files should be cleaned up
assertThat(getAllDataFilesFromTableDirectory(tableName)).isEqualTo(updatedFiles);
// operations should be logged
assertThat(query("SELECT version, operation, MAP_FILTER(operation_parameters, (k, v) -> k <> 'queryId') FROM \"" + tableName + "$history\"")).matches( """
VALUES\s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove redundant \s. Same for other places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line still has redundant \s.

@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch from e94c13a to 68c1e7c Compare December 17, 2024 12:57
@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch from 68c1e7c to 24955a0 Compare December 18, 2024 02:58
@mdesmet mdesmet requested review from pajaks and ebyhr December 18, 2024 05:10
@mdesmet mdesmet force-pushed the feat/register-vacuum-delta-log branch from 24955a0 to aad54e9 Compare December 19, 2024 06:50
@mdesmet mdesmet requested a review from ebyhr December 19, 2024 09:30
}

@Test
public void testVacuumWithParquetTransactionLogging()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet is used for both data files and checkpoints. Please rename to testVacuumTransactionLoggingOnCheckpoint or something.

testVacuumWithCheckPointInterval(true);
}

private void testVacuumWithCheckPointInterval(boolean isCheckPointInterval) throws InterruptedException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put throws InterruptedException and { on new lines.

@@ -1746,6 +1750,34 @@ public void testVacuum()
assertThat(getActiveFiles(tableName)).isEqualTo(updatedFiles);
// old files should be cleaned up
assertThat(getAllDataFilesFromTableDirectory(tableName)).isEqualTo(updatedFiles);
// operations should be logged
assertThat(query("SELECT version, operation, MAP_FILTER(operation_parameters, (k, v) -> k <> 'queryId') FROM \"" + tableName + "$history\"")).matches( """
VALUES\s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line still has redundant \s.

@@ -1746,6 +1761,34 @@ public void testVacuum()
assertThat(getActiveFiles(tableName)).isEqualTo(updatedFiles);
// old files should be cleaned up
assertThat(getAllDataFilesFromTableDirectory(tableName)).isEqualTo(updatedFiles);
// operations should be logged
assertThat(query("SELECT version, operation, MAP_FILTER(operation_parameters, (k, v) -> k <> 'queryId') FROM \"" + tableName + "$history\"")).matches( """
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow a text block format of #23958.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector
Development

Successfully merging this pull request may close these issues.

Register VACUUM operations in the delta log in Delta Lake connector
4 participants