Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-4878] Fix incremental cleaner use case #6498

Closed

Conversation

parisni
Copy link
Contributor

@parisni parisni commented Aug 25, 2022

Change Logs

Describe context and summary for this change. Highlight if any code was copied.

Impact

Currently incremental cleaning is run for both KEEP_LATEST_COMMITS, KEEP_LATEST_BY_HOURS
policies. It is not run when KEEP_LATEST_FILE_VERSIONS.

This can lead to not cleaning files. This PR fixes this problem by enabling incremental cleaning for KEEP_LATEST_FILE_VERSIONS only.

Here is the scenario of the problem:

Say we have 3 committed files in partition-A and we add a new commit in partition-B, and we trigger cleaning for the first time (full partition scan):

partition-A/
commit-0.parquet
commit-1.parquet
commit-2.parquet
partition-B/
commit-3.parquet

In the case say we have KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3, the cleaner will remove the commit-0.parquet to keep 3 commits.
For the next cleaning, incremental cleaning will trigger, and won't consider partition-A/ until a new commit change it. In case no later commit changes partition-A then commit-1.parquet will stay forever. However it should be removed by the cleaner.

Now if in case of KEEP_LATEST_FILE_VERSIONS, the cleaner will only keep commit-2.parquet. Then it makes sense that incremental cleaning won't consider partition-A until it is changed. Because there is only one commit.

This is why incremental cleaning should only be enabled with KEEP_LATEST_FILE_VERSIONS

Hope this is clear enough

Risk level: none | low | medium | high

Choose one. If medium or high, explain what verification was done to mitigate the risks.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan self-assigned this Sep 1, 2022
@nsivabalan nsivabalan added priority:critical production down; pipelines stalled; Need help asap. table-service labels Sep 1, 2022
@nsivabalan
Copy link
Contributor

nsivabalan commented Sep 1, 2022

@parisni : hey I am bit confused on your example. Can you illustrate which file group each commit is updating. whether each commit is a new file group or updating the same file group. If you can clarify that, I will go over the issue again to see if the fix makes sense.
really appreciate you putting up the fix.

@parisni
Copy link
Contributor Author

parisni commented Sep 1, 2022 via email

@nsivabalan
Copy link
Contributor

got it, thanks.
Let me go through the example you have put up and clarify few things.

Say we have 3 committed files in partition-A and we add a new commit in partition-B, and we trigger cleaning for the first time (full partition scan):

partition-A/
commit-0 added file1_V1.parquet
commit-1. added file1_V2.parquet
commit-2 added file1_V3.parquet
partition-B/
commit-3 added file2_V1.parquet

In the case say we have KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3, the cleaner will remove the files created by commit-0 and keep 3 commits. ie. file1_V1.parquet will be cleaned up. But hudi also keeps track of earliest commit retained in this case which is commit2. This earliest commit retained is the one we will leverage later to do incremental cleaning.

For the next cleaning, incremental cleaning will trigger, and will comb through all commits >= earliest commit retained i.e. commit2. and so file2_v1 will be deleted this time. and will update the earliest commit retained to commit3 now.

this may not be applicable w/ KEEP_LATEST_FILE_VERSIONS. bcoz, we can't pin point a commit and say everything before that commit can be ignore for future cleaning. and thus incase of KEEP_LATEST_FILE_VERSIONS, we can't do incremental cleaning. It is in this policy, we might encounter a corner case where, a file group was updated only in Commit 1 and commit and never updated later. and after a long time, had a new version in say commit 100. we need to clean up the first version (assuming KEEP_LATEST_FILE_VERSIONS count is 2).

Let me know if this makes sense. or if you still feel, I am missing something, can you elaborate w/ an example.

@parisni
Copy link
Contributor Author

parisni commented Sep 5, 2022 via email

@parisni
Copy link
Contributor Author

parisni commented Sep 5, 2022 via email

@parisni
Copy link
Contributor Author

parisni commented Sep 5, 2022

@nsivabalan added a (git) commit, according to last comment

@nsivabalan
Copy link
Contributor

here is my take:
don't think we can go w/ this fix as is. but we can try something else.

we need to maintain another variable called, lastCompletedCommit whenever a cleaner kicks in. so, it will be added as part of clean commit metadata. infact we landed a patch 2 days back that adds this. #5478

let me walk through a scenario.
cleaner policy: based on latest file versions.
count: 2

Commit1 :
FG1_V1, FG2_V2

Commit2:
FG1_V2, FG3_V1

Commit3: FG1_V3, FG3_V2, FG4_V1

Commit4: FG1_V4, FG2_V2, FG4_V2, FG5_V1

Commit5: FG2_V3, FG5_V2

for first two commits, nothing will get cleaned up. and so "lastCompletedCommit" = null.

just after C3, lets say we trigger cleaning.
C3_C
FG_1_V1 will be cleaned up.
and will mark lastCompletedCommit = C3.

just after C4, we trigger cleaner again.
we can just look at file groups that got newer versions after C3 until now. in this case, just in C4.
So, while planning we can consider only file groups FG1, FG2, FG4 and FG5. reasoning is, those file groups which was never touched after last clean may not reach the max file version threshold. bcoz, they did not receive any updates only.

So, with C4_Clean, we will consider only FG1, FG2, FG4 and FG5 and eventually will clean up FG1_V2.

And update lastCompletedCommit = C4.

just after C5, we trigger cleaner again.
will consider only file groups updated after C4 (lastCompletedCommit from last clean). which is FG2, FG5.
and will result in cleaning up FG2_V1. since FG1, FG3 nor FG4 was touched, we don't need to even consider them while preparing the clean plan.

the current patch in its current state might miss out something. we can't rely on earlistRetainedCommit. For eg, just after C2, earlistCommitToRetain is being set to C2. but ideally we should consider lastCompletedCommit from the last time cleaner got triggered and do incremental polling for newer commits from there.

@nsivabalan
Copy link
Contributor

let me know if I am making sense. we can have a sync call too since this is dragging a bit. wanted to land in the next few days.

@nsivabalan nsivabalan changed the title Fix incremental cleaner use case [HUDI-4878] Fix incremental cleaner use case Sep 19, 2022
@nsivabalan
Copy link
Contributor

@parisni : hey hi. we have a code freeze coming up in a weeks time for 0.12.1. Just wanted to keep you informed.

@nsivabalan nsivabalan force-pushed the feat/hudi-clean-fix-incremental branch from 054e2a5 to 3c05d0a Compare September 21, 2022 23:08
@nsivabalan
Copy link
Contributor

@codope: Can you review this patch. I have overhauled the initial fix put up. But could result in good perf improv for cleaning. I am yet to write tests. but do take a look at my logic and let me know if it looks ok. or is there any case that I could be missing.

@nsivabalan nsivabalan force-pushed the feat/hudi-clean-fix-incremental branch 2 times, most recently from 058f2ae to 32826d6 Compare September 22, 2022 06:20
@parisni
Copy link
Contributor Author

parisni commented Sep 22, 2022

the current patch in its current state might miss out something. we can't rely on earlistRetainedCommit. For eg, just after C2, earlistCommitToRetain is being set to C2. but ideally we should consider lastCompletedCommit from the last time cleaner got triggered and do incremental polling for newer commits from there.

Agreed thanks

@nsivabalan nsivabalan added priority:blocker and removed priority:critical production down; pipelines stalled; Need help asap. labels Sep 22, 2022
@nsivabalan
Copy link
Contributor

@parisni : can you also review the patch. does this look ok. may be you can try it in your staging pipeline and let us know if things are working as expected (i.e. incremental cleaning kicks in for cleaning based on file versions)

Copy link
Contributor Author

@parisni parisni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ALso I tested locally both policies KEEP_LATEST_COMMITS and KEEP_LATEST_FILE_VERSIONS and this looks good to me.
Thanks !

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea to use lastCompletedCommitTimestamp in clean metadata! Left a few comments. Would be great if there is a unit test to run through the scenario discussed in previous comments.

@yihua yihua assigned parisni and unassigned nsivabalan and codope Sep 26, 2022
@yihua yihua removed the request for review from codope September 26, 2022 17:55
@nsivabalan nsivabalan force-pushed the feat/hudi-clean-fix-incremental branch from 32826d6 to 9dc742b Compare September 26, 2022 21:07
public void testKeepLatestFileVersions() throws Exception {
@ParameterizedTest
@ValueSource(Boolean.class)
public void testKeepLatestFileVersions(boolean ) throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing an argument here? did you intend to test with and without incremental mode enabled?

@codope codope force-pushed the feat/hudi-clean-fix-incremental branch from 9dc742b to 3f29d7a Compare September 27, 2022 11:47
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan added priority:critical production down; pipelines stalled; Need help asap. and removed priority:blocker labels Sep 27, 2022
@nsivabalan nsivabalan added priority:major degraded perf; unable to move forward; potential bugs and removed priority:critical production down; pipelines stalled; Need help asap. labels Oct 20, 2022
@nsivabalan nsivabalan added the pr:wip Work in Progress/PRs label Nov 2, 2022
@hussein-awala
Copy link
Member

I was working on the same idea before finding this PR. There is a case not taken into account in this PR.

Let's assume we have these configs:

  • the cleaning policy is KEEP_LATEST_FILE_VERSIONS
  • the incremental cleaning is activated
  • and the cleaner.fileversions.retained is X

if in the next clean we use a cleaner.fileversions.retained >= X, everything is fine, we just need to check the changed partitions since the last clean and clean them. But if in the next clean we change the cleaner.fileversions.retained to any value < X, we will not be able to use the incremental cleaning, and we will need to run the method getPartitionPathsForFullCleaning to recheck all the partitions.

So to decide if we can use the incremental cleaning or we need to run a brute force check, we need to save the cleaner.fileversions.retained value used in the last clean in the clean commit avro file.

@parisni parisni closed this Jan 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr:wip Work in Progress/PRs priority:major degraded perf; unable to move forward; potential bugs table-service
Projects
Status: 🚧 Needs Repro
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

5 participants