[HUDI-4878] Fix incremental cleaner use case #6498
@parisni : hey, I am a bit confused by your example. Can you illustrate which file group each commit is updating, i.e. whether each commit creates a new file group or updates the same file group? If you can clarify that, I will go over the issue again to see if the fix makes sense. |
each commit is a new file group
That's it. Hope this helps.
|
got it, thanks.
Let me go through the example you have put up and clarify a few things.
Say we have 3 committed files in partition-A and we add a new commit in partition-B, and we trigger cleaning for the first time (full partition scan):
```
partition-A/
commit-0 added file1_V1.parquet
commit-1 added file1_V2.parquet
commit-2 added file1_V3.parquet
partition-B/
commit-3 added file2_V1.parquet
```
Say we have KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3. The cleaner will remove the files created by commit-0 and keep 3 commits, i.e. file1_V1.parquet will be cleaned up. But Hudi also keeps track of the `earliest commit retained`, which in this case is commit2. This `earliest commit retained` is what we will leverage later to do incremental cleaning.
For the next cleaning, incremental cleaning will trigger and will comb through all commits >= `earliest commit retained`, i.e. commit2, and so file2_v1 will be deleted this time. It will then update the `earliest commit retained` to commit3.
This may not be applicable with KEEP_LATEST_FILE_VERSIONS, because we can't pinpoint a commit and say everything before that commit can be ignored for future cleaning. Thus, in the case of KEEP_LATEST_FILE_VERSIONS, we can't do incremental cleaning. With this policy we might encounter a corner case where a file group was updated only in Commit 1 and commit and never updated later, and after a long time got a new version in, say, commit 100. We then need to clean up the first version (assuming the KEEP_LATEST_FILE_VERSIONS count is 2).
Let me know if this makes sense. If you still feel I am missing something, can you elaborate with an example?
|
> For the next cleaning, incremental cleaning will trigger, and will comb through all commits >= earliest commit retained i.e. commit2. and so file2_v1 will be deleted this time. and will update the earliest commit retained to commit3 now.
I assume you meant file1_v2?
Let me read the source again; that's not what I understood, nor what I have tested so far.
|
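To make the incremental lookup discussed above concrete, here is a minimal sketch of the idea: only partitions written by commits at or after the `earliest commit retained` are scanned. It is a simplified model, not Hudi's `CleanPlanner` code. The class `IncrementalPartitionLookup`, its nested `Commit` type, and the integer instants are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Hudi code: models which partitions an
// incremental clean would look at, given the earliest commit retained.
class IncrementalPartitionLookup {

    // A commit reduced to an ordered instant and the single partition it wrote to.
    static final class Commit {
        final int instant;
        final String partition;

        Commit(int instant, String partition) {
            this.instant = instant;
            this.partition = partition;
        }
    }

    // Returns the partitions written by commits with instant >= earliestCommitRetained.
    static Set<String> partitionsToScan(List<Commit> timeline, int earliestCommitRetained) {
        Set<String> partitions = new LinkedHashSet<>();
        for (Commit commit : timeline) {
            if (commit.instant >= earliestCommitRetained) {
                partitions.add(commit.partition);
            }
        }
        return partitions;
    }

    // The four commits from the example in the comment above.
    static List<Commit> exampleTimeline() {
        return Arrays.asList(
            new Commit(0, "partition-A"),
            new Commit(1, "partition-A"),
            new Commit(2, "partition-A"),
            new Commit(3, "partition-B"));
    }

    public static void main(String[] args) {
        List<Commit> timeline = exampleTimeline();
        // Earliest commit retained is commit2: both partitions are scanned.
        System.out.println(partitionsToScan(timeline, 2)); // [partition-A, partition-B]
        // Once it moves to commit3, partition-A is no longer scanned.
        System.out.println(partitionsToScan(timeline, 3)); // [partition-B]
    }
}
```

In this model a partition drops out of the scan as soon as the retained instant moves past its last commit, which is the behaviour the thread is debating.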
Well, after reading the source again, in particular this method:
https://github.com/apache/hudi/blob/ca8a57a21d163e573e3a617fd6173fa4b913666c/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L177
it does exactly what the logger says:
```
LOG.info("Incremental Cleaning mode is enabled. Looking up partition-paths that have since changed "
    + "since last cleaned at " + cleanMetadata.getEarliestCommitToRetain()
    + ". New Instant to retain : " + newInstantToRetain);
```
But you were right: newInstantToRetain is `commit2`. In my mind it was the cleaning commit, which would have been `commit4`.
> this may not be applicable w/ KEEP_LATEST_FILE_VERSIONS. bcoz, we can't pin point a commit and say everything before that commit can be ignore for future cleaning. and thus incase of KEEP_LATEST_FILE_VERSIONS, we can't do incremental cleaning
If we pin the cleaning commit (`commit4`), then we can apply incremental cleaning together with `KEEP_LATEST_FILE_VERSIONS`.
|
@nsivabalan I added a (git) commit, following the last comment. |
here is my take: we need to maintain another variable called, let me walk through a scenario.
Commit1 :
Commit2:
Commit3: FG1_V3, FG3_V2, FG4_V1
Commit4: FG1_V4, FG2_V2, FG4_V2, FG5_V1
Commit5: FG2_V3, FG5_V2
For the first two commits, nothing will get cleaned up, and so "lastCompletedCommit" = null.
Just after C3, let's say we trigger cleaning.
Just after C4, we trigger the cleaner again. So, with C4_Clean, we will consider only FG1, FG2, FG4 and FG5 and will eventually clean up FG1_V2. And update
Just after C5, we trigger the cleaner again.
The patch in its current state might miss something: we can't rely on earliestRetainedCommit. For example, just after C2, earliestCommitToRetain is being set to C2, but ideally we should consider lastCompletedCommit from the last time the cleaner got triggered and do incremental polling for newer commits from there. |
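The proposal above can be sketched roughly as follows: each clean remembers the last completed commit it saw, and the next clean only considers file groups written by commits after that point. This is an illustration under stated assumptions, not the patch itself. The class `LastCompletedCommitSketch`, the method `fileGroupsToConsider`, and the integer commit instants are hypothetical. Commit1 and Commit2 are left out because their contents are not given in the comment.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch, not Hudi code: incremental polling driven by the
// last completed commit recorded at the previous clean.
class LastCompletedCommitSketch {

    // Returns the file groups written by commits in (lastCompletedCommit, latestCommit].
    // A null lastCompletedCommit means no earlier clean recorded one, so every
    // commit up to latestCommit is considered.
    static Set<String> fileGroupsToConsider(SortedMap<Integer, List<String>> timeline,
                                            Integer lastCompletedCommit,
                                            int latestCommit) {
        SortedMap<Integer, List<String>> window = (lastCompletedCommit == null)
            ? timeline.headMap(latestCommit + 1)
            : timeline.subMap(lastCompletedCommit + 1, latestCommit + 1);
        Set<String> fileGroups = new TreeSet<>();
        for (List<String> written : window.values()) {
            fileGroups.addAll(written);
        }
        return fileGroups;
    }

    // Commit3 to Commit5 from the scenario above, keyed by commit number.
    static SortedMap<Integer, List<String>> exampleTimeline() {
        SortedMap<Integer, List<String>> timeline = new TreeMap<>();
        timeline.put(3, Arrays.asList("FG1", "FG3", "FG4"));
        timeline.put(4, Arrays.asList("FG1", "FG2", "FG4", "FG5"));
        timeline.put(5, Arrays.asList("FG2", "FG5"));
        return timeline;
    }

    public static void main(String[] args) {
        SortedMap<Integer, List<String>> timeline = exampleTimeline();
        // Clean after C4, previous clean recorded C3 as last completed commit.
        System.out.println(fileGroupsToConsider(timeline, 3, 4)); // [FG1, FG2, FG4, FG5]
        // Clean after C5, previous clean recorded C4.
        System.out.println(fileGroupsToConsider(timeline, 4, 5)); // [FG2, FG5]
    }
}
```

The first call reproduces the set named in the comment for C4_Clean (FG1, FG2, FG4 and FG5). The second call is only what this sketch yields for C5; the comment does not spell that case out.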
Let me know if I am making sense. We can have a sync call too, since this is dragging a bit. I wanted to land this in the next few days. |
@parisni : hey, hi. We have a code freeze coming up in a week's time for 0.12.1. Just wanted to keep you informed. |
054e2a5 to 3c05d0a
@codope: Can you review this patch? I have overhauled the initial fix that was put up. It could result in a good perf improvement for cleaning. I am yet to write tests, but do take a look at my logic and let me know if it looks ok, or if there is any case that I could be missing. |
058f2ae to 32826d6
Agreed thanks |
@parisni : can you also review the patch? Does this look ok? Maybe you can try it in your staging pipeline and let us know if things are working as expected (i.e. incremental cleaning kicks in for cleaning based on file versions). |
Also, I tested both policies locally, KEEP_LATEST_COMMITS and KEEP_LATEST_FILE_VERSIONS, and this looks good to me.
Thanks!
Good idea to use lastCompletedCommitTimestamp in clean metadata! Left a few comments. Would be great if there is a unit test to run through the scenario discussed in previous comments.
32826d6 to 9dc742b
public void testKeepLatestFileVersions() throws Exception {
@ParameterizedTest
@ValueSource(Boolean.class)
public void testKeepLatestFileVersions(boolean ) throws Exception {
Missing an argument here? Did you intend to test with and without incremental mode enabled?
9dc742b to 3f29d7a
I was working on the same idea before finding this PR. There is a case not taken into account in this PR. Let's assume we have these configs:
if in the next clean we use a So to decide if we can use the incremental cleaning or we need to run a brute force check, we need to save the |
Change Logs
Describe context and summary for this change. Highlight if any code was copied.
Impact
Currently incremental cleaning is run for both the KEEP_LATEST_COMMITS and KEEP_LATEST_BY_HOURS policies. It is not run for KEEP_LATEST_FILE_VERSIONS.
This can lead to files not being cleaned. This PR fixes this problem by enabling incremental cleaning for KEEP_LATEST_FILE_VERSIONS only.
Here is the scenario of the problem:
Say we have 3 committed files in partition-A and we add a new commit in partition-B, and we trigger cleaning for the first time (full partition scan):
With KEEP_LATEST_COMMITS and CLEANER_COMMITS_RETAINED=3, the cleaner will remove commit-0.parquet to keep 3 commits.
For the next cleaning, incremental cleaning will trigger and won't consider partition-A/ until a new commit changes it. If no later commit changes partition-A, then commit-1.parquet will stay forever, although it should be removed by the cleaner.
Now, in the case of KEEP_LATEST_FILE_VERSIONS, the cleaner will only keep commit-2.parquet. It then makes sense that incremental cleaning won't consider partition-A until it is changed, because there is only one commit left.
This is why incremental cleaning should only be enabled with KEEP_LATEST_FILE_VERSIONS.
Hope this is clear enough.
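As a rough illustration of the KEEP_LATEST_FILE_VERSIONS behaviour described in this scenario, the sketch below keeps only the newest N versions of one file group and returns the rest for deletion. It is a simplified model, not Hudi's implementation. The class `KeepLatestFileVersionsSketch` and the method `versionsToClean` are hypothetical names, and a retained count of 1 is assumed to match the scenario.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Hudi code: version-based retention for one file group.
class KeepLatestFileVersionsSketch {

    // Given the versions of one file group ordered oldest first, returns the
    // versions to delete so that only the newest versionsRetained remain.
    static List<String> versionsToClean(List<String> versionsOldestFirst, int versionsRetained) {
        int deleteCount = Math.max(0, versionsOldestFirst.size() - versionsRetained);
        return new ArrayList<>(versionsOldestFirst.subList(0, deleteCount));
    }

    public static void main(String[] args) {
        List<String> partitionA = Arrays.asList(
            "commit-0.parquet", "commit-1.parquet", "commit-2.parquet");
        // Assumed retained count of 1: only commit-2.parquet is kept.
        System.out.println(versionsToClean(partitionA, 1)); // [commit-0.parquet, commit-1.parquet]
    }
}
```

After such a pass nothing older remains in partition-A, which is the argument made above for why skipping the partition until it changes is safe under this policy.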
Risk level: none | low | medium | high
Choose one. If medium or high, explain what verification was done to mitigate the risks.
Contributor's checklist