cmd: compact: clean partial / marked blocks concurrently #3115
Clean partially uploaded blocks and blocks marked for deletion concurrently with the whole compaction/downsampling process. One iteration could potentially take a few days, so it would be nice to periodically clean unneeded blocks in the background. Without this, there are huge spikes in block storage usage. The spike's size depends on how long it takes to complete one iteration.

The implementation is simple: the deletion part is factored out into a separate function. It is called at the end of an iteration and concurrently if `--wait` has been specified. A mutex protects against concurrent runs. Blocks are removed from the deletion mark map once deleted so that we do not try to delete the same block twice or more.

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
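As a rough illustration of the mutex and deletion-mark-map idea described above, here is a minimal sketch. The `blocksCleaner` type, its fields, and the `remove` callback are hypothetical simplifications for this example and do not reproduce the actual Thanos code, which also takes a `context.Context` and talks to an object storage bucket.

```go
package main

import (
	"fmt"
	"sync"
)

// blocksCleaner is a hypothetical, simplified stand-in for the cleaner
// described above; it is not the Thanos implementation.
type blocksCleaner struct {
	mtx    sync.Mutex            // serializes cleanup runs
	marks  map[string]struct{}   // IDs of blocks marked for deletion
	remove func(id string) error // deletes one block from storage (assumed)
}

// DeleteMarkedBlocks deletes every marked block and forgets each mark once
// its block is gone, so a later run does not try to delete it again.
// It returns how many blocks this particular run deleted.
func (c *blocksCleaner) DeleteMarkedBlocks() (int, error) {
	c.mtx.Lock()
	defer c.mtx.Unlock()

	deleted := 0
	for id := range c.marks {
		if err := c.remove(id); err != nil {
			return deleted, fmt.Errorf("delete block %s: %w", id, err)
		}
		delete(c.marks, id) // removing entries while ranging is allowed in Go
		deleted++
	}
	return deleted, nil
}

func main() {
	calls := 0
	c := &blocksCleaner{
		marks:  map[string]struct{}{"block-a": {}, "block-b": {}},
		remove: func(string) error { calls++; return nil },
	}

	// Simulate the end-of-iteration call racing with the background loop.
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = c.DeleteMarkedBlocks()
		}()
	}
	wg.Wait()

	fmt.Println("remove calls:", calls) // prints "remove calls: 2"
}
```

Whichever goroutine takes the lock first deletes both blocks and clears their marks; the other finds an empty map, so each block is removed exactly once.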
Overall it looks very good! Just one small nit, thanks!
cmd/thanos/compact.go
Outdated
// No need to resync before partial uploads and delete marked blocks. Last sync should be valid.
compact.BestEffortCleanAbortedPartialUploads(ctx, logger, sy.Partial(), bkt, partialUploadDeleteAttempts, blocksCleaned, blockCleanupFailures)
if err := blocksCleaner.DeleteMarkedBlocks(ctx); err != nil {
return errors.Wrap(err, "error cleaning marked blocks")
One nit: would it be better to use "cleaning marked blocks" as the message? The word "error" seems redundant.
Yes, let's do this. Still valid comment 👍
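To show why the "error" prefix reads as redundant, here is a small sketch of the resulting messages. The diff uses `errors.Wrap` from `github.com/pkg/errors`; this example substitutes the standard library's `fmt.Errorf` with `%w`, which produces the same `<message>: <cause>` shape. The `wrap` helper and the `errBucket` cause are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errBucket is a hypothetical underlying failure.
var errBucket = errors.New("bucket unavailable")

// wrap mimics the "<message>: <cause>" format using only the standard
// library; it is a stand-in for errors.Wrap, not the same function.
func wrap(err error, msg string) error {
	return fmt.Errorf("%s: %w", msg, err)
}

func main() {
	// Original wording: the value is already an error, so the word adds nothing.
	fmt.Println(wrap(errBucket, "error cleaning marked blocks"))

	// Suggested wording.
	fmt.Println(wrap(errBucket, "cleaning marked blocks"))

	// Messages chain cleanly when an outer caller wraps again.
	fmt.Println(wrap(wrap(errBucket, "cleaning marked blocks"), "compaction"))
}
```

The last line prints `compaction: cleaning marked blocks: bucket unavailable`, which reads as a chain of actions that failed.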
Awesome, thanks!
Couple of comments but overall LGTM 💪
// since one iteration potentially could take a long time.
if conf.cleanupBlocksInterval > 0 {
g.Add(func() error {
// Wait the whole period at the beginning because we've executed this on boot.
So... why not just remove this and remove the boot-time execution? (: Same stuff, right?
EDIT: actually gave it a second thought. We need to explicitly run it at boot time to make sure that we don't have flaky tests because we depend there on a failure happening and a cleanup. It's not guaranteed to happen if we do everything concurrently.
It sounds wrong that we do more complex code only because we don't want to change tests 🤔
The tests would be much more complex and probably out of the scope of this PR. Actually, it's not just that: I think it's nice that we do this at least once. Imagine a case where someone doesn't use `--wait` and the whole Thanos Compact process ends before the clean-up has happened. Space usage would never go down in the remote object storage even though it could, and the user could then be charged more as a result.
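The pattern discussed in this thread, one cleanup at boot followed by periodic background runs that wait a full interval first, can be sketched roughly as follows. This is a minimal sketch using only the standard library; `cleanupLoop` and `simulate` are hypothetical names, and the real code in the diff uses a run group (`g.Add`) rather than a bare goroutine. Ticks are sent manually here to keep the example deterministic; in a real process they would come from `time.NewTicker(interval).C`, whose first tick arrives only after one full interval.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cleanupLoop calls cleanup once per tick until ctx is cancelled. It never
// runs cleanup on entry, so the first background run waits for the first tick.
func cleanupLoop(ctx context.Context, ticks <-chan time.Time, cleanup func() error) error {
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticks:
			if err := cleanup(); err != nil {
				return err
			}
		}
	}
}

// simulate runs one cleanup at boot and then one per manually sent tick.
// It returns the order in which the cleanups ran.
func simulate(n int) []string {
	var events []string
	phase := "boot"
	cleanup := func() error {
		events = append(events, phase)
		return nil
	}

	// Boot-time run: happens even if the process exits before any tick,
	// for example when it is started without waiting between iterations.
	_ = cleanup()
	phase = "periodic"

	ctx, cancel := context.WithCancel(context.Background())
	ticks := make(chan time.Time) // unbuffered: each send is handed to the loop
	done := make(chan error, 1)
	go func() { done <- cleanupLoop(ctx, ticks, cleanup) }()

	for i := 0; i < n; i++ {
		ticks <- time.Now()
	}
	cancel()
	<-done // wait for the loop to finish before reading events
	return events
}

func main() {
	fmt.Println(simulate(2)) // prints "[boot periodic periodic]"
}
```

With zero ticks the result is still `[boot]`, which is the guarantee argued for above: storage is cleaned at least once per process run.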
Nice, LGTM!
Where are we with this? I want to cut 0.16.0-rc.0 tomorrow 🤗
…d_periodically Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Remove "error" from the `error` and just directly call the function. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
LGTM for now, but I think we could improve a bit in future (: But not a blocker, LGTM!
Thanks 👍
…d_periodically Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Forgot to remove this part while solving conflicts. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
I guess since the approvals are there and I have cleaned up the CHANGELOG.md, I'll merge this. I also ran this for a bit locally with
Changes
Clean partially uploaded blocks and blocks marked for deletion concurrently
with the whole compaction/downsampling process. One iteration could
potentially take a few days, so it would be nice to periodically clean
unneeded blocks in the background. Without this, there are huge spikes
in block storage usage. The spike's size depends on how long it takes to
complete one iteration.
The implementation is simple: the deletion part is factored out
into a separate function. It is called at the end of an iteration and
concurrently if `--wait` has been specified. A mutex protects against
concurrent runs.
Verification
Updated e2e tests.