Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd: compact: clean partial / marked blocks concurrently #3115

Merged

Conversation

GiedriusS
Copy link
Member

@GiedriusS GiedriusS commented Sep 2, 2020

Changes

Clean partially uploaded and blocks marked for deletion concurrently
with the whole compaction/downsampling process. One iteration could
potentially take a few days so it should be nice to periodically clean
unneeded blocks in the background. Without this, there are huge spikes
in block storage usage. The spike's size depends on how long it takes to
complete one iteration.

The implementation of this is simple - factored out the deletion part
into a separate function. It is called at the end of an iteration +
concurrently if --wait has been specified. Add a mutex to protect from
concurrent runs.

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Verification

Updated e2e tests.

Clean partially uploaded and blocks marked for deletion concurrently
with the whole compaction/downsampling process. One iteration could
potentially take a few days so it should be nice to periodically clean
unneeded blocks in the background. Without this, there are huge spikes
in block storage usage. The spike's size depends on how long it takes to
complete one iteration.

The implementation of this is simple - factored out the deletion part
into a separate function. It is called at the end of an iteration +
concurrently if `--wait` has been specified. Add a mutex to protect from
concurrent runs. Delete blocks from the deletion mark map so that we
wouldn't try to delete same blocks twice or more.

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
@GiedriusS GiedriusS force-pushed the clean_partial_marked_periodically branch 4 times, most recently from 53072a4 to 1fd6a5f Compare September 3, 2020 17:41
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
@GiedriusS GiedriusS force-pushed the clean_partial_marked_periodically branch from 1fd6a5f to 09d60b5 Compare September 3, 2020 19:23
@GiedriusS GiedriusS marked this pull request as ready for review September 4, 2020 12:27
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks very good! Just one small nit, thanks!

// No need to resync before partial uploads and delete marked blocks. Last sync should be valid.
compact.BestEffortCleanAbortedPartialUploads(ctx, logger, sy.Partial(), bkt, partialUploadDeleteAttempts, blocksCleaned, blockCleanupFailures)
if err := blocksCleaner.DeleteMarkedBlocks(ctx); err != nil {
return errors.Wrap(err, "error cleaning marked blocks")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit, is it better to be cleaning marked blocks? error seems redundant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's do this. Still valid comment 👍

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks!

Couple of comments but overall LGTM 💪

// No need to resync before partial uploads and delete marked blocks. Last sync should be valid.
compact.BestEffortCleanAbortedPartialUploads(ctx, logger, sy.Partial(), bkt, partialUploadDeleteAttempts, blocksCleaned, blockCleanupFailures)
if err := blocksCleaner.DeleteMarkedBlocks(ctx); err != nil {
return errors.Wrap(err, "error cleaning marked blocks")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's do this. Still valid comment 👍

cmd/thanos/compact.go Show resolved Hide resolved
cmd/thanos/compact.go Outdated Show resolved Hide resolved
// since one iteration potentially could take a long time.
if conf.cleanupBlocksInterval > 0 {
g.Add(func() error {
// Wait the whole period at the beginning because we've executed this on boot.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... why not just removing this and removing boot time execution? (: Same stuff right?

Copy link
Member Author

@GiedriusS GiedriusS Oct 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: actually gave it a second thought. We need to explicitly run it at boot time to make sure that we don't have flaky tests because we depend there on a failure happening and a cleanup. It's not guaranteed to happen if we do everything concurrently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds wrong that we do more complex code only because we don't want to change tests 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests would be much more complex and probably out of the scope of this PR. Actually, it's not just that, I think it's nice that we do this at least once. Imagine where someone doesn't use --wait and the whole Thanos Compact process ended before the clean-up has happened. Space usage would never go down in the remote object storage even though it could. And the user then could be charged more as a result of this not happening.

cmd/thanos/compact.go Show resolved Hide resolved
Copy link
Member

@onprem onprem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, LGTM!

@bwplotka
Copy link
Member

Where we are with this? I want to cut 0.16.0-rc.0 tomorrow 🤗

…d_periodically

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Remove "error" from the `error` and just directly call the function.

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for now, but I think we could improve a bit in future (: But not a blocker, LGTM!

Thanks 👍

// since one iteration potentially could take a long time.
if conf.cleanupBlocksInterval > 0 {
g.Add(func() error {
// Wait the whole period at the beginning because we've executed this on boot.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds wrong that we do more complex code only because we don't want to change tests 🤔

…d_periodically

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Forgot to remove this part while solving conflicts.

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
@GiedriusS
Copy link
Member Author

I guess since the approvals are there and I have cleaned up the CHANGELOG.md, I'll merge this. I also ran this for a bit locally with --wait-interval=1s --wait --compact.cleanup-interval=1s locally and it seems to work (tm). Will probably deploy this into production quite quickly because the saw-tooth pattern is becoming really spiky due to the fact that one iteration is taking a long time :( I hope that's ok with everyone! Thanks for the reviews ❤️

@GiedriusS GiedriusS merged commit 57076a5 into thanos-io:master Oct 22, 2020
Chans321 pushed a commit to Chans321/thanos that referenced this pull request Oct 23, 2020
)

* cmd: compact: clean partial / marked blocks concurrently

Clean partially uploaded and blocks marked for deletion concurrently
with the whole compaction/downsampling process. One iteration could
potentially take a few days so it should be nice to periodically clean
unneeded blocks in the background. Without this, there are huge spikes
in block storage usage. The spike's size depends on how long it takes to
complete one iteration.

The implementation of this is simple - factored out the deletion part
into a separate function. It is called at the end of an iteration +
concurrently if `--wait` has been specified. Add a mutex to protect from
concurrent runs. Delete blocks from the deletion mark map so that we
wouldn't try to delete same blocks twice or more.

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>

* *: update changelog, e2e tests

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>

* cmd: compact: fix according to comments

Remove "error" from the `error` and just directly call the function.

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>

* CHANGELOG: cleanups

Forgot to remove this part while solving conflicts.

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>

* CHANGELOG: update

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>

* CHANGELOG: clean whitespace

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Signed-off-by: Chans321 <tsschand@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants