Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow setting TSDB block duration for receive service #1496

Merged
merged 3 commits into from
Sep 10, 2019

Conversation

fhalim
Copy link
Contributor

@fhalim fhalim commented Sep 5, 2019

Allow setting frequency at which TSDB blocks are created for the receive service.

Changes

Extracted tsdb minBlockDuration and maxBlockDuration used in receive service into flag that defaults to the current value of 2h.

Verification

Built and ran service locally to ensure that the flag takes effect as intended.

@fhalim fhalim changed the title Allow setting TSDB block duration Allow setting TSDB block duration for receive service Sep 5, 2019
Signed-off-by: Fawad Halim <fawad@fawad.net>
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious what's the use case?

This is always tricky as it can have negative effects e.g with upload, query and compact characteristics. Plus usually there is no need to change this. Any particular reason?

If there is use case I am fine with adding this but as Hidden flag maybe (:

PTAL @squat @metalmatze

@@ -60,6 +60,8 @@ func registerReceive(m map[string]setupFunc, app *kingpin.Application, name stri

replicationFactor := cmd.Flag("receive.replication-factor", "How many times to replicate incoming write requests.").Default("1").Uint64()

tsdbBlockDuration := modelDuration(cmd.Flag("tsdb.blockduration", "Duration for local TSDB blocks").Default("2h"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maybe make this hidden as it's not recommended to tweak this? Plus we keep more verbose naming (- between words)

Suggested change
tsdbBlockDuration := modelDuration(cmd.Flag("tsdb.blockduration", "Duration for local TSDB blocks").Default("2h"))
tsdbBlockDuration := modelDuration(cmd.Flag("tsdb.block-duration", "Duration for local TSDB blocks").Hidden().Default("2h"))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use case is that we generate an really high volume of metrics during a business day, so 2 hours of WAL add up to a pretty huge amount of disk space. That translates to expensive persistant volumes, backups etc. For the receive service, I'm planning to use it as a gateway to the object store, so it's desirable to reduce the duration enough that the average WAL backlog is small enough that we can restore it in a reasonable amount of time on failure.

I've marked the flag as hidden, and corrected the naming. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tsdb supports wal compression so maybe having it enabled would solve your use case without changing the block sizes?
In my tests this halves the size of the wal.

https://github.com/prometheus/prometheus/blob/937cc1a52af503d6be93119c6579228e43a60ec0/tsdb/db.go#L85-L86

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Thanos receive should always unconditionally have compression enabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAL compression will definitely help reduce the disk size, reducing the amount of time restores will take. Updating the However, I'd still like to be able to control the maximum amount of time I depend on the block storage device for.

It seems that in Prometheus the max block duration defaults to 10% of the retention period which does not have a lower bound to it (https://github.com/prometheus/prometheus/blob/master/cmd/prometheus/main.go#L317). Is there a reason why that should not be allowed for the receive component?

Signed-off-by: Fawad Halim <fawad@fawad.net>
Signed-off-by: Fawad Halim <fawad@fawad.net>
Copy link
Member

@brancz brancz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm ok with this as a hidden flag.

@brancz brancz merged commit 507195a into thanos-io:master Sep 10, 2019
wbh1 pushed a commit to wbh1/thanos that referenced this pull request Sep 17, 2019
* Allow setting TSDB block duration

Signed-off-by: Fawad Halim <fawad@fawad.net>

* PR feedback: corrected flag naming, made the flag hidden by default

Signed-off-by: Fawad Halim <fawad@fawad.net>

* Enable WALCompression on Receive service

Signed-off-by: Fawad Halim <fawad@fawad.net>
@fhalim fhalim deleted the feature/receive_tsdb_block branch November 15, 2019 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants