
Merge branch 'main' into feature/remove-old-ui
Nexucis committed Feb 23, 2022
2 parents 852eac6 + 06c1a6a commit 82925cf
Showing 27 changed files with 700 additions and 178 deletions.
10 changes: 5 additions & 5 deletions .busybox-versions
@@ -1,6 +1,6 @@
# Auto generated by busybox-updater.sh. DO NOT EDIT
amd64=768a51a5f71827471e6e58f0d6200c2fa24f2cb5cde1ecbd67fe28f93d4ef464
arm64=042d6195e1793b226d1632117cccb4c4906c8ab393b8b68328ad43cf59c64f9d
arm=239809417d1e79388ae1bdb59c167d86f18ebaad37dafb5a93d241fe3c79b0df
ppc64le=f30732299f06265688d63a454723a0d718c7509f51b0dacb9bf7f58388bb32b2
s390x=97babce614354ac9a263fa7c8e48a5b062318a9ae77f6c31179bf6fb2200106f
amd64=42977f138f0655240a4bd4aed4fe1731cff3bc57077ff695ea7cd4653fc1c6e6
arm64=2f0470d84715de55c3446dd074e954b8d84f887c16fd0bb2de54b3734ba5ae83
arm=5fb75cf689dcccfc5198aa4cbd7ecf04bc7e44e74220154b4f0f75a7c855318f
ppc64le=a9a9102107c48b12f1e31e722a26d6ad985111b9840d0f72b92e1cce815b83f7
s390x=9f6a7897398d997568a69d3c5badc9cdc75c71cd0aedc497571e5c6e9635e7db
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,14 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re

## Unreleased

### Fixed

### Changed

### Removed

## [v0.25.0 - <in progress>](https://github.com/thanos-io/thanos/tree/release-0.25)

### Added

- [#5110](https://github.com/thanos-io/thanos/pull/5110) Block: Do not upload DebugMeta files to obj store.
@@ -28,6 +36,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re
- [#4667](https://github.com/thanos-io/thanos/pull/4667) Add a pure aws-sdk auth for s3 storage.
- [#5111](https://github.com/thanos-io/thanos/pull/5111) Add matcher support to Query Rules endpoint.
- [#5117](https://github.com/thanos-io/thanos/pull/5117) Bucket replicate: Added flag `--ignore-marked-for-deletion` to avoid replication of blocks with the deletion mark.
- [#5148](https://github.com/thanos-io/thanos/pull/5148) Receive: Add tenant tag for tracing spans.

## Changed
- [#5145](https://github.com/thanos-io/thanos/pull/5145) UI: Remove old UI.
@@ -36,6 +45,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re

### Fixed

- [#5102](https://github.com/thanos-io/thanos/pull/5102) Fix: Filter block rows in bucket UI according to searched block ID
- [#5051](https://github.com/thanos-io/thanos/pull/5051) Prober: Remove spam of changing probe status.
- [#4918](https://github.com/thanos-io/thanos/pull/4918) Tracing: Fixing force tracing with Jaeger.
- [#4879](https://github.com/thanos-io/thanos/pull/4879) Bucket verify: Fixed bug causing wrong number of blocks to be checked.
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.25.0-dev
0.26.0-dev
Binary file added docs/img/compaction_progress_metrics.png
55 changes: 55 additions & 0 deletions docs/operating/compactor-backlog.md
@@ -0,0 +1,55 @@
# Troubleshoot Compactor Backlog

The compactor is one of the most important components in Thanos. It is responsible for compaction, downsampling, and retention of the data in object storage.

When your system contains many block producers (Sidecar, Rule, Receiver, etc.) or operates at a large scale, the compactor might not be able to keep up with the rate at which data is produced and will fall behind, building up a backlog of work. This document helps you troubleshoot a compaction backlog and explains how to scale the compactor.

## Detect the backlog

Self-monitoring of the monitoring system is important. We highly recommend setting up the Thanos Grafana dashboards and alerts to monitor the Thanos components; without self-monitoring, it is hard to detect and fix issues like a compactor backlog.

If you see these symptoms in your Thanos deployment, your compactor might be backlogged and may need to be scaled up:
1. Long-term queries in your Grafana dashboards are much slower than usual.
2. The compactor stops downsampling blocks and the `thanos_compact_downsample_total` metric doesn't increase.
3. The compactor stops applying retention. If the retention period is set to 1 month but you can still see older blocks in the bucket UI, this is the case.
4. The `thanos_compact_iterations_total` metric doesn't increase, or its rate is very low.

In the current implementation, the compactor performs the compaction, downsampling, and retention phases in that order, which means that if the compaction work is not finished, the downsampling and retention phases won't start. This is why you see symptoms 2 and 3 above.

For symptom 4, if the `thanos_compact_iterations_total` metric doesn't increase, the compactor is stuck working on its current compaction iteration and cannot finish it for a long time. The situation is very similar to a message queue: the producers are the components that upload blocks to your object storage, and the compactor is the consumer that processes the jobs they create. If the data production rate is higher than the consumer's processing rate, the compactor falls behind.

### Compactor progress metrics

Since the Thanos v0.24 release, four new metrics, `thanos_compact_todo_compactions`, `thanos_compact_todo_compaction_blocks`, `thanos_compact_todo_downsample_blocks`, and `thanos_compact_todo_deletion_blocks`, show the compaction, downsampling, and retention progress and backlog:

- `thanos_compact_todo_compactions`: the number of planned compactions.
- `thanos_compact_todo_compaction_blocks`: the number of blocks that are planned to be compacted.
- `thanos_compact_todo_downsample_blocks`: the number of blocks that are queued to be downsampled.
- `thanos_compact_todo_deletion_blocks`: the number of blocks that are queued for retention.

For example, you can use `sum(thanos_compact_todo_compactions)` to get the overall compaction backlog, or `sum(thanos_compact_todo_compactions) by (group)` to see which compaction group is the furthest behind.

![compaction-progress](../img/compaction_progress_metrics.png)

This feature works by syncing block metadata from the object storage every 5 minutes by default and then simulating the compaction planning process to calculate how much work remains. You can change the default 5m interval with the `compact.progress-interval` flag, or disable the feature by setting `compact.progress-interval=0`.
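
For illustration, a minimal sketch of tuning the interval (the object store config path below is an assumed placeholder; adjust it to your deployment):

```bash
# Sketch: recompute the progress metrics every 10 minutes instead of the default 5m.
thanos compact \
  --wait \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --compact.progress-interval=10m

# Setting the interval to 0 disables the progress calculation entirely:
# thanos compact --wait --compact.progress-interval=0 ...
```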

## Solutions

### Scale the compactor

To prevent the compactor from falling behind, you can scale it vertically or horizontally.

To scale the compactor vertically, you can give it more CPU and memory resources. It also has two flags, `compact.concurrency` and `downsample.concurrency`, that control the number of workers performing compaction and downsampling.
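
As a sketch (the concurrency values and object store config path are illustrative assumptions, not recommendations):

```bash
# Sketch: run 4 concurrent compaction workers and 2 concurrent downsampling workers.
thanos compact \
  --wait \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --compact.concurrency=4 \
  --downsample.concurrency=2
```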

To scale the compactor horizontally, you can run multiple compactor instances with different `min-time` and `max-time` flags so that each compactor only works on the blocks within its time range.
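
A sketch of such a time-based split, assuming two instances and a 4-week boundary (the boundary and config path are illustrative):

```bash
# Sketch: instance 1 compacts only blocks newer than 4 weeks.
thanos compact --wait \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --min-time=-4w

# Sketch: instance 2 compacts only blocks older than 4 weeks.
thanos compact --wait \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --max-time=-4w
```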

Another way of scaling horizontally is to shard the compactors. You can use a relabel config to shard them by external labels such as `cluster`, so that blocks with the same external label are always processed by the same compactor.
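
A minimal sketch, assuming your blocks carry a `cluster` external label and using the compactor's `--selector.relabel-config-file` flag (the label value and file paths are placeholders):

```bash
# Sketch: this instance keeps only blocks whose external label cluster="eu-1".
# Run additional instances with other regexes (e.g. "us-1") to cover the rest.
cat > /etc/thanos/compactor-shard.yml <<'EOF'
- action: keep
  source_labels: [cluster]
  regex: eu-1
EOF

thanos compact --wait \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --selector.relabel-config-file=/etc/thanos/compactor-shard.yml
```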

### Bucket tools

There are some bucket tools that can help you troubleshoot and resolve this issue.

You can use `thanos tools bucket ls`, `thanos tools bucket inspect`, and the `thanos tools bucket web` UI to view your current blocks.
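
For example (a sketch; the object store config path is a placeholder):

```bash
# Sketch: list, inspect, and browse the blocks currently in the bucket.
thanos tools bucket ls --objstore.config-file=/etc/thanos/objstore.yml
thanos tools bucket inspect --objstore.config-file=/etc/thanos/objstore.yml
thanos tools bucket web --objstore.config-file=/etc/thanos/objstore.yml
```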

If many blocks are older than the retention period and the compactor is not applying retention, you can clean them up as follows (a command sketch follows this list):
1. Stop the currently running compactor instances to avoid a data race.
2. Run the `thanos tools bucket retention` command to apply retention to your blocks directly. This step won't delete your blocks right away; it only adds deletion markers to the blocks that are past retention.
3. Restart the compactor instances; they will clean up the blocks with deletion markers after the delete delay (2 days by default).
4. Alternatively, instead of restarting the compactor instances, run `thanos tools bucket cleanup --delete-delay=0s` to delete the marked blocks right away.
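
The sketch below corresponds to steps 2 and 4 (the object store config path is a placeholder):

```bash
# Sketch, step 2: add deletion markers to blocks that are past retention.
thanos tools bucket retention --objstore.config-file=/etc/thanos/objstore.yml

# Sketch, step 4: remove the marked blocks immediately instead of waiting for the delete delay.
thanos tools bucket cleanup --delete-delay=0s --objstore.config-file=/etc/thanos/objstore.yml
```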
4 changes: 2 additions & 2 deletions examples/alerts/alerts.md
@@ -196,8 +196,8 @@ rules:
severity: warning
- alert: ThanosRuleNoEvaluationFor10Intervals
annotations:
description: Thanos Rule {{$labels.job}} has {{$value | humanize}}% rule groups
that did not evaluate for at least 10x of their expected interval.
description: Thanos Rule {{$labels.job}} has rule groups that did not evaluate
for at least 10x of their expected interval.
runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulenoevaluationfor10intervals
summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
expr: |
4 changes: 2 additions & 2 deletions examples/alerts/alerts.yaml
@@ -511,8 +511,8 @@ groups:
severity: warning
- alert: ThanosRuleNoEvaluationFor10Intervals
annotations:
description: Thanos Rule {{$labels.job}} has {{$value | humanize}}% rule groups
that did not evaluate for at least 10x of their expected interval.
description: Thanos Rule {{$labels.job}} has rule groups that did not evaluate
for at least 10x of their expected interval.
runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulenoevaluationfor10intervals
summary: Thanos Rule has rule groups that did not evaluate for 10 intervals.
expr: |
2 changes: 1 addition & 1 deletion mixin/alerts/rule.libsonnet
@@ -168,7 +168,7 @@
// NOTE: This alert will give false positive if no rules are configured.
alert: 'ThanosRuleNoEvaluationFor10Intervals',
annotations: {
description: 'Thanos Rule {{$labels.job}}%s has {{$value | humanize}}%% rule groups that did not evaluate for at least 10x of their expected interval.' % location,
description: 'Thanos Rule {{$labels.job}}%s has rule groups that did not evaluate for at least 10x of their expected interval.' % location,
summary: 'Thanos Rule has rule groups that did not evaluate for 10 intervals.',
},
expr: |||
2 changes: 1 addition & 1 deletion mixin/runbook.md
@@ -77,7 +77,7 @@
|ThanosRuleConfigReloadFailure|Thanos Rule has not been able to reload configuration.|Thanos Rule {{$labels.job}} has not been able to reload its configuration.|info|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleconfigreloadfailure](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosruleconfigreloadfailure)|
|ThanosRuleQueryHighDNSFailures|Thanos Rule is having high number of DNS failures.|Thanos Rule {{$labels.job}} has {{$value humanize}}% of failing DNS queries for query endpoints.|warning|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulequeryhighdnsfailures](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulequeryhighdnsfailures)|
|ThanosRuleAlertmanagerHighDNSFailures|Thanos Rule is having high number of DNS failures.|Thanos Rule {{$labels.instance}} has {{$value humanize}}% of failing DNS queries for Alertmanager endpoints.|warning|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulealertmanagerhighdnsfailures](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulealertmanagerhighdnsfailures)|
|ThanosRuleNoEvaluationFor10Intervals|Thanos Rule has rule groups that did not evaluate for 10 intervals.|Thanos Rule {{$labels.job}} has {{$value humanize}}% rule groups that did not evaluate for at least 10x of their expected interval.|info|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulenoevaluationfor10intervals](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulenoevaluationfor10intervals)|
|ThanosRuleNoEvaluationFor10Intervals|Thanos Rule has rule groups that did not evaluate for 10 intervals.|Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval.|info|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulenoevaluationfor10intervals](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosrulenoevaluationfor10intervals)|
|ThanosNoRuleEvaluations|Thanos Rule did not perform any rule evaluations.|Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes.|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosnoruleevaluations](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosnoruleevaluations)|

## thanos-sidecar
37 changes: 27 additions & 10 deletions pkg/replicate/replicator.go
@@ -27,6 +27,7 @@ import (
"github.com/thanos-io/thanos/pkg/component"
"github.com/thanos-io/thanos/pkg/extprom"
thanosmodel "github.com/thanos-io/thanos/pkg/model"
"github.com/thanos-io/thanos/pkg/objstore"
"github.com/thanos-io/thanos/pkg/objstore/client"
"github.com/thanos-io/thanos/pkg/prober"
"github.com/thanos-io/thanos/pkg/runutil"
@@ -167,15 +168,7 @@ func RunReplicate(
}, []string{"result"})
replicationRunDuration.WithLabelValues(labelSuccess)
replicationRunDuration.WithLabelValues(labelError)

fetcher, err := thanosblock.NewMetaFetcher(
logger,
32,
fromBkt,
"",
reg,
[]thanosblock.MetadataFilter{thanosblock.NewTimePartitionMetaFilter(*minTime, *maxTime)},
)
fetcher, err := newMetaFetcher(logger, fromBkt, reg, *minTime, *maxTime, 32, ignoreMarkedForDeletion)
if err != nil {
return errors.Wrapf(err, "create meta fetcher with bucket %v", fromBkt)
}
@@ -186,7 +179,6 @@
resolutions,
compactions,
blockIDs,
ignoreMarkedForDeletion,
).Filter
metrics := newReplicationMetrics(reg)
ctx, cancel := context.WithCancel(context.Background())
@@ -243,3 +235,28 @@ func RunReplicate(

return nil
}

func newMetaFetcher(
logger log.Logger,
fromBkt objstore.InstrumentedBucket,
reg prometheus.Registerer,
minTime,
maxTime thanosmodel.TimeOrDurationValue,
concurrency int,
ignoreMarkedForDeletion bool,
) (*thanosblock.MetaFetcher, error) {
filters := []thanosblock.MetadataFilter{
thanosblock.NewTimePartitionMetaFilter(minTime, maxTime),
}
if ignoreMarkedForDeletion {
filters = append(filters, thanosblock.NewIgnoreDeletionMarkFilter(logger, fromBkt, 0, concurrency))
}
return thanosblock.NewMetaFetcher(
logger,
concurrency,
fromBkt,
"",
reg,
filters,
)
}
36 changes: 13 additions & 23 deletions pkg/replicate/scheme.go
@@ -29,12 +29,11 @@ import (

// BlockFilter is block filter that filters out compacted and unselected blocks.
type BlockFilter struct {
logger log.Logger
labelSelector labels.Selector
resolutionLevels map[compact.ResolutionLevel]struct{}
compactionLevels map[int]struct{}
blockIDs []ulid.ULID
ignoreMarkedForDeletion bool
logger log.Logger
labelSelector labels.Selector
resolutionLevels map[compact.ResolutionLevel]struct{}
compactionLevels map[int]struct{}
blockIDs []ulid.ULID
}

// NewBlockFilter returns block filter.
@@ -44,7 +43,6 @@ func NewBlockFilter(
resolutionLevels []compact.ResolutionLevel,
compactionLevels []int,
blockIDs []ulid.ULID,
ignoreMarkedForDeletion bool,
) *BlockFilter {
allowedResolutions := make(map[compact.ResolutionLevel]struct{})
for _, resolutionLevel := range resolutionLevels {
@@ -56,20 +54,16 @@
}

return &BlockFilter{
labelSelector: labelSelector,
logger: logger,
resolutionLevels: allowedResolutions,
compactionLevels: allowedCompactions,
blockIDs: blockIDs,
ignoreMarkedForDeletion: ignoreMarkedForDeletion,
labelSelector: labelSelector,
logger: logger,
resolutionLevels: allowedResolutions,
compactionLevels: allowedCompactions,
blockIDs: blockIDs,
}
}

// Filter return true if block is non-compacted and matches selector.
func (bf *BlockFilter) Filter(b *metadata.Meta, markedForDeletion bool) bool {
if bf.ignoreMarkedForDeletion && markedForDeletion {
return false
}
func (bf *BlockFilter) Filter(b *metadata.Meta) bool {
if len(b.Thanos.Labels) == 0 {
level.Error(bf.logger).Log("msg", "filtering block", "reason", "labels should not be empty")
return false
@@ -121,7 +115,7 @@ func (bf *BlockFilter) Filter(b *metadata.Meta, markedForDeletion bool) bool {
return true
}

type blockFilterFunc func(b *metadata.Meta, markedForDeletion bool) bool
type blockFilterFunc func(b *metadata.Meta) bool

// TODO: Add filters field.
type replicationScheme struct {
@@ -198,11 +192,7 @@ func (rs *replicationScheme) execute(ctx context.Context) error {
}

for id, meta := range metas {
_, err := rs.fromBkt.ReaderWithExpectedErrs(rs.fromBkt.IsObjNotFoundErr).Get(ctx, path.Join(meta.ULID.String(), metadata.DeletionMarkFilename))
if err != nil && !rs.fromBkt.IsObjNotFoundErr(err) {
return errors.Wrapf(err, "failed to read deletion mark from origin bucket block %s", meta.ULID.String())
}
if rs.blockFilter(meta, !rs.fromBkt.IsObjNotFoundErr(err)) {
if rs.blockFilter(meta) {
level.Info(rs.logger).Log("msg", "adding block to be replicated", "block_uuid", id.String())
availableBlocks = append(availableBlocks, meta)
}
20 changes: 17 additions & 3 deletions pkg/replicate/scheme_test.go
@@ -20,13 +20,20 @@ import (
"github.com/prometheus/prometheus/model/labels"
"github.com/prometheus/prometheus/tsdb"

"github.com/thanos-io/thanos/pkg/block"
"github.com/thanos-io/thanos/pkg/block/metadata"
"github.com/thanos-io/thanos/pkg/compact"
"github.com/thanos-io/thanos/pkg/model"
"github.com/thanos-io/thanos/pkg/objstore"
"github.com/thanos-io/thanos/pkg/testutil"
)

var (
minTime = time.Unix(0, 0)
maxTime, _ = time.Parse(time.RFC3339, "9999-12-31T23:59:59Z")
minTimeDuration = model.TimeOrDurationValue{Time: &minTime}
maxTimeDuration = model.TimeOrDurationValue{Time: &maxTime}
)

func testLogger(testName string) log.Logger {
return log.With(
level.NewFilter(log.NewLogfmtLogger(log.NewSyncWriter(os.Stderr)), level.AllowDebug()),
@@ -374,8 +381,15 @@ func TestReplicationSchemeAll(t *testing.T) {
selector = c.selector
}

filter := NewBlockFilter(logger, selector, []compact.ResolutionLevel{compact.ResolutionLevelRaw}, []int{1}, c.blockIDs, c.ignoreMarkedForDeletion).Filter
fetcher, err := block.NewMetaFetcher(logger, 32, objstore.WithNoopInstr(originBucket), "", nil, nil)
filter := NewBlockFilter(logger, selector, []compact.ResolutionLevel{compact.ResolutionLevelRaw}, []int{1}, c.blockIDs).Filter
fetcher, err := newMetaFetcher(
logger, objstore.WithNoopInstr(originBucket),
nil,
minTimeDuration,
maxTimeDuration,
32,
c.ignoreMarkedForDeletion,
)
testutil.Ok(t, err)

r := newReplicationScheme(logger, newReplicationMetrics(nil), filter, fetcher, objstore.WithNoopInstr(originBucket), targetBucket, nil)
9 changes: 5 additions & 4 deletions pkg/shipper/shipper.go
@@ -470,14 +470,15 @@ func WriteMetaFile(logger log.Logger, dir string, meta *Meta) error {

// ReadMetaFile reads the given meta from <dir>/thanos.shipper.json.
func ReadMetaFile(dir string) (*Meta, error) {
b, err := ioutil.ReadFile(filepath.Join(dir, filepath.Clean(MetaFilename)))
fpath := filepath.Join(dir, filepath.Clean(MetaFilename))
b, err := os.ReadFile(fpath)
if err != nil {
return nil, err
return nil, errors.Wrapf(err, "failed to read %s", fpath)
}
var m Meta

var m Meta
if err := json.Unmarshal(b, &m); err != nil {
return nil, err
return nil, errors.Wrapf(err, "failed to parse %s as JSON: %q", fpath, string(b))
}
if m.Version != MetaVersion1 {
return nil, errors.Errorf("unexpected meta file version %d", m.Version)
