Skip to content

Commit

Permalink
Merge branch 'master' into feature/docs-hugo
Browse files Browse the repository at this point in the history
# Conflicts:
#	docs/getting_started.md
#	docs/proposals/approved/201901-read-write-operations-bucket.md
  • Loading branch information
bwplotka committed Mar 30, 2019
2 parents dae29eb + 7a83cda commit 140d864
Show file tree
Hide file tree
Showing 57 changed files with 1,137 additions and 796 deletions.
Empty file added .dep-finished
Empty file.
10 changes: 5 additions & 5 deletions .promu.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,11 +6,11 @@ build:
path: ./cmd/thanos
flags: -a -tags netgo
ldflags: |
-X {{repoPath}}/vendor/github.com/prometheus/common/version.Version={{.Version}}
-X {{repoPath}}/vendor/github.com/prometheus/common/version.Revision={{.Revision}}
-X {{repoPath}}/vendor/github.com/prometheus/common/version.Branch={{.Branch}}
-X {{repoPath}}/vendor/github.com/prometheus/common/version.BuildUser={{user}}@{{host}}
-X {{repoPath}}/vendor/github.com/prometheus/common/version.BuildDate={{date "20060102-15:04:05"}}
-X github.com/prometheus/common/version.Version={{.Version}}
-X github.com/prometheus/common/version.Revision={{.Revision}}
-X github.com/prometheus/common/version.Branch={{.Branch}}
-X github.com/prometheus/common/version.BuildUser={{user}}@{{host}}
-X github.com/prometheus/common/version.BuildDate={{date "20060102-15:04:05"}}
crossbuild:
platforms:
- linux/amd64
Expand Down
30 changes: 24 additions & 6 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,17 +13,35 @@ We use *breaking* word for marking changes that are not backward compatible (rel

### Added
- [#811](https://github.com/improbable-eng/thanos/pull/811) Remote write receiver
- [#798](https://github.com/improbable-eng/thanos/pull/798) Ability to limit the maximum concurrent about of Series() calls in Thanos Store and the maximum amount of samples.

New options:

* `--store.grpc.series-sample-limit` limits the amount of samples that might be retrieved on a single Series() call. By default it is 0. Consider enabling it by setting it to more than 0 if you are running on limited resources.
* `--store.grpc.series-max-concurrency` limits the number of concurrent Series() calls in Thanos Store. By default it is 20. Considering making it lower or bigger depending on the scale of your deployment.

New metrics:
* `thanos_bucket_store_queries_dropped_total` shows how many queries were dropped due to the samples limit;
* `thanos_bucket_store_queries_concurrent_max` is a constant metric which shows how many Series() calls can concurrently be executed by Thanos Store;
* `thanos_bucket_store_queries_in_flight` shows how many queries are currently "in flight" i.e. they are being executed;
* `thanos_bucket_store_gate_duration_seconds` shows how many seconds it took for queries to pass through the gate in both cases - when that fails and when it does not.

New tracing span:
* `store_query_gate_ismyturn` shows how long it took for a query to pass (or not) through the gate.

:warning: **WARNING** :warning: #798 adds a new default limit to Thanos Store: `--store.grpc.series-max-concurrency`. Most likely you will want to make it the same as `--query.max-concurrent` on Thanos Query.

### Fixed
- [#921](https://github.com/improbable-eng/thanos/pull/921) `thanos_objstore_bucket_last_successful_upload_time` now does not appear when no blocks have been uploaded so far
- [#966](https://github.com/improbable-eng/thanos/pull/966) Bucket: verify no longer warns about overlapping blocks, that overlap `0s`

## [v0.3.2](https://github.com/improbable-eng/thanos/releases/tag/v0.3.2) - 2019.03.04

### Added
- [#851](https://github.com/improbable-eng/thanos/pull/851) New read API endpoint for api/v1/rules and api/v1/alerts.
- [#873](https://github.com/improbable-eng/thanos/pull/873) Store: fix set index cache LRU

:warning: **WARING** :warning: #873 fix fixes actual handling of `index-cache-size`. Handling of limit for this cache was
:warning: **WARNING** :warning: #873 fix fixes actual handling of `index-cache-size`. Handling of limit for this cache was
broken so it was unbounded all the time. From this release actual value matters and is extremely low by default. To "revert"
the old behaviour (no boundary), use a large enough value.

Expand Down Expand Up @@ -55,14 +73,14 @@ the old behaviour (no boundary), use a large enough value.
- [#649](https://github.com/improbable-eng/thanos/issues/649) - Fixed store label values api to add also external label values.
- [#396](https://github.com/improbable-eng/thanos/issues/396) - Fixed sidecar logic for proxying series that has more than 2^16 samples from Prometheus.
- [#732](https://github.com/improbable-eng/thanos/pull/732) - Fixed S3 authentication sequence. You can see new sequence enumerated [here](https://github.com/improbable-eng/thanos/blob/master/docs/storage.md#aws-s3-configuration)
- [#745](https://github.com/improbable-eng/thanos/pull/745) - Fixed race conditions and edge cases for Thanos Querier fanout logic.
- [#745](https://github.com/improbable-eng/thanos/pull/745) - Fixed race conditions and edge cases for Thanos Querier fanout logic.
- [#651](https://github.com/improbable-eng/thanos/issues/651) - Fixed index cache when asked buffer size is bigger than cache max size.

### Changed

- [#529](https://github.com/improbable-eng/thanos/pull/529) Massive improvement for compactor. Downsampling memory consumption was reduce to only store labels and single chunks per each series.
- Qurerier UI: Store page now shows the store APIs per component type.
- Prometheus and TSDB deps are now up to date with ~2.7.0 Prometheus version. Lot's of things has changed. See details [here #704](https://github.com/improbable-eng/thanos/pull/704) Known changes that affects us:
- Prometheus and TSDB deps are now up to date with ~2.7.0 Prometheus version. Lot's of things has changed. See details [here #704](https://github.com/improbable-eng/thanos/pull/704) Known changes that affects us:
- prometheus/prometheus/discovery/file
- [ENHANCEMENT] Discovery: Improve performance of previously slow updates of changes of targets. #4526
- [BUGFIX] Wait for service discovery to stop before exiting #4508 ??
Expand All @@ -83,11 +101,11 @@ the old behaviour (no boundary), use a large enough value.
- S3 provider:
- Added `put_user_metadata` option to config.
- Added `insecure_skip_verify` option to config.

### Deprecated

- Tests against Prometheus below v2.2.1. This does not mean *lack* of support for those. Only that we don't tests the compatibility anymore. See [#758](https://github.com/improbable-eng/thanos/issues/758) for details.

## [v0.2.1](https://github.com/improbable-eng/thanos/releases/tag/v0.2.1) - 2018.12.27

### Added
Expand Down
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ The philosophy of Thanos and our community is borrowing much from UNIX philosoph
* Write components that work together
* e.g. blocks should be stored in native prometheus format
* Make it easy to read, write, and, run components
* e.g. reduce complexity in system design and implementation
* e.g. reduce complexity in system design and implementation

## Adding New Features / Components

Expand Down
15 changes: 13 additions & 2 deletions MAINTAINERS.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,19 @@
|-----------------------|------------------------|--------------------------|------------------------------------------------------------|
| Bartłomiej Płotka | bwplotka@gmail.com | `@bwplotka` | [@bwplotka](https://github.com/bwplotka) |
| Dominic Green | dom@improbable.io | `@domgreen` | [@domgreen](https://github.com/domgreen) |
| Frederic Branczyk | fbranczyk@gmail.com | `@brancz` | [@brancz](https://github.com/brancz) |
| Giedrius Statkevičius | giedriuswork@gmail.com | `@Giedrius Statkevičius` | [@GiedriusS](https://github.com/GiedriusS) |
| Frederic Branczyk | fbranczyk@gmail.com | `@brancz` | [@brancz](https://github.com/brancz) |
| Giedrius Statkevičius | giedriuswork@gmail.com | `@Giedrius Statkevičius` | [@GiedriusS](https://github.com/GiedriusS) |
| Povilas Versockas | p.versockas@gmail.com | `@povilasv` | [@povilasv](https://github.com/povilasv)

We are bunch of people from different companies with various interests and skills.
We are from different parts of Europe: Germany, Lithuania, Poland and UK.
We have something in common though: We all share the love for OpenSource, Golang, Prometheus, :coffee: and Observability topics.

As either Software Developers or SRE (or both!) we've chosen to maintain (mostly in our free time) Thanos, the de facto way to scale awesome [Prometheus](https://prometheus.io) project.

Feel free to contact us (preferably on Slack) anytime for feedback, questions or :beers:/:coffee:/:tea:.

Especially feedback, please share if you have ideas what we can do better!

## Storage plugins maintainers

Expand Down
12 changes: 6 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,13 +8,13 @@

## Overview

Thanos is a set of components that can be composed into a highly available metric
system with unlimited storage capacity, which can be added seamlessly on top of existing
Prometheus deployments.
Thanos is a set of components that can be composed into a highly available metric
system with unlimited storage capacity, which can be added seamlessly on top of existing
Prometheus deployments.

Thanos leverages the Prometheus 2.0 storage format to cost-efficiently store historical metric
Thanos leverages the Prometheus 2.0 storage format to cost-efficiently store historical metric
data in any object storage while retaining fast query latencies. Additionally, it provides
a global query view across all Prometheus installations and can merge data from Prometheus
a global query view across all Prometheus installations and can merge data from Prometheus
HA pairs on the fly.

Concretely the aims of the project are:
Expand Down Expand Up @@ -65,7 +65,7 @@ Contributions are very welcome! See our [CONTRIBUTING.md](CONTRIBUTING.md) for m

## Community

Thanos is an open source project and we welcome new contributers and members
Thanos is an open source project and we welcome new contributers and members
of the community. Here are ways to get in touch with the community:

* Slack: [#thanos](https://join.slack.com/t/improbable-eng/shared_invite/enQtMzQ1ODcyMzQ5MjM4LWY5ZWZmNGM2ODc5MmViNmQ3ZTA3ZTY3NzQwOTBlMTkzZmIxZTIxODk0OWU3YjZhNWVlNDU3MDlkZGViZjhkMjc)
Expand Down
4 changes: 2 additions & 2 deletions benchmark/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,14 +38,14 @@ We started a prometheus instance with thanos sidecar, scraping 50 targets every
* 0.022 average query time (min: 0.018s, max: 0.073s)
* *0.812 queries per second*

This shows an added latency of *85-150%* when using thanos query. This is more or less expected, as network operations will have to be done twice. We took a profile of thanos query while under load, finding that about a third of the time is being spent evaluating the promql queries. We are looking into including newer versions of the prometheus libraries into thanos that include optimisations to this component.
This shows an added latency of *85-150%* when using thanos query. This is more or less expected, as network operations will have to be done twice. We took a profile of thanos query while under load, finding that about a third of the time is being spent evaluating the promql queries. We are looking into including newer versions of the prometheus libraries into thanos that include optimisations to this component.

Although we have not tested federated prometheus in the same controlled environment, in theory it should incur a similar overhead, as we will still be performing two network hops.

### Store performance
To test the store component, we generated 1 year of simulated metrics (100 timeseries taking random values every 15s, a total of 210 million samples). We were able to run heavy queries to touch all 210 million of these samples, e.g. a sum over 100 timeseries takes about 34.6 seconds. Smaller queries, for example fetching 1 year of samples from a single timeseries, were able to run in about 500 milliseconds.

When enabling downsampling over these timeseries, we were able to reduce query times by over 90%.
When enabling downsampling over these timeseries, we were able to reduce query times by over 90%.

### Ingestion
To try to find the limits of a single thanos-query service, we spun up a number prometheus instances, each scraping 10 metric-producing endpoints every second. We attached a thanos-query endpoint in front of these scrapers, and ran queries that would touch fetch a most recent metric from each of them. Each metric producing endpoint would serve 100 metrics, taking random values, and the query would fetch the most recent value from each of these metrics:
Expand Down
8 changes: 7 additions & 1 deletion cmd/thanos/compact.go
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,9 @@ func registerCompact(m map[string]setupFunc, app *kingpin.Application, name stri

haltOnError := cmd.Flag("debug.halt-on-error", "Halt the process if a critical compaction error is detected.").
Hidden().Default("true").Bool()
acceptMalformedIndex := cmd.Flag("debug.accept-malformed-index",
"Compaction index verification will ignore out of order label names.").
Hidden().Default("false").Bool()

httpAddr := regHTTPAddrFlag(cmd)

Expand Down Expand Up @@ -102,6 +105,7 @@ func registerCompact(m map[string]setupFunc, app *kingpin.Application, name stri
objStoreConfig,
time.Duration(*syncDelay),
*haltOnError,
*acceptMalformedIndex,
*wait,
map[compact.ResolutionLevel]time.Duration{
compact.ResolutionLevelRaw: time.Duration(*retentionRaw),
Expand All @@ -125,6 +129,7 @@ func runCompact(
objStoreConfig *pathOrContent,
syncDelay time.Duration,
haltOnError bool,
acceptMalformedIndex bool,
wait bool,
retentionByResolution map[compact.ResolutionLevel]time.Duration,
component string,
Expand Down Expand Up @@ -162,7 +167,8 @@ func runCompact(
}
}()

sy, err := compact.NewSyncer(logger, reg, bkt, syncDelay, blockSyncConcurrency)
sy, err := compact.NewSyncer(logger, reg, bkt, syncDelay,
blockSyncConcurrency, acceptMalformedIndex)
if err != nil {
return errors.Wrap(err, "create syncer")
}
Expand Down
16 changes: 12 additions & 4 deletions cmd/thanos/query.go
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ import (
"github.com/improbable-eng/thanos/pkg/tracing"
"github.com/improbable-eng/thanos/pkg/ui"
"github.com/oklog/run"
"github.com/opentracing/opentracing-go"
opentracing "github.com/opentracing/opentracing-go"
"github.com/pkg/errors"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/common/route"
Expand All @@ -39,7 +39,7 @@ import (
"github.com/prometheus/tsdb/labels"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials"
"gopkg.in/alecthomas/kingpin.v2"
kingpin "gopkg.in/alecthomas/kingpin.v2"
)

// registerQuery registers a query command.
Expand Down Expand Up @@ -91,6 +91,8 @@ func registerQuery(m map[string]setupFunc, app *kingpin.Application, name string
enablePartialResponse := cmd.Flag("query.partial-response", "Enable partial response for queries if no partial_response param is specified.").
Default("true").Bool()

storeResponseTimeout := modelDuration(cmd.Flag("store.response-timeout", "If a Store doesn't send any data in this specified duration then a Store will be ignored and partial data will be returned if it's enabled. 0 disables timeout.").Default("0ms"))

m[name] = func(g *run.Group, logger log.Logger, reg *prometheus.Registry, tracer opentracing.Tracer, _ bool) error {
peer, err := newPeerFn(logger, reg, true, *httpAdvertiseAddr, true)
if err != nil {
Expand Down Expand Up @@ -139,6 +141,7 @@ func registerQuery(m map[string]setupFunc, app *kingpin.Application, name string
*webPrefixHeaderName,
*maxConcurrentQueries,
time.Duration(*queryTimeout),
time.Duration(*storeResponseTimeout),
*replicaLabel,
peer,
selectorLset,
Expand Down Expand Up @@ -254,6 +257,7 @@ func runQuery(
webPrefixHeaderName string,
maxConcurrentQueries int,
queryTimeout time.Duration,
storeResponseTimeout time.Duration,
replicaLabel string,
peer cluster.Peer,
selectorLset labels.Labels,
Expand All @@ -276,7 +280,10 @@ func runQuery(
}

fileSDCache := cache.New()
dnsProvider := dns.NewProvider(logger, extprom.NewSubsystem(reg, "query_store_api"))
dnsProvider := dns.NewProvider(
logger,
extprom.WrapRegistererWithPrefix("thanos_querier_store_apis", reg),
)

var (
stores = query.NewStoreSet(
Expand Down Expand Up @@ -304,7 +311,7 @@ func runQuery(
},
dialOpts,
)
proxy = store.NewProxyStore(logger, stores.Get, component.Query, selectorLset)
proxy = store.NewProxyStore(logger, stores.Get, component.Query, selectorLset, storeResponseTimeout)
queryableCreator = query.NewQueryableCreator(logger, proxy, replicaLabel)
engine = promql.NewEngine(
promql.EngineOpts{
Expand Down Expand Up @@ -355,6 +362,7 @@ func runQuery(
}
fileSDCache.Update(update)
stores.Update(ctxUpdate)
dnsProvider.Resolve(ctxUpdate, append(fileSDCache.Addresses(), storeAddrs...))
case <-ctxUpdate.Done():
return nil
}
Expand Down
12 changes: 8 additions & 4 deletions cmd/thanos/rule.go
Original file line number Diff line number Diff line change
Expand Up @@ -30,15 +30,15 @@ import (
"github.com/improbable-eng/thanos/pkg/extprom"
"github.com/improbable-eng/thanos/pkg/objstore/client"
"github.com/improbable-eng/thanos/pkg/promclient"
"github.com/improbable-eng/thanos/pkg/rule/api"
v1 "github.com/improbable-eng/thanos/pkg/rule/api"
"github.com/improbable-eng/thanos/pkg/runutil"
"github.com/improbable-eng/thanos/pkg/shipper"
"github.com/improbable-eng/thanos/pkg/store"
"github.com/improbable-eng/thanos/pkg/store/storepb"
"github.com/improbable-eng/thanos/pkg/tracing"
"github.com/improbable-eng/thanos/pkg/ui"
"github.com/oklog/run"
"github.com/opentracing/opentracing-go"
opentracing "github.com/opentracing/opentracing-go"
"github.com/pkg/errors"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/common/model"
Expand All @@ -52,7 +52,7 @@ import (
"github.com/prometheus/prometheus/util/strutil"
"github.com/prometheus/tsdb/labels"
"google.golang.org/grpc"
"gopkg.in/alecthomas/kingpin.v2"
kingpin "gopkg.in/alecthomas/kingpin.v2"
)

// registerRule registers a rule command.
Expand Down Expand Up @@ -257,7 +257,11 @@ func runRule(

// FileSD query addresses.
fileSDCache := cache.New()
dnsProvider := dns.NewProvider(logger, extprom.NewSubsystem(reg, "rule_query"))

dnsProvider := dns.NewProvider(
logger,
extprom.WrapRegistererWithPrefix("thanos_ruler_query_apis", reg),
)

// Hit the HTTP query API of query peers in randomized order until we get a result
// back or the context get canceled.
Expand Down
8 changes: 6 additions & 2 deletions cmd/thanos/sidecar.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,12 +22,12 @@ import (
"github.com/improbable-eng/thanos/pkg/store"
"github.com/improbable-eng/thanos/pkg/store/storepb"
"github.com/oklog/run"
"github.com/opentracing/opentracing-go"
opentracing "github.com/opentracing/opentracing-go"
"github.com/pkg/errors"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/tsdb/labels"
"google.golang.org/grpc"
"gopkg.in/alecthomas/kingpin.v2"
kingpin "gopkg.in/alecthomas/kingpin.v2"
)

func registerSidecar(m map[string]setupFunc, app *kingpin.Application, name string) {
Expand Down Expand Up @@ -137,6 +137,10 @@ func runSidecar(
return err
}

level.Info(logger).Log(
"msg", "successfully loaded prometheus external labels",
"external_labels", m.Labels().String(),
)
promUp.Set(1)
lastHeartbeat.Set(float64(time.Now().UnixNano()) / 1e9)
return nil
Expand Down
Loading

0 comments on commit 140d864

Please sign in to comment.