Release v0.37 #109

Merged
merged 706 commits into from
Dec 2, 2024

@jnyi (Collaborator) commented Dec 2, 2024

Merge the db_main branch, which has been running for a few weeks, into the release branch. A few highlights to call out:

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

thibaultmg and others added 30 commits October 16, 2024 14:50
* fix serverAsClient goroutines leak

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* fix lint

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* update changelog

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* delete invalid comment

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* remove temp dev test

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* remove timer channel drain

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

---------

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>
If we account stats for remote write and local writes we will count them
twice since the remote write will be counted locally again by the remote
receiver instance.

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>
We have seen deadlocks with endpoint discovery caused by the metric
collector hanging and not releasing the store labels lock. This causes
the endpoint update to hang, which also makes all endpoint readers hang on
acquiring a read lock for the resolved endpoints slice.

This commit makes sure the Collect method on the metrics collector has
a built in timeout to guard against cases where an upstream call never
reads from the collection channel.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
…ne (thanos-io#7382)

* *: Ensure objstore flag values are masked & disable debug/pprof/cmdline

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* small fix

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

---------

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
In LabelNames and LabelValues, gRPC calls were not pruned properly. While
the results are not wrong, this leads to inefficient fan-out for setups with
many endpoints.
We took the opportunity to unify the store filtering and generally also
the larger layout of the gRPC methods, including logging and tracing.

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>
Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
* Appending warn to changelog about breaking change

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* Including warning emoji

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

---------

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
…7392)

If we have a new querier, it will create query hints even though the
pushdown feature is no longer present. Old sidecars will then trigger
query pushdown, which leads to broken max, min, max_over_time and
min_over_time.

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>
* *: Using native histograms for grpc middleware metrics

Since we updated the middleware library, we can now use native histograms to keep track of latencies in grpc calls.
This is a semi-breaking change if people enabled native histogram collection on their Prometheus monitoring Thanos instances.

Signed-off-by: Pedro Tanaka <pedro.stanaka@gmail.com>

adding change log

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* removing empty space;

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

* Put full disclaimer in changelog

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>

---------

Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
* compact: recover from panics (thanos-io#7318)

For thanos-io#6775, it would be useful
to know the exact block IDs to aid debugging.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* Sidecar: wait for prometheus on startup (thanos-io#7323)

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>

* Receive: fix serverAsClient.Series goroutines leak (thanos-io#6948)

* fix serverAsClient goroutines leak

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* fix lint

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* update changelog

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* delete invalid comment

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* remove temp dev test

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* remove timer channel drain

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

---------

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* Receive: fix stats (thanos-io#7373)

If we account stats for remote write and local writes we will count them
twice since the remote write will be counted locally again by the remote
receiver instance.

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>

* *: Ensure objstore flag values are masked & disable debug/pprof/cmdline (thanos-io#7382)

* *: Ensure objstore flag values are masked & disable debug/pprof/cmdline

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* small fix

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

---------

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* Query: dont pass query hints to avoid triggering pushdown (thanos-io#7392)

If we have a new querier, it will create query hints even though the
pushdown feature is no longer present. Old sidecars will then trigger
query pushdown, which leads to broken max, min, max_over_time and
min_over_time.

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>

* Cut patch release v0.35.1

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

---------

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>
Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Co-authored-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Co-authored-by: Michael Hoffmann <mhoffm@posteo.de>
Co-authored-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>
Previously we deferred starting the gRPC server by blocking the whole
startup until we could ping Prometheus. This breaks use cases that rely
on the config reloader to start Prometheus.
We fix it by using a channel to defer starting the gRPC server
and loading external labels in an actor concurrently.

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>
* Update Prometheus

Signed-off-by: alanprot <alanprot@gmail.com>

* Updating prometheus to 4e664035e84e

Signed-off-by: alanprot <alanprot@gmail.com>

* Temporarily pinning prometheus common

Signed-off-by: alanprot <alanprot@gmail.com>

* fixing lint

Signed-off-by: alanprot <alanprot@gmail.com>

* Using jsoniter to encode promql responses

Signed-off-by: alanprot <alanprot@gmail.com>

* Removing e2e test case with invalid hyphen on a matcher -> Prometheus now supports this use case

Signed-off-by: alanprot <alanprot@gmail.com>

* Updating prometheus to v0.52.2-0.20240606174736-edd558884b24

Signed-off-by: alanprot <alanprot@gmail.com>

* pinning grpc to v1.63.2

Signed-off-by: alanprot <alanprot@gmail.com>

---------

Signed-off-by: alanprot <alanprot@gmail.com>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-21-10.us-west-2.compute.internal>
Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>
Allow suppressing environment variable expansion errors when a variable is
unset, and thus keep the reloader from crashing. Unset references are left as is.

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
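
As a rough illustration of the behaviour described above: the sketch below expands variable references and leaves unset ones untouched instead of failing. The `$(NAME)` reference syntax, the `expandEnv` name, and the returned list of missing names are all assumptions made for this example, not the reloader's actual API.

```go
package main

import "regexp"

// envRe matches references of the (assumed) form $(NAME).
var envRe = regexp.MustCompile(`\$\(([a-zA-Z_0-9]+)\)`)

// expandEnv replaces every $(NAME) reference using lookup. References to
// unset variables are left exactly as written and their names are returned,
// so the caller can decide whether to warn or to fail.
func expandEnv(input string, lookup func(string) (string, bool)) (string, []string) {
	var missing []string
	out := envRe.ReplaceAllStringFunc(input, func(ref string) string {
		name := envRe.FindStringSubmatch(ref)[1]
		if value, ok := lookup(name); ok {
			return value
		}
		missing = append(missing, name)
		return ref
	})
	return out, missing
}
```

Passing `os.LookupEnv` as `lookup` reads from the process environment; taking the lookup as a parameter keeps the function easy to test.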
* Update adopters.yml

Signed-off-by: Rishabh Soni <risrock02@gmail.com>

* Add files via upload

Signed-off-by: Rishabh Soni <risrock02@gmail.com>

---------

Signed-off-by: Rishabh Soni <risrock02@gmail.com>
Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>
Signed-off-by: Pedro Tanaka <pedro.tanaka@shopify.com>
Recently ran into an issue with Istio in particular, where leaving the
trailing dot on the SRV record returned by `dnssrvnoa` lookups led to an
inability to connect to the endpoint. Removing the trailing dot fixes
this behaviour.

Now, technically, this is a valid URL, and it shouldn't be a problem.
One could definitely argue that Istio should be responsible here for
ensuring that the traffic is delivered. The problem seems rooted in how
Istio attempts to do wildcard matching of URLs it receives: including
the dot leads it to look up an empty DNS field, which is invalid.

The approach I take here is actually copied from how Prometheus does it.
Therefore I hope we can sneak this through with the argument that 'this
is how Prometheus does it', regardless of whether or not this is
philosophically correct...

Signed-off-by: verejoel <j.verezhak@gmail.com>
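
For illustration, stripping the trailing dot from an SRV target before building the dial address can be sketched as below. The helper name `srvAddress` and the example host are hypothetical; the real change lives in the DNS resolver code.

```go
package main

import (
	"net"
	"strconv"
	"strings"
)

// srvAddress builds a host:port address from an SRV target, dropping the
// trailing dot that marks a fully qualified domain name.
func srvAddress(target string, port uint16) string {
	host := strings.TrimRight(target, ".")
	return net.JoinHostPort(host, strconv.Itoa(int(port)))
}
```

Targets without a trailing dot pass through unchanged, so the helper is safe to apply to every record.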
Bumps [go.opentelemetry.io/contrib/propagators/autoprop](https://github.com/open-telemetry/opentelemetry-go-contrib) from 0.38.0 to 0.53.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go-contrib/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/CHANGELOG.md)
- [Commits](open-telemetry/opentelemetry-go-contrib@zpages/v0.38.0...zpages/v0.53.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/contrib/propagators/autoprop
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [go.opentelemetry.io/contrib/samplers/jaegerremote](https://github.com/open-telemetry/opentelemetry-go-contrib) from 0.7.0 to 0.22.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go-contrib/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/CHANGELOG.md)
- [Commits](open-telemetry/opentelemetry-go-contrib@v0.7.0...v0.22.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/contrib/samplers/jaegerremote
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…hanos-io#7492)

* compact: Update filtered blocks list before second downsample pass

If the second downsampling pass is given the same filteredMetas
list as the first pass, it will create duplicates of blocks
created in the first pass.

It will also not be able to do further downsampling, e.g. 5m->1h,
using blocks created in the first pass, as it will not be aware
of them.

The metadata was already being synced before the second pass,
but not updated into the filteredMetas list.

Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>

* Update changelog

Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>

* e2e/compact: Fix number of blocks cleaned assertion

The value was increased in 2ed48f7 to fix the test,
with the reasoning that the hardcoded value must
have been taken from a run of the CI that didn't
reach the max value due to CI worker lag.

More likely the real reason is that commit 68bef3f
the day before had caused blocks to be duplicated
during downsampling.

The duplicate block is immediately marked for deletion,
causing an extra +1 in the number of blocks cleaned.

Subtracting one from the value again now that the
block duplication issue is fixed.

Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>

* e2e/compact: Revert change to downsample count assertion

Combined with the previous commit this effectively reverts
all of 2ed48f7, in which two assertions were changed to
(unknowingly) account for a bug which had just been
introduced in the downsampling code, causing duplicate blocks.

This assertion change I am less sure on the reasoning for,
but after running through the e2e tests several times locally,
it is consistent that the only downsampling happens in the
"compact-working" step, and so all other steps would report 0
for their total downsamples metric.

Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>

---------

Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>
Signed-off-by: 🌲 Harry 🌊 John 🏔 <johrry@amazon.com>
…s.go (thanos-io#7552)

Signed-off-by: Nishant Bansal <nishant.bansal.mec21@iitbhu.ac.in>
Signed-off-by: 🌲 Harry 🌊 John 🏔 <johrry@amazon.com>
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.24.0 to 0.25.0.
- [Commits](golang/crypto@v0.24.0...v0.25.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
thanos-io#7528)

Bumps [go.opentelemetry.io/otel/bridge/opentracing](https://github.com/open-telemetry/opentelemetry-go) from 1.21.0 to 1.28.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
- [Commits](open-telemetry/opentelemetry-go@v1.21.0...v1.28.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/otel/bridge/opentracing
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This commit adds the option of filtering rules by rule name, rule
group, or file. This brings the rule API closer in line with the current
Prometheus API.

Signed-off-by: Jacob Baungard Hansen <jacobbaungard@redhat.com>
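
A small sketch of what such filtering amounts to, assuming a rule is kept only when it passes every filter that was actually supplied. The `rule` type and `filterRules` function are hypothetical stand-ins, not the types used by the Thanos rules API.

```go
package main

// rule is a hypothetical, trimmed-down rule description for this sketch.
type rule struct {
	Name, Group, File string
}

// toSet turns a slice into a lookup set; an empty set means "no filter".
func toSet(values []string) map[string]struct{} {
	set := make(map[string]struct{}, len(values))
	for _, v := range values {
		set[v] = struct{}{}
	}
	return set
}

// matches reports whether value passes the filter set.
func matches(set map[string]struct{}, value string) bool {
	if len(set) == 0 {
		return true
	}
	_, ok := set[value]
	return ok
}

// filterRules keeps the rules that satisfy all non-empty filters.
func filterRules(rules []rule, names, groups, files []string) []rule {
	nameSet, groupSet, fileSet := toSet(names), toSet(groups), toSet(files)
	var out []rule
	for _, r := range rules {
		if matches(nameSet, r.Name) && matches(groupSet, r.Group) && matches(fileSet, r.File) {
			out = append(out, r)
		}
	}
	return out
}
```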
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.26.0 to 0.27.0.
- [Commits](golang/net@v0.26.0...v0.27.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…hanos-io#7525)

Bumps [go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc](https://github.com/open-telemetry/opentelemetry-go) from 1.27.0 to 1.28.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
- [Commits](open-telemetry/opentelemetry-go@v1.27.0...v1.28.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
yuchen-db and others added 26 commits November 13, 2024 16:03
Signed-off-by: Yi Jin <yi.jin@databricks.com>
Signed-off-by: Yuchen Wang <162491048+yuchen-db@users.noreply.github.com>
* support hedged requests in store

Signed-off-by: milinddethe15 <milinddethe15@gmail.com>

* hedged roundtripper with tdigest for dynamic delay

Signed-off-by: milinddethe15 <milinddethe15@gmail.com>

* refactor struct and fix lint

Signed-off-by: milinddethe15 <milinddethe15@gmail.com>

* Improve hedging implementation

Signed-off-by: milinddethe15 <milinddethe15@gmail.com>

* Improved hedging implementation

Signed-off-by: milinddethe15 <milinddethe15@gmail.com>

* Update store doc

Signed-off-by: milinddethe15 <milinddethe15@gmail.com>

* fix white space

Signed-off-by: milinddethe15 <milinddethe15@gmail.com>

* add enabled field

Signed-off-by: milinddethe15 <milinddethe15@gmail.com>

---------

Signed-off-by: milinddethe15 <milinddethe15@gmail.com>
I always get this in logs:
```
err: receive capnp conn: close tcp ...: use of closed network connection
```

This is also visible in the e2e test.

After Done() returns, the connection is closed either way so no need to
close it again.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
* Fix a storage GW bug that loses TSDB infos when joining them
* E2E test demonstrating a bug in the MinT calculation in distributed
  Engine

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
…o#7915)

* always close block series client at the end

Signed-off-by: Ben Ye <benye@amazon.com>

* add back close for loser tree

Signed-off-by: Ben Ye <benye@amazon.com>

---------

Signed-off-by: Ben Ye <benye@amazon.com>
* Update objstore and promql-engine to latest

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* Fixes after upgrade

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

---------

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Yi Jin <yi.jin@databricks.com>
Signed-off-by: Yi Jin <yi.jin@databricks.com>
Signed-off-by: Yi Jin <yi.jin@databricks.com>
Signed-off-by: Yi Jin <yi.jin@databricks.com>
Signed-off-by: Yi Jin <yi.jin@databricks.com>
@jnyi jnyi merged commit 149364c into release Dec 2, 2024
184 of 185 checks passed