Proxy: Query goroutine leak when store.response-timeout is set #7618

Merged 1 commit on Aug 13, 2024

Conversation

cincinnat
Contributor

@cincinnat cincinnat commented Aug 9, 2024

time.AfterFunc() returns a time.Timer object whose C field is nil, according to the documentation. A goroutine that reads from a nil channel blocks forever, which leaks a goroutine on random slow queries in Thanos.

The leak is most apparent on busy services running with query.promql-engine=thanos, where goroutines tend to get stuck in batches because the engine makes wide use of sync.Once.

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

@cincinnat cincinnat force-pushed the query-goroutine-leak branch 4 times, most recently from 910a242 to a4b9301 August 9, 2024 13:10
Contributor

@MichaHoffmann MichaHoffmann left a comment

Lgtm, thank you!

@saswatamcode
Member

@cincinnat could you kindly rebase on the latest main? We had a CI issue, which seems to be fixed. Want to merge this on green 🙂

time.AfterFunc() returns a time.Timer object whose C field is nil,
according to the documentation. A goroutine blocks forever on reading
from a `nil` channel, leading to a goroutine leak on random slow
queries.

Signed-off-by: Mikhail Nozdrachev <mikhail.nozdrachev@aiven.io>
@cincinnat cincinnat force-pushed the query-goroutine-leak branch from a4b9301 to d23ca27 August 13, 2024 07:04
@saswatamcode saswatamcode merged commit 4050c73 into thanos-io:main Aug 13, 2024
19 of 20 checks passed
@cincinnat cincinnat deleted the query-goroutine-leak branch August 13, 2024 07:39
saswatamcode pushed a commit to saswatamcode/thanos that referenced this pull request Aug 13, 2024
saswatamcode added a commit that referenced this pull request Aug 13, 2024
* Proxy: Query goroutine leak when `store.response-timeout` is set (#7618)

time.AfterFunc() returns a time.Timer object whose C field is nil,
according to the documentation. A goroutine blocks forever on reading
from a `nil` channel, leading to a goroutine leak on random slow
queries.

Signed-off-by: Mikhail Nozdrachev <mikhail.nozdrachev@aiven.io>

* pkg/clientconfig: fix TLS configs with only CA (#7634)

065e3dd introduced a regression: TLS configurations for Thanos Ruler
query and alerting with only a CA file failed to load.

For instance, the following snippet is a valid query configuration:

```
- static_configs:
  - prometheus.example.com:9090
  scheme: https
  http_config:
    tls_config:
      ca_file: /etc/ssl/cert.pem
```

The test fixtures (CA, certificate and key files) are copied from
prometheus/common and are valid until 2072.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

* Cut patch release v0.36.1

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* Fix failing e2e test (#7620)

Signed-off-by: 🌲 Harry 🌊 John 🏔 <johrry@amazon.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

---------

Signed-off-by: Mikhail Nozdrachev <mikhail.nozdrachev@aiven.io>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: 🌲 Harry 🌊 John 🏔 <johrry@amazon.com>
Co-authored-by: Mikhail Nozdrachev <mikhail.nozdrachev@aiven.io>
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
Co-authored-by: Harry John <johrry@amazon.com>
saswatamcode added a commit that referenced this pull request Aug 14, 2024
* CHANGELOG: Mark 0.36 as in progress

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>

* Cut release candidate v0.36.0-rc.0 (#7490)

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>

* Cut release candidate 0.36.0 rc.1 (#7510)

* *: fix server grpc histograms (#7493)

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>

* Close endpoints after the gRPC server has terminated (#7509)

Endpoints are currently closed as soon as we receive a SIGTERM or SIGINT.
This causes in-flight queries to get cancelled since outgoing connections
get closed instantly.

This commit moves the endpoints.Close call after the grpc server shutdown
to make sure connections are available as long as the server is running.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Cut release candidate v0.36.0-rc.1

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>

---------

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Co-authored-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Cut release v0.36.0 (#7578)

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>

* Cut patch release `v0.36.1` (#7636)

---------

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: Mikhail Nozdrachev <mikhail.nozdrachev@aiven.io>
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: 🌲 Harry 🌊 John 🏔 <johrry@amazon.com>
Co-authored-by: Michael Hoffmann <mhoffm@posteo.de>
Co-authored-by: Filip Petkovski <filip.petkovsky@gmail.com>
Co-authored-by: Mikhail Nozdrachev <mikhail.nozdrachev@aiven.io>
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
Co-authored-by: Harry John <johrry@amazon.com>
hczhu-db pushed a commit to databricks/thanos that referenced this pull request Aug 22, 2024
3 participants