Receive: Dead Lock in multitsdb #7856

Closed · jnyi opened this issue Oct 23, 2024 · 5 comments

jnyi (Contributor) commented Oct 23, 2024

We are testing the latest Thanos main branch code to adopt cuckoo filters and found that the receiver sometimes leaks a very high number of goroutines. A pprof capture points to a potential lock race condition between the following functions:

(Attached: four pprof screenshots captured Oct 23, 2024.)

Thanos, Prometheus and Golang version used:

thanos: v0.37.0-dev
golang: 1.23.0

Object Storage Provider:

What happened:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

Anything else we need to know:


dosubot bot commented Oct 23, 2024

Hey @jnyi! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve this issue.

The issue you're encountering with potential lock race conditions in the (*MultiTSDB) struct, particularly involving the getOrLoadTenant and TSDBLocalClients functions, could be related to how these functions manage concurrency and synchronization. The (*MultiTSDB) struct uses a read-write mutex (sync.RWMutex) to handle concurrent access to its fields, which is crucial for preventing race conditions [1].

The getOrLoadTenant function retrieves or initializes a tenant's TSDB, using locks to ensure thread safety when accessing the tenants map. It first attempts a fast path with a read lock and, if necessary, switches to a write lock to prevent race conditions [1]. The TSDBLocalClients function also uses a read-write lock to manage access to the list of store clients, updating the list if needed [1].

Given the high number of goroutines and potential lock contention, it might be beneficial to review the usage patterns of these functions to ensure they are not being called excessively or in a way that could lead to contention. Additionally, consider profiling the application to identify specific areas where locks might be causing bottlenecks or where goroutines are not being managed efficiently.
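As an illustration only (a minimal sketch, not the actual Thanos implementation), the read-lock fast path followed by a write-lock slow path described above generally looks like the following in Go; the failure mode under discussion arises when some branch acquires the lock but skips the unlock:

```go
// Illustrative sketch of the locking pattern described above; names and
// structure are simplified and do not mirror the real MultiTSDB code.
package receive

import "sync"

type tenant struct{}

type multiTSDB struct {
	mtx     sync.RWMutex
	tenants map[string]*tenant
}

func (m *multiTSDB) getOrLoadTenant(id string) *tenant {
	// Fast path: most calls only need the read lock.
	m.mtx.RLock()
	t, ok := m.tenants[id]
	m.mtx.RUnlock()
	if ok {
		return t
	}

	// Slow path: take the write lock and re-check, because another goroutine
	// may have created the tenant between releasing the read lock and here.
	m.mtx.Lock()
	defer m.mtx.Unlock() // defer releases the lock on every return path
	if t, ok := m.tenants[id]; ok {
		return t
	}
	t = &tenant{}
	m.tenants[id] = t
	return t
}
```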


@jnyi jnyi changed the title Lock race condition in multitsdb Receive: Lock race condition in multitsdb Oct 23, 2024
jnyi (Contributor, Author) commented Oct 23, 2024

We plan to revert this PR internally; it uses more memory but is less likely to cause pod goroutine leaks that hurt the reliability of the system: #7782

GiedriusS (Member)

Fixing in #7857, sorry for the issues.

jnyi (Contributor, Author) commented Oct 23, 2024

No problem, thanks for helping out so quickly. I wonder if we could capture this kind of leak in unit tests, e.g. with -race, or did -race miss it because we didn't properly unit test this code branch?
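One way such goroutine leaks can be surfaced in unit tests, independent of -race, is a goroutine-leak detector such as go.uber.org/goleak; a minimal sketch (not part of the Thanos test suite), assuming the test shuts the component down before returning:

```go
// Minimal sketch: goleak fails the test if goroutines started during the
// test are still running when it finishes.
package receive

import (
	"testing"

	"go.uber.org/goleak"
)

func TestMultiTSDBNoGoroutineLeak(t *testing.T) {
	defer goleak.VerifyNone(t) // fails the test if extra goroutines remain

	// ... exercise the code path that previously leaked goroutines,
	// making sure to close/flush the component before returning ...
}
```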

GiedriusS (Member)

We have -race enabled but it doesn't catch stuff like this. We need some kind of linter to catch situations where we lock something but don't unlock it in all branches.
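To illustrate why -race does not help here (hypothetical code, not the actual Thanos bug): the race detector only reports unsynchronized memory accesses, whereas a branch that returns while still holding the mutex produces no data race at all; subsequent callers simply block on Lock(), which is what shows up as piled-up goroutines in pprof. A linter that tracks lock/unlock pairs across branches, as suggested above, is the kind of tool that would flag this.

```go
// Hypothetical example of the bug class being discussed.
package receive

import "sync"

type clientCache struct {
	mtx     sync.Mutex
	clients []string
}

func (c *clientCache) clientsIfReady(ready bool) []string {
	c.mtx.Lock()
	if !ready {
		return nil // BUG: early return without c.mtx.Unlock(); the next caller blocks forever
	}
	out := c.clients
	c.mtx.Unlock()
	return out
}
```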

@jnyi jnyi changed the title Receive: Lock race condition in multitsdb Receive: Dead Lock in multitsdb Oct 25, 2024