Receive: Dead Lock in multitsdb #7856

Closed · jnyi opened this issue Oct 23, 2024 · 5 comments

jnyi (Contributor) commented Oct 23, 2024

We are testing the latest Thanos main branch code to adopt cuckoo filters and found that the receiver sometimes leaks a very high number of goroutines. A pprof capture points to a potential lock race condition between the following functions:

(Attached: four pprof screenshots captured Oct 23, 2024.)

Thanos, Prometheus and Golang version used:

thanos: v0.37.0-dev
golang: 1.23.0

Object Storage Provider:

What happened:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

Anything else we need to know:


dosubot bot commented Oct 23, 2024

Hey @jnyi! I'm here to help you with any bugs, questions, or contributions. Let's work together to solve this issue.

The issue you're encountering with potential lock race conditions in the (*MultiTSDB) struct, particularly involving the getOrLoadTenant and TSDBLocalClients functions, could be related to how these functions manage concurrency and synchronization. The (*MultiTSDB) struct uses a read-write mutex (sync.RWMutex) to handle concurrent access to its fields, which is crucial for preventing race conditions [1].

The getOrLoadTenant function retrieves or initializes a tenant's TSDB, using locks to ensure thread safety when accessing the tenants map. It first attempts a fast path with a read lock and, if necessary, switches to a write lock to prevent race conditions [1]. The TSDBLocalClients function also uses a read-write lock to manage access to the list of store clients, updating the list if needed [1].

Given the high number of goroutines and potential lock contention, it might be beneficial to review the usage patterns of these functions to ensure they are not being called excessively or in a way that could lead to contention. Additionally, consider profiling the application to identify specific areas where locks might be causing bottlenecks or where goroutines are not being managed efficiently.
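As an illustration only (a minimal sketch, not the actual Thanos implementation), the read-lock fast path followed by a write-lock slow path described above generally looks like the following in Go; the failure mode under discussion arises when some branch acquires the lock but skips the unlock:

```go
// Illustrative sketch of the locking pattern described above; names and
// structure are simplified and do not mirror the real MultiTSDB code.
package receive

import "sync"

type tenant struct{}

type multiTSDB struct {
	mtx     sync.RWMutex
	tenants map[string]*tenant
}

func (m *multiTSDB) getOrLoadTenant(id string) *tenant {
	// Fast path: most calls only need the read lock.
	m.mtx.RLock()
	t, ok := m.tenants[id]
	m.mtx.RUnlock()
	if ok {
		return t
	}

	// Slow path: take the write lock and re-check, because another goroutine
	// may have created the tenant between releasing the read lock and here.
	m.mtx.Lock()
	defer m.mtx.Unlock() // defer releases the lock on every return path
	if t, ok := m.tenants[id]; ok {
		return t
	}
	t = &tenant{}
	m.tenants[id] = t
	return t
}
```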


@jnyi jnyi changed the title Lock race condition in multitsdb Receive: Lock race condition in multitsdb Oct 23, 2024
jnyi (Contributor, Author) commented Oct 23, 2024

We plan to revert this PR internally; it uses more memory but is less likely to cause pod goroutine leaks that hurt the reliability of the system: #7782

GiedriusS (Member)

Fixing in #7857, sorry for the issues.

jnyi (Contributor, Author) commented Oct 23, 2024

No problem, thanks for helping out so quickly. I wonder if we could capture this kind of leak in unit tests, e.g. with -race, or did -race miss it because we didn't properly unit test this code branch?
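One way such goroutine leaks can be surfaced in unit tests, independent of -race, is a goroutine-leak detector such as go.uber.org/goleak; a minimal sketch (not part of the Thanos test suite), assuming the test shuts the component down before returning:

```go
// Minimal sketch: goleak fails the test if goroutines started during the
// test are still running when it finishes.
package receive

import (
	"testing"

	"go.uber.org/goleak"
)

func TestMultiTSDBNoGoroutineLeak(t *testing.T) {
	defer goleak.VerifyNone(t) // fails the test if extra goroutines remain

	// ... exercise the code path that previously leaked goroutines,
	// making sure to close/flush the component before returning ...
}
```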

GiedriusS (Member)

We have -race enabled but it doesn't catch stuff like this. We need some kind of linter to catch situations where we lock something but don't unlock it in all branches.
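To illustrate why -race does not help here (hypothetical code, not the actual Thanos bug): the race detector only reports unsynchronized memory accesses, whereas a branch that returns while still holding the mutex produces no data race at all; subsequent callers simply block on Lock(), which is what shows up as piled-up goroutines in pprof. A linter that tracks lock/unlock pairs across branches, as suggested above, is the kind of tool that would flag this.

```go
// Hypothetical example of the bug class being discussed.
package receive

import "sync"

type clientCache struct {
	mtx     sync.Mutex
	clients []string
}

func (c *clientCache) clientsIfReady(ready bool) []string {
	c.mtx.Lock()
	if !ready {
		return nil // BUG: early return without c.mtx.Unlock(); the next caller blocks forever
	}
	out := c.clients
	c.mtx.Unlock()
	return out
}
```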

@jnyi jnyi changed the title Receive: Lock race condition in multitsdb Receive: Dead Lock in multitsdb Oct 25, 2024