Ingester blowing up to tens of thousands of goroutines #858
At the same time I see a couple of …
I managed to get a goroutine profile:
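(For reference: a profile in this format can be captured from a running Go process via the net/http/pprof endpoint, or programmatically with runtime/pprof, as in the minimal sketch below; this is background on the technique, not the exact command used here.)

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// debug=1 prints the aggregated "goroutine profile: total N" format,
	// with one stack per group of identical goroutines, as quoted below.
	_ = pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
}
```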
So why isn't …
Current working theory is that calls from distributor to ingester are timed out at 2 seconds, and when the caller cancels the call it is removed from the count of streams.
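To make that theory concrete, here is a hypothetical sketch of the distributor-side call pattern being described; IngesterClient, WriteRequest and pushWithTimeout are illustrative stand-ins, not the actual Cortex code:

```go
package main

import (
	"context"
	"time"
)

// IngesterClient is a stand-in for the real gRPC ingester client.
type IngesterClient interface {
	Push(ctx context.Context, req *WriteRequest) error
}

// WriteRequest is a stand-in for the real request type.
type WriteRequest struct{}

// pushWithTimeout gives the call a 2-second deadline. When the deadline
// fires, the caller returns and the client-side stream is released, but the
// server-side handler goroutine is not interrupted and can stay blocked on a
// mutex inside the ingester.
func pushWithTimeout(ctx context.Context, c IngesterClient, req *WriteRequest) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	return c.Push(ctx, req)
}
```

The point being illustrated is that cancellation only frees the client side; the ingester's goroutines keep piling up while they wait for the lock.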
Now let's consider what they are all blocked on. This is because one goroutine is part-way through claiming the write lock, so all readers wait.
Oddly, it's a query. It doesn't seem necessary for queries to do this (not that that would change the big picture - there are plenty of goroutines that do want to write).
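For background on why one goroutine "part-way through claiming the write lock" stalls all the readers: Go's sync.RWMutex blocks new RLock calls once a writer is waiting, so a single pending Lock queues every subsequent reader behind it. A small standalone demonstration:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // an existing reader holds the lock

	go func() {
		mu.Lock() // a writer starts waiting; new readers now queue behind it
		defer mu.Unlock()
		time.Sleep(100 * time.Millisecond)
	}()

	time.Sleep(10 * time.Millisecond) // let the writer reach Lock()

	go func() {
		mu.RLock() // blocks until the writer above has finished
		defer mu.RUnlock()
		fmt.Println("second reader ran only after the writer")
	}()

	time.Sleep(10 * time.Millisecond)
	mu.RUnlock() // release the original reader so the writer can proceed

	time.Sleep(200 * time.Millisecond)
}
```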
Does this imply there is some cause of inconsistency between the index and the series? I thought the queries should never cause a series to be created...
It's locking to create a …
Following this line of thinking, I increased the timeout to 10 seconds, thinking it would slow the growth. This did not prevent another ingester from bouncing at startup in the same way a bit later. I did see the distributors hit 1.5K goroutines. That time the largest set of blocked goroutines were waiting to write:
while one goroutine holds that lock and is waiting for readers to finish:
I am thinking we should shard userStates so more work can proceed in parallel, e.g. have 8 maps and 8 mutexes.
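A rough sketch of what that sharding could look like; shardedSeriesMap, memorySeries and the fingerprint key are illustrative stand-ins rather than the real userStates code:

```go
package ingester

import "sync"

const numShards = 8

// memorySeries is a stand-in for the real in-memory series type.
type memorySeries struct{}

// shardedSeriesMap is a hypothetical sharded replacement for a single
// map+RWMutex: each shard has its own lock, so a writer creating a series
// only blocks the goroutines that hash to the same shard.
type shardedSeriesMap struct {
	shards [numShards]struct {
		mtx    sync.RWMutex
		series map[uint64]*memorySeries
	}
}

func newShardedSeriesMap() *shardedSeriesMap {
	m := &shardedSeriesMap{}
	for i := range m.shards {
		m.shards[i].series = map[uint64]*memorySeries{}
	}
	return m
}

func (m *shardedSeriesMap) getOrCreate(fp uint64) *memorySeries {
	shard := &m.shards[fp%numShards]

	// Fast path: read lock only.
	shard.mtx.RLock()
	s, ok := shard.series[fp]
	shard.mtx.RUnlock()
	if ok {
		return s
	}

	// Slow path: take the write lock and re-check in case another
	// goroutine created the series in the meantime.
	shard.mtx.Lock()
	defer shard.mtx.Unlock()
	if s, ok := shard.series[fp]; ok {
		return s
	}
	s = &memorySeries{}
	shard.series[fp] = s
	return s
}
```

With 8 shards, a write-lock holder only stalls roughly an eighth of the appends instead of every append in the ingester.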
Perhaps sync.Map is more appropriate here? https://golang.org/pkg/sync/#Map
On Fri, 29 Jun 2018 at 19:51, Bryan Boreham wrote:
Current working theory is that calls from distributor to ingester are
timed out at 2 seconds, and when the caller cancels the call it is removed
from the count of streams.
Following this line of thinking, I increased the timeout to 10 seconds,
thinking it would slow the growth. This did not prevent another ingester
from bouncing at startup in the same way a bit later. I did see the
distributors hit 1.5K goroutines.
That time the largest set of blocked goroutines were waiting to write:
goroutine profile: total 6297
3781 @ 0x42d8aa 0x42d95e 0x43e044 0x43dd5d 0x470798 0x47169d 0xc3a0f6 0xc330d6 0xc32d93 0xbff996 0x819fde 0x832e66 0x831f69 0x832e66 0x81b250 0x832e66 0x81a14b 0x839058 0x832e66 0x833066 0xbe9987 0x7f80fc 0x7fba08 0x802bdf 0x45ac41
# 0x43dd5c sync.runtime_SemacquireMutex+0x3c /usr/local/go/src/runtime/sema.go:71
# 0x470797 sync.(*Mutex).Lock+0x107 /usr/local/go/src/sync/mutex.go:134
# 0x47169c sync.(*RWMutex).Lock+0x2c /usr/local/go/src/sync/rwmutex.go:93
# 0xc3a0f5 github.com/weaveworks/cortex/pkg/ingester.(*userStates).getOrCreateSeries+0x135 /go/src/github.com/weaveworks/cortex/pkg/ingester/user_state.go:158
# 0xc330d5 github.com/weaveworks/cortex/pkg/ingester.(*Ingester).append+0x255 /go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:344
# 0xc32d92 github.com/weaveworks/cortex/pkg/ingester.(*Ingester).Push+0xe2 /go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:304
# 0xbff995 github.com/weaveworks/cortex/pkg/ingester/client._Ingester_Push_Handler.func1+0x85 /go/src/github.com/weaveworks/cortex/pkg/ingester/client/cortex.pb.go:1713
while one goroutine holds that lock and is waiting for readers to finish:
1 @ 0x42d8aa 0x42d95e 0x43e044 0x43dc69 0x4716de 0xc3a0f6 0xc330d6 0xc32d93 0xbff996 0x819fde 0x832e66 0x831f69 0x832e66 0x81b250 0x832e66 0x81a14b 0x839058 0x832e66 0x833066 0xbe9987 0x7f80fc 0x7fba08 0x802bdf 0x45ac41
# 0x43dc68 sync.runtime_Semacquire+0x38 /usr/local/go/src/runtime/sema.go:56
# 0x4716dd sync.(*RWMutex).Lock+0x6d /usr/local/go/src/sync/rwmutex.go:98
# 0xc3a0f5 github.com/weaveworks/cortex/pkg/ingester.(*userStates).getOrCreateSeries+0x135 /go/src/github.com/weaveworks/cortex/pkg/ingester/user_state.go:158
# 0xc330d5 github.com/weaveworks/cortex/pkg/ingester.(*Ingester).append+0x255 /go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:344
# 0xc32d92 github.com/weaveworks/cortex/pkg/ingester.(*Ingester).Push+0xe2 /go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:304
# 0xbff995 github.com/weaveworks/cortex/pkg/ingester/client._Ingester_Push_Handler.func1+0x85 /go/src/github.com/weaveworks/cortex/pkg/ingester/client/cortex.pb.go:1713
I am thinking we should shard userStates so more work can proceed in
parallel. E.g. have 8 maps and 8 mutexes.
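For comparison with the sharding idea, here is a sketch of the sync.Map approach suggested above: Load gives lock-free reads on the hot append path, and LoadOrStore handles the create race. The fingerprint key and memorySeries type are again illustrative stand-ins:

```go
package ingester

import "sync"

// memorySeries is a stand-in for the real in-memory series type.
type memorySeries struct{}

// seriesByFingerprint sketches a sync.Map-based getOrCreateSeries: lookups
// of existing series take no lock at all, which is exactly the hot path
// blocked in the goroutine dumps above.
type seriesByFingerprint struct {
	m sync.Map // fingerprint (uint64) -> *memorySeries
}

func (s *seriesByFingerprint) getOrCreate(fp uint64) *memorySeries {
	if v, ok := s.m.Load(fp); ok {
		return v.(*memorySeries)
	}
	// LoadOrStore either returns the value another goroutine stored first,
	// or stores and returns the one we just allocated.
	v, _ := s.m.LoadOrStore(fp, &memorySeries{})
	return v.(*memorySeries)
}
```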
I did some reading and found your suggestion entirely plausible: very likely to work much better in steady state, though not guaranteed to work better at startup (which is where I'm seeing the problem). I kinda feel I need the whole thing to slow down. I tried increasing the 2-second timeout to 10 seconds, but now the distributors blow up.
Now that sync.Map is in, can we close this? Or do you think we should stick a limit on the number of concurrent requests too?
A very similar thing happened today - this time the cause may have been #1301, because we were rejecting 300,000 samples per second at the time. So yes, I want to limit the goroutines.
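One simple way to bound the goroutines, sketched here with a buffered-channel semaphore around the push path; this is an illustrative pattern, not the limiter Cortex actually adopted:

```go
package ingester

import (
	"context"
	"errors"
)

// errTooManyInflight is returned when the ingester is already handling the
// maximum number of concurrent pushes.
var errTooManyInflight = errors.New("too many in-flight push requests")

// pushLimiter caps the number of push handlers running at once; anything
// beyond the cap fails fast instead of piling up as blocked goroutines.
type pushLimiter struct {
	slots chan struct{}
}

func newPushLimiter(max int) *pushLimiter {
	return &pushLimiter{slots: make(chan struct{}, max)}
}

func (l *pushLimiter) do(ctx context.Context, push func(context.Context) error) error {
	select {
	case l.slots <- struct{}{}: // acquire a slot
	default:
		return errTooManyInflight // shed load instead of queueing
	}
	defer func() { <-l.slots }() // release the slot
	return push(ctx)
}
```

Failing fast keeps memory bounded and lets the caller retry, instead of letting tens of thousands of handlers queue on the same mutex.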
We went through this issue during our bug scrub and decided to close it, as we haven't seen it in recent times. We will re-open if we spot it again, but it is likely fixed.
Had one ingester OOM a few times in succession today; the thing that stands out is the number of goroutines:
No obvious cause; in the logs I can see someone was playing silly games with long label values, but not at high rates, and if that was the problem I would expect it to hit 3 ingesters rather than 1.