Ingester blowing up to tens of thousands of goroutines #858
At the same time I see a couple of …
I managed to get a goroutine profile:
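(For reference: a profile in this format can be captured from a running Go process via the net/http/pprof endpoint, or programmatically with runtime/pprof, as in the minimal sketch below; this is background on the technique, not the exact command used here.)

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// debug=1 prints the aggregated "goroutine profile: total N" format,
	// with one stack per group of identical goroutines, as quoted below.
	_ = pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
}
```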
So why isn't …
Current working theory is that calls from distributor to ingester are timed out at 2 seconds, and when the caller cancels the call it is removed from the count of streams.
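To make that theory concrete, here is a hypothetical sketch of the distributor-side call pattern being described; IngesterClient, WriteRequest and pushWithTimeout are illustrative stand-ins, not the actual Cortex code:

```go
package main

import (
	"context"
	"time"
)

// IngesterClient is a stand-in for the real gRPC ingester client.
type IngesterClient interface {
	Push(ctx context.Context, req *WriteRequest) error
}

// WriteRequest is a stand-in for the real request type.
type WriteRequest struct{}

// pushWithTimeout gives the call a 2-second deadline. When the deadline
// fires, the caller returns and the client-side stream is released, but the
// server-side handler goroutine is not interrupted and can stay blocked on a
// mutex inside the ingester.
func pushWithTimeout(ctx context.Context, c IngesterClient, req *WriteRequest) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	return c.Push(ctx, req)
}
```

The point being illustrated is that cancellation only frees the client side; the ingester's goroutines keep piling up while they wait for the lock.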
Now let's consider what they are all blocked on. This is because one goroutine is part-way through claiming the write lock, so all readers wait.
Oddly, it's a query. It doesn't seem necessary for queries to do this (not that that would change the big picture - there are plenty of goroutines that do want to write).
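For background on why one goroutine "part-way through claiming the write lock" stalls all the readers: Go's sync.RWMutex blocks new RLock calls once a writer is waiting, so a single pending Lock queues every subsequent reader behind it. A small standalone demonstration:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // an existing reader holds the lock

	go func() {
		mu.Lock() // a writer starts waiting; new readers now queue behind it
		defer mu.Unlock()
		time.Sleep(100 * time.Millisecond)
	}()

	time.Sleep(10 * time.Millisecond) // let the writer reach Lock()

	go func() {
		mu.RLock() // blocks until the writer above has finished
		defer mu.RUnlock()
		fmt.Println("second reader ran only after the writer")
	}()

	time.Sleep(10 * time.Millisecond)
	mu.RUnlock() // release the original reader so the writer can proceed

	time.Sleep(200 * time.Millisecond)
}
```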
Does this imply there is some cause of inconsistency between the index and the series? I thought the queries should never cause a series to be created...
It's locking to create a …
Following this line of thinking, I increased the timeout to 10 seconds, thinking it would slow the growth. This did not prevent another ingester from bouncing at startup in the same way a bit later. I did see the distributors hit 1.5K goroutines. That time the largest set of blocked goroutines were waiting to write:
while one goroutine holds that lock and is waiting for readers to finish:
I am thinking we should shard userStates so more work can proceed in parallel, e.g. have 8 maps and 8 mutexes.
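A rough sketch of what that sharding could look like; shardedSeriesMap, memorySeries and the fingerprint key are illustrative stand-ins rather than the real userStates code:

```go
package ingester

import "sync"

const numShards = 8

// memorySeries is a stand-in for the real in-memory series type.
type memorySeries struct{}

// shardedSeriesMap is a hypothetical sharded replacement for a single
// map+RWMutex: each shard has its own lock, so a writer creating a series
// only blocks the goroutines that hash to the same shard.
type shardedSeriesMap struct {
	shards [numShards]struct {
		mtx    sync.RWMutex
		series map[uint64]*memorySeries
	}
}

func newShardedSeriesMap() *shardedSeriesMap {
	m := &shardedSeriesMap{}
	for i := range m.shards {
		m.shards[i].series = map[uint64]*memorySeries{}
	}
	return m
}

func (m *shardedSeriesMap) getOrCreate(fp uint64) *memorySeries {
	shard := &m.shards[fp%numShards]

	// Fast path: read lock only.
	shard.mtx.RLock()
	s, ok := shard.series[fp]
	shard.mtx.RUnlock()
	if ok {
		return s
	}

	// Slow path: take the write lock and re-check in case another
	// goroutine created the series in the meantime.
	shard.mtx.Lock()
	defer shard.mtx.Unlock()
	if s, ok := shard.series[fp]; ok {
		return s
	}
	s = &memorySeries{}
	shard.series[fp] = s
	return s
}
```

With 8 shards, a write-lock holder only stalls roughly an eighth of the appends instead of every append in the ingester.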
Perhaps sync.Map is more appropriate here? https://golang.org/pkg/sync/#Map
On Fri, 29 Jun 2018 at 19:51, Bryan Boreham wrote:
Current working theory is that calls from distributor to ingester are
timed out at 2 seconds, and when the caller cancels the call it is removed
from the count of streams.
Following this line of thinking, I increased the timeout to 10 seconds,
thinking it would slow the growth. This did not prevent another ingester
from bouncing at startup in the same way a bit later. I did see the
distributors hit 1.5K goroutines.
That time the largest set of blocked goroutines were waiting to write:
goroutine profile: total 6297
3781 @ 0x42d8aa 0x42d95e 0x43e044 0x43dd5d 0x470798 0x47169d 0xc3a0f6 0xc330d6 0xc32d93 0xbff996 0x819fde 0x832e66 0x831f69 0x832e66 0x81b250 0x832e66 0x81a14b 0x839058 0x832e66 0x833066 0xbe9987 0x7f80fc 0x7fba08 0x802bdf 0x45ac41
# 0x43dd5c sync.runtime_SemacquireMutex+0x3c /usr/local/go/src/runtime/sema.go:71
# 0x470797 sync.(*Mutex).Lock+0x107 /usr/local/go/src/sync/mutex.go:134
# 0x47169c sync.(*RWMutex).Lock+0x2c /usr/local/go/src/sync/rwmutex.go:93
# 0xc3a0f5 github.com/weaveworks/cortex/pkg/ingester.(*userStates).getOrCreateSeries+0x135 /go/src/github.com/weaveworks/cortex/pkg/ingester/user_state.go:158
# 0xc330d5 github.com/weaveworks/cortex/pkg/ingester.(*Ingester).append+0x255 /go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:344
# 0xc32d92 github.com/weaveworks/cortex/pkg/ingester.(*Ingester).Push+0xe2 /go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:304
# 0xbff995 github.com/weaveworks/cortex/pkg/ingester/client._Ingester_Push_Handler.func1+0x85 /go/src/github.com/weaveworks/cortex/pkg/ingester/client/cortex.pb.go:1713
while one goroutine holds that lock and is waiting for readers to finish:
1 @ 0x42d8aa 0x42d95e 0x43e044 0x43dc69 0x4716de 0xc3a0f6 0xc330d6 0xc32d93 0xbff996 0x819fde 0x832e66 0x831f69 0x832e66 0x81b250 0x832e66 0x81a14b 0x839058 0x832e66 0x833066 0xbe9987 0x7f80fc 0x7fba08 0x802bdf 0x45ac41
# 0x43dc68 sync.runtime_Semacquire+0x38 /usr/local/go/src/runtime/sema.go:56
# 0x4716dd sync.(*RWMutex).Lock+0x6d /usr/local/go/src/sync/rwmutex.go:98
# 0xc3a0f5 github.com/weaveworks/cortex/pkg/ingester.(*userStates).getOrCreateSeries+0x135 /go/src/github.com/weaveworks/cortex/pkg/ingester/user_state.go:158
# 0xc330d5 github.com/weaveworks/cortex/pkg/ingester.(*Ingester).append+0x255 /go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:344
# 0xc32d92 github.com/weaveworks/cortex/pkg/ingester.(*Ingester).Push+0xe2 /go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:304
# 0xbff995 github.com/weaveworks/cortex/pkg/ingester/client._Ingester_Push_Handler.func1+0x85 /go/src/github.com/weaveworks/cortex/pkg/ingester/client/cortex.pb.go:1713
I am thinking we should shard userStates so more work can proceed in
parallel. E.g. have 8 maps and 8 mutexes.
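For comparison with the sharding idea, here is a sketch of the sync.Map approach suggested above: Load gives lock-free reads on the hot append path, and LoadOrStore handles the create race. The fingerprint key and memorySeries type are again illustrative stand-ins:

```go
package ingester

import "sync"

// memorySeries is a stand-in for the real in-memory series type.
type memorySeries struct{}

// seriesByFingerprint sketches a sync.Map-based getOrCreateSeries: lookups
// of existing series take no lock at all, which is exactly the hot path
// blocked in the goroutine dumps above.
type seriesByFingerprint struct {
	m sync.Map // fingerprint (uint64) -> *memorySeries
}

func (s *seriesByFingerprint) getOrCreate(fp uint64) *memorySeries {
	if v, ok := s.m.Load(fp); ok {
		return v.(*memorySeries)
	}
	// LoadOrStore either returns the value another goroutine stored first,
	// or stores and returns the one we just allocated.
	v, _ := s.m.LoadOrStore(fp, &memorySeries{})
	return v.(*memorySeries)
}
```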
I did some reading and found your suggestion entirely plausible: very likely to work much better in steady state, though not guaranteed to work better at startup (which is where I'm seeing the problem). I kinda feel I need the whole thing to slow down. I tried increasing the 2-second timeout to 10 seconds, but now the distributors blow up.
Now that sync.Map is in, can we close this? Or do you think we should stick a limit on the number of concurrent requests too?
A very similar thing happened today - this time the cause may have been #1301, because we were rejecting 300,000 samples per second at the time. So yes, I want to limit the goroutines.
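One simple way to bound the goroutines, sketched here with a buffered-channel semaphore around the push path; this is an illustrative pattern, not the limiter Cortex actually adopted:

```go
package ingester

import (
	"context"
	"errors"
)

// errTooManyInflight is returned when the ingester is already handling the
// maximum number of concurrent pushes.
var errTooManyInflight = errors.New("too many in-flight push requests")

// pushLimiter caps the number of push handlers running at once; anything
// beyond the cap fails fast instead of piling up as blocked goroutines.
type pushLimiter struct {
	slots chan struct{}
}

func newPushLimiter(max int) *pushLimiter {
	return &pushLimiter{slots: make(chan struct{}, max)}
}

func (l *pushLimiter) do(ctx context.Context, push func(context.Context) error) error {
	select {
	case l.slots <- struct{}{}: // acquire a slot
	default:
		return errTooManyInflight // shed load instead of queueing
	}
	defer func() { <-l.slots }() // release the slot
	return push(ctx)
}
```

Failing fast keeps memory bounded and lets the caller retry, instead of letting tens of thousands of handlers queue on the same mutex.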
We went through this issue during our bug scrub and decided to close it, as we haven't seen it in recent times. We will re-open if we spot it again, but it is likely fixed.
Had one ingester OOM a few times in succession today; the thing that stands out is the number of goroutines:
No obvious cause; in the logs I can see someone was playing silly games with long label values, but not at high rates, and if that was the problem I would expect it to hit 3 ingesters rather than 1.