Ingester blowing up to tens of thousands of goroutines #858

Closed

bboreham opened this issue Jun 28, 2018 · 12 comments
@bboreham
Contributor

Had one ingester OOM a few times in succession today; the thing that stands out is the number of goroutines:

[image: graph of ingester goroutine count spiking to tens of thousands]

No obvious cause. In the logs I can see someone was playing silly games with long label values, but not at high rates, and if that were the problem I would expect it to hit 3 ingesters (the replication factor) rather than just one.

@bboreham
Contributor Author

bboreham commented Jun 28, 2018

At the same time I see a couple of queriers also blowing up to a huge size (5GB), but not to huge numbers of goroutines.

@bboreham
Contributor Author

bboreham commented Jun 29, 2018

I managed to get a goroutine profile:

goroutine profile: total 8004
4947 @ 0x42d8aa 0x42d95e 0x43e044 0x43dc69 0x4715b9 0xc3a066 0xc330d6 0xc32d93 0xbff996 0x819fde 0x832e66 0x831f69 0x832e66 0x81b250 0x832e66 0x81a14b 0x839058 0x832e66 0x833066 0xbe9987 0x7f80fc 0x7fba08 0x802bdf 0x45ac41
#	0x43dc68	sync.runtime_Semacquire+0x38														/usr/local/go/src/runtime/sema.go:56
#	0x4715b8	sync.(*RWMutex).RLock+0x48														/usr/local/go/src/sync/rwmutex.go:50
#	0xc3a065	github.com/weaveworks/cortex/pkg/ingester.(*userStates).getOrCreateSeries+0xa5								/go/src/github.com/weaveworks/cortex/pkg/ingester/user_state.go:144
#	0xc330d5	github.com/weaveworks/cortex/pkg/ingester.(*Ingester).append+0x255									/go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:344
#	0xc32d92	github.com/weaveworks/cortex/pkg/ingester.(*Ingester).Push+0xe2										/go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:304
#	0xbff995	github.com/weaveworks/cortex/pkg/ingester/client._Ingester_Push_Handler.func1+0x85							/go/src/github.com/weaveworks/cortex/pkg/ingester/client/cortex.pb.go:1713
#	0x819fdd	github.com/weaveworks/cortex/vendor/github.com/weaveworks/common/middleware.ServerUserHeaderInterceptor+0x9d				/go/src/github.com/weaveworks/cortex/vendor/github.com/weaveworks/common/middleware/grpc_auth.go:27
#	0x832e65	github.com/weaveworks/cortex/vendor/github.com/mwitkow/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x65				/go/src/github.com/weaveworks/cortex/vendor/github.com/mwitkow/go-grpc-middleware/chain.go:22
#	0x831f68	github.com/weaveworks/cortex/vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc.OpenTracingServerInterceptor.func1+0x308	/go/src/github.com/weaveworks/cortex/vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc/server.go:61
#	0x832e65	github.com/weaveworks/cortex/vendor/github.com/mwitkow/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x65				/go/src/github.com/weaveworks/cortex/vendor/github.com/mwitkow/go-grpc-middleware/chain.go:22
#	0x81b24f	github.com/weaveworks/cortex/vendor/github.com/weaveworks/common/middleware.ServerInstrumentInterceptor.func1+0x9f			/go/src/github.com/weaveworks/cortex/vendor/github.com/weaveworks/common/middleware/grpc_instrumentation.go:17
#	0x832e65	github.com/weaveworks/cortex/vendor/github.com/mwitkow/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x65				/go/src/github.com/weaveworks/cortex/vendor/github.com/mwitkow/go-grpc-middleware/chain.go:22
#	0x81a14a	github.com/weaveworks/cortex/vendor/github.com/weaveworks/common/middleware.GRPCServerLog.UnaryServerInterceptor+0x9a			/go/src/github.com/weaveworks/cortex/vendor/github.com/weaveworks/common/middleware/grpc_logging.go:24
#	0x839057	github.com/weaveworks/cortex/vendor/github.com/weaveworks/common/middleware.(GRPCServerLog).UnaryServerInterceptor-fm+0x77		/go/src/github.com/weaveworks/cortex/vendor/github.com/weaveworks/common/server/server.go:97
#	0x832e65	github.com/weaveworks/cortex/vendor/github.com/mwitkow/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x65				/go/src/github.com/weaveworks/cortex/vendor/github.com/mwitkow/go-grpc-middleware/chain.go:22
#	0x833065	github.com/weaveworks/cortex/vendor/github.com/mwitkow/go-grpc-middleware.ChainUnaryServer.func1+0xf5					/go/src/github.com/weaveworks/cortex/vendor/github.com/mwitkow/go-grpc-middleware/chain.go:29
#	0xbe9986	github.com/weaveworks/cortex/pkg/ingester/client._Ingester_Push_Handler+0x166								/go/src/github.com/weaveworks/cortex/pkg/ingester/client/cortex.pb.go:1715
#	0x7f80fb	github.com/weaveworks/cortex/vendor/google.golang.org/grpc.(*Server).processUnaryRPC+0x8ab						/go/src/github.com/weaveworks/cortex/vendor/google.golang.org/grpc/server.go:921
#	0x7fba07	github.com/weaveworks/cortex/vendor/google.golang.org/grpc.(*Server).handleStream+0x1317						/go/src/github.com/weaveworks/cortex/vendor/google.golang.org/grpc/server.go:1143
#	0x802bde	github.com/weaveworks/cortex/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1+0x9e						/go/src/github.com/weaveworks/cortex/vendor/google.golang.org/grpc/server.go:638

So why isn't --ingester.max-concurrent-streams protecting us from this?

Current working theory is that calls from the distributor to the ingester time out after 2 seconds, and when the caller cancels the call it is removed from the count of streams, even though the server-side goroutine handling it keeps running.
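
To illustrate why the stream limit doesn't bound the goroutines under this theory, here is a minimal, self-contained sketch (not Cortex code; the channel stands in for the per-connection stream accounting): when the caller's timeout fires, the stream slot is released, but the server-side goroutine stuck behind a lock lives on, so the number of blocked goroutines grows well past the limit.

```go
// Sketch: the "stream slot" is freed when the caller gives up, but the
// handler goroutine blocked on the lock keeps running.
package main

import (
	"context"
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	const maxStreams = 2
	slots := make(chan struct{}, maxStreams) // stands in for max-concurrent-streams
	var mtx sync.Mutex
	mtx.Lock() // simulates a long-held lock inside the ingester

	for i := 0; i < 10; i++ {
		slots <- struct{}{} // a new "stream" is admitted only when a slot is free
		ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)

		go func() { // server-side handler: blocks on the lock, never checks ctx
			mtx.Lock()
			mtx.Unlock()
		}()
		go func() { // caller side: on timeout the stream closes and the slot is freed,
			<-ctx.Done() // but the handler goroutine above is still alive
			cancel()
			<-slots
		}()
	}

	time.Sleep(300 * time.Millisecond)
	fmt.Println("goroutines:", runtime.NumGoroutine()) // ~11: all 10 handlers, despite maxStreams == 2
	mtx.Unlock()
}
```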

@bboreham
Contributor Author

Now let's consider what they are all blocked on: us.mtx.RLock()

This is because one goroutine is part-way through claiming the write lock, so all readers wait:

1 @ 0x42d8aa 0x42d95e 0x43e044 0x43dc69 0x4716de 0xc39ea3 0xc33965 0xc33792 0xbffa86 0x819fde 0x832e66 0x831f69 0x832e66 0x81b250 0x832e66 0x81a14b 0x839058 0x832e66 0x833066 0xbe9c37 0x7f80fc 0x7fba08 0x802bdf 0x45ac41
#	0x43dc68	sync.runtime_Semacquire+0x38														/usr/local/go/src/runtime/sema.go:56
#	0x4716dd	sync.(*RWMutex).Lock+0x6d														/usr/local/go/src/sync/rwmutex.go:98
#	0xc39ea2	github.com/weaveworks/cortex/pkg/ingester.(*userStates).getOrCreate+0xf2								/go/src/github.com/weaveworks/cortex/pkg/ingester/user_state.go:126
#	0xc33964	github.com/weaveworks/cortex/pkg/ingester.(*Ingester).query+0x114									/go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:387
#	0xc33791	github.com/weaveworks/cortex/pkg/ingester.(*Ingester).Query+0x111									/go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:374
#	0xbffa85	github.com/weaveworks/cortex/pkg/ingester/client._Ingester_Query_Handler.func1+0x85							/go/src/github.com/weaveworks/cortex/pkg/ingester/client/cortex.pb.go:1731

Oddly, it's a query; it doesn't seem necessary for queries to take the write lock. (Not that fixing that would change the big picture: there are plenty of goroutines that do want to write.)
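
For reference, this is the documented behaviour of sync.RWMutex: once a writer is waiting in Lock(), subsequent RLock() calls queue behind it. A minimal sketch (illustrative only, not the ingester code):

```go
// Sketch of sync.RWMutex behaviour: a pending writer blocks later readers.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mtx sync.RWMutex

	mtx.RLock() // an existing reader holds the lock (e.g. a slow query)

	go func() {
		mtx.Lock() // a writer (getOrCreate) arrives and waits for that reader
		mtx.Unlock()
	}()
	time.Sleep(10 * time.Millisecond) // give the writer time to start waiting

	done := make(chan struct{})
	go func() {
		mtx.RLock() // a new reader (getOrCreateSeries) now queues behind the writer
		mtx.RUnlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("second reader got in (unexpected)")
	case <-time.After(100 * time.Millisecond):
		fmt.Println("second reader is blocked behind the pending writer")
	}

	mtx.RUnlock() // release the first reader; the writer, then the new reader, proceed
	<-done
}
```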

@tomwilkie
Contributor

Does this imply there is some cause of inconsistency between the index and the series? I thought the queries should never cause a series to be created...

@bboreham
Contributor Author

It's locking to create a userState, not a series.

@bboreham
Contributor Author

Current working theory is that calls from the distributor to the ingester time out after 2 seconds, and when the caller cancels the call it is removed from the count of streams, even though the server-side goroutine handling it keeps running.

Following this line of thinking, I increased the timeout to 10 seconds, thinking it would slow the growth. This did not prevent another ingester from bouncing at startup in the same way a bit later. I did see the distributors hit 1.5K goroutines.

That time the largest group of blocked goroutines was waiting to write:

goroutine profile: total 6297
3781 @ 0x42d8aa 0x42d95e 0x43e044 0x43dd5d 0x470798 0x47169d 0xc3a0f6 0xc330d6 0xc32d93 0xbff996 0x819fde 0x832e66 0x831f69 0x832e66 0x81b250 0x832e66 0x81a14b 0x839058 0x832e66 0x833066 0xbe9987 0x7f80fc 0x7fba08 0x802bdf 0x45ac41
#	0x43dd5c	sync.runtime_SemacquireMutex+0x3c													/usr/local/go/src/runtime/sema.go:71
#	0x470797	sync.(*Mutex).Lock+0x107														/usr/local/go/src/sync/mutex.go:134
#	0x47169c	sync.(*RWMutex).Lock+0x2c														/usr/local/go/src/sync/rwmutex.go:93
#	0xc3a0f5	github.com/weaveworks/cortex/pkg/ingester.(*userStates).getOrCreateSeries+0x135								/go/src/github.com/weaveworks/cortex/pkg/ingester/user_state.go:158
#	0xc330d5	github.com/weaveworks/cortex/pkg/ingester.(*Ingester).append+0x255									/go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:344
#	0xc32d92	github.com/weaveworks/cortex/pkg/ingester.(*Ingester).Push+0xe2										/go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:304
#	0xbff995	github.com/weaveworks/cortex/pkg/ingester/client._Ingester_Push_Handler.func1+0x85							/go/src/github.com/weaveworks/cortex/pkg/ingester/client/cortex.pb.go:1713

while one goroutine holds that lock and is waiting for readers to finish:

1 @ 0x42d8aa 0x42d95e 0x43e044 0x43dc69 0x4716de 0xc3a0f6 0xc330d6 0xc32d93 0xbff996 0x819fde 0x832e66 0x831f69 0x832e66 0x81b250 0x832e66 0x81a14b 0x839058 0x832e66 0x833066 0xbe9987 0x7f80fc 0x7fba08 0x802bdf 0x45ac41
#	0x43dc68	sync.runtime_Semacquire+0x38														/usr/local/go/src/runtime/sema.go:56
#	0x4716dd	sync.(*RWMutex).Lock+0x6d														/usr/local/go/src/sync/rwmutex.go:98
#	0xc3a0f5	github.com/weaveworks/cortex/pkg/ingester.(*userStates).getOrCreateSeries+0x135								/go/src/github.com/weaveworks/cortex/pkg/ingester/user_state.go:158
#	0xc330d5	github.com/weaveworks/cortex/pkg/ingester.(*Ingester).append+0x255									/go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:344
#	0xc32d92	github.com/weaveworks/cortex/pkg/ingester.(*Ingester).Push+0xe2										/go/src/github.com/weaveworks/cortex/pkg/ingester/ingester.go:304
#	0xbff995	github.com/weaveworks/cortex/pkg/ingester/client._Ingester_Push_Handler.func1+0x85							/go/src/github.com/weaveworks/cortex/pkg/ingester/client/cortex.pb.go:1713

I am thinking we should shard userStates so more work can proceed in parallel. E.g. have 8 maps and 8 mutexes.
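
A rough sketch of that sharding idea, with illustrative names rather than the real Cortex types: hash the user ID to one of 8 shards, so a write lock on one shard no longer stalls pushes for every other tenant.

```go
// Sketch of sharded user states: 8 maps guarded by 8 locks.
package usersharding

import (
	"hash/fnv"
	"sync"
)

const numShards = 8

type userState struct{ /* per-user series, index, etc. */ }

type shard struct {
	mtx   sync.RWMutex
	users map[string]*userState
}

type shardedUserStates struct {
	shards [numShards]shard
}

func newShardedUserStates() *shardedUserStates {
	s := &shardedUserStates{}
	for i := range s.shards {
		s.shards[i].users = map[string]*userState{}
	}
	return s
}

func (s *shardedUserStates) shardFor(userID string) *shard {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return &s.shards[h.Sum32()%numShards]
}

// getOrCreate takes the read lock on a single shard for the common case,
// and only takes that shard's write lock when the user is new.
func (s *shardedUserStates) getOrCreate(userID string) *userState {
	sh := s.shardFor(userID)

	sh.mtx.RLock()
	us, ok := sh.users[userID]
	sh.mtx.RUnlock()
	if ok {
		return us
	}

	sh.mtx.Lock()
	defer sh.mtx.Unlock()
	if us, ok := sh.users[userID]; ok { // re-check: another goroutine may have won the race
		return us
	}
	us = &userState{}
	sh.users[userID] = us
	return us
}
```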

@tomwilkie
Contributor

tomwilkie commented Jun 29, 2018 via email

@bboreham
Contributor Author

bboreham commented Jul 2, 2018

Perhaps sync.Map is more appropriate here? https://golang.org/pkg/sync/#Map

I did some reading and found your suggestion entirely plausible: very likely to work much better in steady state, though not guaranteed to work better at startup (which is where I'm seeing the problem).
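
For comparison, a minimal sketch of what a sync.Map-based getOrCreate could look like (again with illustrative names, not the real Cortex code): LoadOrStore avoids holding a global lock on the hot path entirely.

```go
// Sketch of the sync.Map approach for per-user state lookup.
package usersyncmap

import "sync"

type userState struct{ /* per-user series, index, etc. */ }

type userStates struct {
	states sync.Map // userID (string) -> *userState
}

func (u *userStates) getOrCreate(userID string) *userState {
	if v, ok := u.states.Load(userID); ok {
		return v.(*userState)
	}
	// LoadOrStore returns the existing value if another goroutine raced us.
	v, _ := u.states.LoadOrStore(userID, &userState{})
	return v.(*userState)
}
```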

I kinda feel I need the whole thing to slow down. I tried increasing the 2-second timeout to 10 seconds, but now the distributors blow up instead.

@tomwilkie
Contributor

Now that sync.Map is in, can we close this? Or do you think we should put a limit on the number of concurrent requests too?

@bboreham
Contributor Author

A very similar thing happened today; this time the cause may have been #1301, because we were rejecting 300,000 samples per second at the time.

So yes, I want to limit the goroutines.

@gouthamve
Contributor

We went through this issue during our bug scrub and decided to close it, as we haven't seen it in recent times. We will re-open if we spot it again, but it is likely fixed.

@bboreham
Contributor Author

Note that -ingester.instance-limits.max-inflight-push-requests was added in 1.9.0; it will fail any incoming push request once we get past the limit.
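
For readers hitting this later, a rough sketch of how such an inflight limit can be enforced (this is not the actual Cortex implementation, just the general shape): an atomic counter that rejects pushes outright once the limit is exceeded, instead of letting goroutines pile up behind a lock.

```go
// Sketch of an inflight-push-requests limit: fail fast past the limit.
package inflightlimit

import (
	"errors"
	"sync/atomic"
)

var errTooManyInflight = errors.New("too many inflight push requests")

type pushLimiter struct {
	inflight int64
	max      int64
}

// withLimit runs push only if the number of inflight requests is under max;
// otherwise it returns an error immediately so the caller can back off.
func (l *pushLimiter) withLimit(push func() error) error {
	if atomic.AddInt64(&l.inflight, 1) > l.max {
		atomic.AddInt64(&l.inflight, -1)
		return errTooManyInflight
	}
	defer atomic.AddInt64(&l.inflight, -1)
	return push()
}
```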
