
when ingester starting, ingester increase thousands goroutines #4393

Open · georgehao opened this issue Aug 3, 2021 · 8 comments

georgehao (Contributor) commented Aug 3, 2021

Describe the bug

Hi, when an ingester has just started, its goroutine count increases by thousands. But after waiting about 1h, once the ingester has finished starting up, the goroutine count drops to around 100.

There are two observations:

  1. When the ingester has been running for a while, the load on the server is not high.
  2. But when all the ingesters start at the same time, none of them can finish starting: their goroutine counts grow very large until they finally OOM and restart.

when starting:

[image: pprof graphviz diagram of goroutines during startup]

when started:

[images: pprof graphviz diagrams after startup]
To Reproduce

  1. cortex version: release-1.9.0
  2. no ingestion_rate

Expected behavior

Why are there so many goroutines while the ingester is starting?

Storage Engine

  • Blocks
  • Chunks

thanks

georgehao changed the title from "Ingester start a moment, the goroutines increase thousands goroutines" to "when ingester starting, ingester increase thousands goroutines" on Aug 3, 2021
bboreham (Contributor) commented Aug 9, 2021

Can you upload the actual profile? I find those graphviz diagrams unreadable.
Optionally using https://share.polarsignals.com/.

georgehao (Contributor, Author) commented Aug 13, 2021

goroutine profile: total 63948
63873 @ 0x43a7c5 0x44b2e5 0x44b2ce 0x46c5a7 0x47b825 0x47cf90 0x47cf22 0x9fc995 0x9fc725 0x9f816a 0x1fdd276 0x1fd292d 0x16ede89 0x202eee3 0xb8dda3 0x201b256 0xb8dda3 0xb91b7a 0xb8dda3 0xccd3bd 0xb8dda3 0xcca2b8 0xb8dda3 0xb8df97 0x16d3890 0xb3e6a2 0xb42825 0xb50ee5 0x470621
#	0x46c5a6	sync.runtime_SemacquireMutex+0x46							/data1/export/go1.15.7/src/runtime/sema.go:71
#	0x47b824	sync.(*Mutex).lockSlow+0x104								/data1/export/go1.15.7/src/sync/mutex.go:138
#	0x47cf8f	sync.(*Mutex).Lock+0x8f									/data1/export/go1.15.7/src/sync/mutex.go:81
#	0x47cf21	sync.(*RWMutex).Lock+0x21								/data1/export/go1.15.7/src/sync/rwmutex.go:98
#	0x9fc994	github.com/prometheus/prometheus/tsdb.(*Head).getOrCreateWithID+0x214			/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/github.com/prometheus/prometheus/tsdb/head.go:1800
#	0x9fc724	github.com/prometheus/prometheus/tsdb.(*Head).getOrCreate+0xe4				/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/github.com/prometheus/prometheus/tsdb/head.go:1783
#	0x9f8169	github.com/prometheus/prometheus/tsdb.(*headAppender).Add+0x129				/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/github.com/prometheus/prometheus/tsdb/head.go:1193
#	0x1fdd275	github.com/cortexproject/cortex/pkg/ingester.(*Ingester).v2Push+0x15d5			/home/xx/gopath/src/github.com/cortexproject/cortex/pkg/ingester/ingester_v2.go:766
#	0x1fd292c	github.com/cortexproject/cortex/pkg/ingester.(*Ingester).Push+0x9ec			/home/xx/gopath/src/github.com/cortexproject/cortex/pkg/ingester/ingester.go:445
#	0x16ede88	github.com/cortexproject/cortex/pkg/ingester/client._Ingester_Push_Handler.func1+0x88	/home/xx/gopath/src/github.com/cortexproject/cortex/pkg/ingester/client/ingester.pb.go:2565
#	0x202eee2	github.com/cortexproject/cortex/pkg/cortex.ThanosTracerUnaryInterceptor+0xa2		/home/xx/gopath/src/github.com/cortexproject/cortex/pkg/cortex/tracing.go:14
#	0xb8dda2	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0x201b255	github.com/cortexproject/cortex/pkg/util/fakeauth.glob..func2+0xb5			/home/xx/gopath/src/github.com/cortexproject/cortex/pkg/util/fakeauth/fake_auth.go:60
#	0xb8dda2	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xb91b79	github.com/opentracing-contrib/go-grpc.OpenTracingServerInterceptor.func1+0x2d9		/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/github.com/opentracing-contrib/go-grpc/server.go:57
#	0xb8dda2	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xccd3bc	github.com/weaveworks/common/middleware.UnaryServerInstrumentInterceptor.func1+0x9c	/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/github.com/weaveworks/common/middleware/grpc_instrumentation.go:32
#	0xb8dda2	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xcca2b7	github.com/weaveworks/common/middleware.GRPCServerLog.UnaryServerInterceptor+0x97	/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/github.com/weaveworks/common/middleware/grpc_logging.go:29
#	0xb8dda2	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xb8df96	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1+0xd6		/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34
#	0x16d388f	github.com/cortexproject/cortex/pkg/ingester/client._Ingester_Push_Handler+0x14f	/home/xx/gopath/src/github.com/cortexproject/cortex/pkg/ingester/client/ingester.pb.go:2567
#	0xb3e6a1	google.golang.org/grpc.(*Server).processUnaryRPC+0x521					/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/google.golang.org/grpc/server.go:1210
#	0xb42824	google.golang.org/grpc.(*Server).handleStream+0xd04					/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/google.golang.org/grpc/server.go:1533
#	0xb50ee4	google.golang.org/grpc.(*Server).serveStreams.func1.2+0xa4				/home/xx/gopath/src/github.com/cortexproject/cortex/vendor/google.golang.org/grpc/server.go:871

bboreham (Contributor) commented:
The large number looks the same as #4324. We added a limit in #3992: -ingester.instance-limits.max-inflight-push-requests.

With the full profile I could check to see if anything unusual was blocking them.

georgehao (Contributor, Author) commented:
@bboreham The -ingester.instance-limits.max-inflight-push-requests option is not a good fit here. The goroutine count only reaches thousands while the ingester is starting; once it has started, there are only about 100 goroutines. So just to get the ingesters through startup, I would have to add a lot of machines to avoid the lock contention, which is expensive.

I did also try -ingester.instance-limits.max-inflight-push-requests, but because each request carries so many timeseries, max-inflight-push-requests would have to be set very low, so it does not help while the ingester is starting.

I also checked the reason: all the mutexes are blocked at "github.com/prometheus/prometheus/tsdb.(*Head).getOrCreateWithID+0x214".

The root cause is:

  1. When the ingester is starting, the hashes of the series labels are not yet in memory, so every series has to be created in the TSDB head.
  2. Creating a head series requires adding it to the TSDB in-memory postings (MemPostings).
  3. Adding to MemPostings requires keeping all the series IDs sorted.
  4. When the number of series IDs is large, this takes a lot of time.

So the solution may be to reduce the cost of sorting the postings series IDs. (A sketch of the path involved is below.)
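For illustration, here is a minimal, hypothetical Go sketch of the path described above. The names memPostings, head, and getOrCreate are simplified stand-ins, not the real Cortex/Prometheus types: the series ID is handed out before the head lock is taken, every new series has to pass through one lock, and the postings list must stay sorted, so on a cold start (where every series is new) all concurrent Push goroutines queue on the same critical section.

```go
package sketch

import (
	"sort"
	"sync"
	"sync/atomic"
)

// memPostings is a simplified stand-in for Prometheus' index.MemPostings:
// a label value -> sorted list of series IDs.
type memPostings struct {
	m map[string][]uint64
}

func (p *memPostings) add(id uint64, labelValue string) {
	list := append(p.m[labelValue], id)
	// The postings list is kept sorted on insert; when id arrives out of
	// order, this is where the extra work happens.
	sort.Slice(list, func(i, j int) bool { return list[i] < list[j] })
	p.m[labelValue] = list
}

// head is a simplified stand-in for tsdb.Head, with the single lock that the
// stack trace above shows every Push goroutine blocked on.
type head struct {
	mtx          sync.RWMutex
	series       map[string]uint64 // series-label hash -> series ID
	lastSeriesID uint64
	postings     memPostings
}

func newHead() *head {
	return &head{
		series:   map[string]uint64{},
		postings: memPostings{m: map[string][]uint64{}},
	}
}

// getOrCreate mimics the shape of Head.getOrCreate/getOrCreateWithID. During a
// cold start every lookup misses, so every Push goroutine ends up waiting for
// the write lock below.
func (h *head) getOrCreate(seriesHash, labelValue string) uint64 {
	h.mtx.RLock()
	id, ok := h.series[seriesHash]
	h.mtx.RUnlock()
	if ok {
		return id // warm path: after startup almost every call returns here
	}

	// The new ID is allocated before the write lock is acquired...
	newID := atomic.AddUint64(&h.lastSeriesID, 1)

	h.mtx.Lock()
	defer h.mtx.Unlock()
	if id, ok := h.series[seriesHash]; ok {
		return id
	}
	h.series[seriesHash] = newID
	// ...so larger IDs may already be in the postings list by the time this runs.
	h.postings.add(newID, labelValue)
	return newID
}
```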

georgehao (Contributor, Author) commented:
There may also be another way, like #4408: if one tenant could be split into many and still be queried as one, then on the ingester push path there would be many per-tenant DBs, which would reduce the lock contention. (A rough illustration of the idea is below.)
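As a purely hypothetical illustration of that direction (not how #4408 would actually be implemented): if the per-tenant series state were split into several independently locked shards, concurrent pushes would only contend when they hash to the same shard. The shardedSeries type and the fixed shard count are inventions for this sketch only.

```go
package sketch

import (
	"hash/fnv"
	"sync"
)

// shardedSeries splits one tenant's series index into several shards, each
// with its own lock, so concurrent pushes only contend within a shard.
type shardedSeries struct {
	shards []*seriesShard
}

type seriesShard struct {
	mtx    sync.Mutex
	series map[string]uint64
}

func newShardedSeries(n int) *shardedSeries {
	s := &shardedSeries{shards: make([]*seriesShard, n)}
	for i := range s.shards {
		s.shards[i] = &seriesShard{series: map[string]uint64{}}
	}
	return s
}

// shardFor picks a shard by hashing the series key; with N shards, each mutex
// sees roughly 1/N of the previous contention.
func (s *shardedSeries) shardFor(seriesKey string) *seriesShard {
	h := fnv.New32a()
	h.Write([]byte(seriesKey))
	return s.shards[h.Sum32()%uint32(len(s.shards))]
}

// getOrCreate registers the series under its shard's lock only; newID stands
// in for however the series ID would be allocated.
func (s *shardedSeries) getOrCreate(seriesKey string, newID func() uint64) uint64 {
	shard := s.shardFor(seriesKey)
	shard.mtx.Lock()
	defer shard.mtx.Unlock()
	if id, ok := shard.series[seriesKey]; ok {
		return id
	}
	id := newID()
	shard.series[seriesKey] = id
	return id
}
```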

bboreham (Contributor) commented Aug 17, 2021

Now it looks like #3349. We had a PR proposed which sharded locks.

I think you're saying MemPostings.addFor() does a sort which is slow. It does look suspiciously like a bubble-sort.

All of that code now comes from Prometheus; it would be necessary to create a benchmark there to demonstrate the issue, and a PR there to fix it.
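To make that concrete, here is a self-contained sketch of the kind of benchmark that could demonstrate it. It does not use the real MemPostings type; addFor below is a stand-in modeled on the repair-on-insert behaviour being discussed (swap the new ID backwards until the list is sorted again), so it only illustrates the ordered-vs-shuffled gap, not actual Prometheus performance.

```go
package sketch

import (
	"math/rand"
	"testing"
)

// addFor is a stand-in for the repair-on-insert behaviour being discussed:
// append the new ID, then swap it backwards until the list is sorted again.
// With ascending IDs the loop exits immediately; with out-of-order IDs it can
// walk a long way back through the list.
func addFor(list []uint64, id uint64) []uint64 {
	list = append(list, id)
	for i := len(list) - 1; i >= 1 && list[i] < list[i-1]; i-- {
		list[i], list[i-1] = list[i-1], list[i]
	}
	return list
}

func benchAddFor(b *testing.B, ids []uint64) {
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		list := make([]uint64, 0, len(ids))
		for _, id := range ids {
			list = addFor(list, id)
		}
	}
}

func makeIDs(n int) []uint64 {
	ids := make([]uint64, n)
	for i := range ids {
		ids[i] = uint64(i + 1)
	}
	return ids
}

// BenchmarkAddForOrdered inserts IDs in the order an incrementing counter
// would hand them out: the fast case.
func BenchmarkAddForOrdered(b *testing.B) {
	benchAddFor(b, makeIDs(20_000))
}

// BenchmarkAddForShuffled inserts the same IDs in random order: the slow,
// quadratic case. The input is kept small because one iteration already does
// on the order of n^2/4 swaps.
func BenchmarkAddForShuffled(b *testing.B) {
	ids := makeIDs(20_000)
	rand.Shuffle(len(ids), func(i, j int) { ids[i], ids[j] = ids[j], ids[i] })
	benchAddFor(b, ids)
}
```

Put in a *_test.go file and run with `go test -bench AddFor`; the shuffled case should come out dramatically slower than the ordered one at the same input size.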

By the way, you said "cortex version: release-1.9.0", but this line from the profile extract does not match:

0x9fc994 github.com/prometheus/prometheus/tsdb.(*Head).getOrCreateWithID+0x214 /home/xx/gopath/src/github.com/cortexproject/cortex/vendor/github.com/prometheus/prometheus/tsdb/head.go:1800

Here is line 1800 for v1.9.0: [embedded code snippet]

Version 1.8 is a much better match: [embedded code snippet]

bboreham (Contributor) commented:
I looked at the Prometheus code here:
https://github.com/prometheus/prometheus/blob/bb05485c79084fecd3602eceafca3d554ab88987/tsdb/index/postings.go#L326-L331

which, if series IDs were in random order, would be very slow.
However in Cortex the series IDs are generated by incrementing a counter, so we always add to the end of the list.

georgehao (Contributor, Author) commented Aug 18, 2021

@bboreham Yes, the pprof profile is from release-1.8; I copied the wrong one.

Yeah, I mean MemPostings.addFor().

seriesID increase

It seems the seriesID is created and incremented by the Prometheus tsdb.Head: https://github.com/prometheus/prometheus/blob/bb05485c79/tsdb/head.go#L1185

The seriesID is incremented via h.lastSeriesID.Inc(), so it only ever grows.

So in Prometheus, if the seriesIDs reach MemPostings.addFor() in order, each one is simply added to the end of the list. BUT...

cortex

But Cortex only has one db (per tenant), and that db only has one head, so when v2Push is called from many goroutines, lastSeriesID is incremented inside Prometheus' Head.getOrCreate, but because of the Head.getOrCreateWithID lock, the series may not be inserted into the MemPostings right away.

So when a goroutine finally acquires the lock and adds its series to head.MemPostings, its seriesID may be smaller than the last seriesID already inserted. In my test, while the Cortex ingester is starting, this lock lets the seriesIDs arrive out of order, so sorting the seriesIDs costs a lot of time. (A sketch of this interleaving is below.)
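A small runnable sketch of that interleaving, again with simplified stand-in names rather than the real Cortex/Prometheus code: IDs come from an atomic counter outside the lock, a random sleep stands in for the time spent waiting on the getOrCreateWithID lock, and the program prints the order in which IDs actually reach the postings list.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	var (
		lastSeriesID uint64     // stands in for Head.lastSeriesID
		mtx          sync.Mutex // stands in for the getOrCreateWithID lock
		inserted     []uint64   // order in which IDs reach the postings list
		wg           sync.WaitGroup
	)

	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Step 1: the ID is allocated before the lock is taken.
			id := atomic.AddUint64(&lastSeriesID, 1)

			// Step 2: the goroutine then waits for the head lock; under heavy
			// startup contention this wait is long and effectively random.
			time.Sleep(time.Duration(rand.Intn(5)) * time.Millisecond)

			mtx.Lock()
			inserted = append(inserted, id)
			mtx.Unlock()
		}()
	}
	wg.Wait()

	// IDs are allocated as 1,2,3,... but usually do not arrive in that order,
	// which is what forces the postings code to repair the ordering on insert.
	fmt.Println("insertion order:", inserted)
}
```

Running it a few times shows the printed order is rarely ascending, which is the situation where the repair-on-insert work becomes expensive.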
