
Ingester bounds #3992

Merged
merged 27 commits on Apr 9, 2021
Conversation

pstibrany
Contributor

@pstibrany pstibrany commented Mar 22, 2021

What this PR does: This PR implements various global (per-ingester, not per-tenant) limits for use by the ingester:

  • Max number of series in memory
  • Max number of tenants in memory
  • Max number of inflight requests
  • Max ingestion rate.

All of these limits are disabled by default. They can be changed via config file or CLI parameters, and via runtime configuration (to avoid redeploying ingesters). Current limits are exported as the cortex_ingester_global_limit metric with a limit label distinguishing each one. If the ingester finds that a push request would exceed one of these limits, it returns a 500 error.
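As a hedged sketch of the inflight-requests limit described above (this is not the PR's actual code; `instanceLimiter`, `startPush`, and the zero-means-disabled convention are illustrative assumptions, though the error message matches the one observed later in this thread):

```go
// Illustrative sketch of a per-ingester inflight-push-request limit using
// an atomic counter. A limit of 0 means "disabled", matching the PR's
// "disabled by default" behaviour.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errTooManyInflight = errors.New("cannot push: too many inflight push requests in ingester")

type instanceLimiter struct {
	maxInflightPushRequests int64 // 0 = disabled
	inflight                int64
}

// startPush reserves an inflight slot, or fails if the limit is reached.
// The caller must invoke the returned finish() when the push completes.
func (l *instanceLimiter) startPush() (finish func(), err error) {
	n := atomic.AddInt64(&l.inflight, 1)
	if l.maxInflightPushRequests > 0 && n > l.maxInflightPushRequests {
		atomic.AddInt64(&l.inflight, -1) // roll back the reservation
		return nil, errTooManyInflight
	}
	return func() { atomic.AddInt64(&l.inflight, -1) }, nil
}

func main() {
	l := &instanceLimiter{maxInflightPushRequests: 2}
	f1, _ := l.startPush()
	f2, _ := l.startPush()
	_, err := l.startPush() // third concurrent push exceeds the limit
	fmt.Println(err)
	f1()
	f2()
}
```

Optimistically incrementing and rolling back on failure keeps the check to a single atomic add on the happy path, which matters on an ingester's hot push path.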

Which issue(s) this PR fixes:
Fixes #665.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Contributor

@56quarters 56quarters left a comment


Looks good, small nit about atomic alignment.

It seems like more settings are being added to runtime config due to a desire to change them without restarts. Obviously not something to be solved here, but it might be time to start thinking about how we could enable more configuration to be changed at runtime without needing to add it to the runtimeConfigValues struct.

pkg/ingester/ingester.go
Contributor

@pracucci pracucci left a comment


Solid work! 👏 The overall logic makes sense to me. I've some concerns about the "global limits" naming (because we already have "global" limits) and some high level comments. I will take a deeper look at tiny details during the 2nd pass review.

docs/configuration/config-file-reference.md (outdated)
docs/configuration/config-file-reference.md (outdated)
pkg/ingester/ingester_v2.go (outdated)
pkg/ingester/global_limits.go (outdated)
pkg/ingester/ingester.go (outdated)
pkg/ingester/ingester_v2.go (outdated)
pkg/ingester/ingester_v2.go
pkg/ingester/ingester_v2.go (outdated)
pkg/ingester/ingester_v2.go (outdated)
pkg/ingester/global_limits.go (outdated)
@pstibrany pstibrany force-pushed the ingester-limits branch 3 times, most recently from 1b8dc66 to d37db98 on March 31, 2021 at 07:46
Contributor

@ranton256 ranton256 left a comment


Looks good to me overall. Thanks for working on this.

pkg/ingester/ingester_v2.go (outdated)
docs/configuration/config-file-reference.md
@ranton256
Contributor

Looks good to me overall. Thanks for working on this.

I remembered after I sent this, I was also going to ask if you had any data on the performance difference with the new limits enabled.

@pstibrany
Contributor Author

pstibrany commented Mar 31, 2021

I remembered after I sent this, I was also going to ask if you had any data on the performance difference with the new limits enabled.

We haven't run this code in any of our environments yet. I expect that we will start testing it sometime after Easter. I have done some benchmarking in this comment, but it only compared master vs. this branch with nil limits. This PR also includes a benchmark which compares some failure scenarios with and without limits. Unfortunately I don't have numbers ready to post here at the moment.

Contributor

@ranton256 ranton256 left a comment


LGTM, and thanks for the info on the benchmarking status.

Contributor

@tomwilkie tomwilkie left a comment


Gave this a once-over and LGTM! Thanks Peter.

Contributor

@pracucci pracucci left a comment


Very good job, and super sorry for my late review! I left a few nits. No need for me to re-review once they're addressed. Thanks!

pkg/ingester/instance_limits.go (outdated)
pkg/ingester/instance_limits.go (outdated)
pkg/ingester/ingester_v2.go (outdated)
pstibrany added 12 commits April 9, 2021 15:24
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
pstibrany added 11 commits April 9, 2021 15:25
Rename max_users to max_tenants.
Removed extra parameter to `getOrCreateTSDB`
…that these limits only work when using blocks engine.
…that these limits only work when using blocks engine.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
pstibrany and others added 4 commits April 9, 2021 15:25
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
@pstibrany
Contributor Author

Thank you for all reviews!

@pstibrany pstibrany merged commit 7f85a26 into cortexproject:master Apr 9, 2021
@bboreham
Contributor

Also fixes #858

Comment on lines +262 to +263
inflightRequests: promauto.With(r).NewGaugeFunc(prometheus.GaugeOpts{
Name: "cortex_ingester_inflight_push_requests",
Contributor


This duplicates cortex_inflight_requests{route="/cortex.Ingester/Push"}.

Contributor Author


Yes, it does. But it exposes the value used for the limit check.

@bboreham
Contributor

bboreham commented Jul 5, 2021

I tried this out, with the limit set to 500 and many synthetic requests being pushed in from Avalanche.

It did, as hoped, cause a lot of errors "cannot push: too many inflight push requests in ingester".

I also expected the change to cut the number of goroutines, but there were still a lot, mostly parked in gRPC before the code with the limit is hit:

goroutine profile: total 5860
5274 @ 0x43b2c5 0x44c437 0xac6d05 0xac6bd1 0xac7c35 0x47f887 0xac7b72 0xac7b2f 0xb3ca23 0xb3d64d 0xb437f6 0xb47b8c 0xb5648b 0x472701
#	0xac6d04	google.golang.org/grpc/internal/transport.(*recvBufferReader).read+0xa4		/backend-enterprise/vendor/google.golang.org/grpc/internal/transport/transport.go:177
#	0xac6bd0	google.golang.org/grpc/internal/transport.(*recvBufferReader).Read+0x210	/backend-enterprise/vendor/google.golang.org/grpc/internal/transport/transport.go:171
#	0xac7c34	google.golang.org/grpc/internal/transport.(*transportReader).Read+0x54		/backend-enterprise/vendor/google.golang.org/grpc/internal/transport/transport.go:482
#	0x47f886	io.ReadAtLeast+0x86								/usr/local/go/src/io/io.go:328
#	0xac7b71	io.ReadFull+0xd1								/usr/local/go/src/io/io.go:347
#	0xac7b2e	google.golang.org/grpc/internal/transport.(*Stream).Read+0x8e			/backend-enterprise/vendor/google.golang.org/grpc/internal/transport/transport.go:466
#	0xb3ca22	google.golang.org/grpc.(*parser).recvMsg+0x62					/backend-enterprise/vendor/google.golang.org/grpc/rpc_util.go:557
#	0xb3d64c	google.golang.org/grpc.recvAndDecompress+0x4c					/backend-enterprise/vendor/google.golang.org/grpc/rpc_util.go:688
#	0xb437f5	google.golang.org/grpc.(*Server).processUnaryRPC+0x355				/backend-enterprise/vendor/google.golang.org/grpc/server.go:1176
#	0xb47b8b	google.golang.org/grpc.(*Server).handleStream+0xd0b				/backend-enterprise/vendor/google.golang.org/grpc/server.go:1533
#	0xb5648a	google.golang.org/grpc.(*Server).serveStreams.func1.2+0xaa			/backend-enterprise/vendor/google.golang.org/grpc/server.go:871

161 @ 0x43b2c5 0x44cf05 0x44ceee 0x41cb5e 0x41cb42 0x41fd58 0x40e1d6 0x40b699 0x9c765e 0x9c9254 0x9b3775 0x20844a7 0x20799dd 0x156b2e9 0x20db263 0xb9ab23 0xd0b2a4 0x20c6cb6 0xb9ab23 0xb9e7e2 0xb9ab23 0xd0eafa 0xb9ab23 0xd0b814 0xb9ab23 0xb9ad17 0x154fdb0 0xb439cb 0xb47b8c 0xb5648b 0x472701
#	0x9c765d	github.com/prometheus/prometheus/tsdb.(*Head).putSeriesBuffer+0x3d			/backend-enterprise/vendor/github.com/prometheus/prometheus/tsdb/head.go:1282
#	0x9c9253	github.com/prometheus/prometheus/tsdb.(*headAppender).Commit+0x633			/backend-enterprise/vendor/github.com/prometheus/prometheus/tsdb/head.go:1521
#	0x9b3774	github.com/prometheus/prometheus/tsdb.dbAppender.Commit+0x34				/backend-enterprise/vendor/github.com/prometheus/prometheus/tsdb/db.go:817
#	0x20844a6	github.com/cortexproject/cortex/pkg/ingester.(*Ingester).v2Push+0x1a06			/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ingester/ingester_v2.go:896
#	0x20799dc	github.com/cortexproject/cortex/pkg/ingester.(*Ingester).Push+0x8dc			/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ingester/ingester.go:475
#	0x156b2e8	github.com/cortexproject/cortex/pkg/ingester/client._Ingester_Push_Handler.func1+0x88	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ingester/client/ingester.pb.go:2565
#	0x20db262	github.com/cortexproject/cortex/pkg/cortex.ThanosTracerUnaryInterceptor+0xa2		/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/cortex/tracing.go:14
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xd0b2a3	github.com/weaveworks/common/middleware.ServerUserHeaderInterceptor+0xa3		/backend-enterprise/vendor/github.com/weaveworks/common/middleware/grpc_auth.go:38
#	0x20c6cb5	github.com/cortexproject/cortex/pkg/util/fakeauth.SetupAuthMiddleware.func1+0x115	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/fakeauth/fake_auth.go:27
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xb9e7e1	github.com/opentracing-contrib/go-grpc.OpenTracingServerInterceptor.func1+0x301		/backend-enterprise/vendor/github.com/opentracing-contrib/go-grpc/server.go:57
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xd0eaf9	github.com/weaveworks/common/middleware.UnaryServerInstrumentInterceptor.func1+0x99	/backend-enterprise/vendor/github.com/weaveworks/common/middleware/grpc_instrumentation.go:32
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xd0b813	github.com/weaveworks/common/middleware.GRPCServerLog.UnaryServerInterceptor+0x93	/backend-enterprise/vendor/github.com/weaveworks/common/middleware/grpc_logging.go:29
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xb9ad16	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1+0xd6		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34
#	0x154fdaf	github.com/cortexproject/cortex/pkg/ingester/client._Ingester_Push_Handler+0x14f	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ingester/client/ingester.pb.go:2567
#	0xb439ca	google.golang.org/grpc.(*Server).processUnaryRPC+0x52a					/backend-enterprise/vendor/google.golang.org/grpc/server.go:1210
#	0xb47b8b	google.golang.org/grpc.(*Server).handleStream+0xd0b					/backend-enterprise/vendor/google.golang.org/grpc/server.go:1533
#	0xb5648a	google.golang.org/grpc.(*Server).serveStreams.func1.2+0xaa				/backend-enterprise/vendor/google.golang.org/grpc/server.go:871

This in turn was caused by me using the cortex-jsonnet config which sets -server.grpc-max-concurrent-streams to 100,000: https://github.com/grafana/cortex-jsonnet/blob/23d110a5f0450417a551102007167d146702513f/cortex/ingester.libsonnet#L34

Reducing -server.grpc-max-concurrent-streams to 500 capped the number of goroutines below 1000.
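The effect described above can be sketched without gRPC (a hedged model, not actual gRPC internals; `serveStreams` and its parameters are hypothetical names): the server runs one goroutine per concurrent stream, so the stream cap is effectively a goroutine cap. A buffered channel stands in for that cap here:

```go
// Model of -server.grpc-max-concurrent-streams: a buffered channel acts as
// a semaphore, so no matter how many requests arrive, at most
// maxConcurrentStreams handler goroutines run at once.
package main

import (
	"fmt"
	"sync"
)

// serveStreams simulates handling totalRequests streams with at most
// maxConcurrentStreams running concurrently, and reports peak concurrency.
func serveStreams(totalRequests, maxConcurrentStreams int) int {
	sem := make(chan struct{}, maxConcurrentStreams)
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		active int
		peak   int
	)
	for i := 0; i < totalRequests; i++ {
		sem <- struct{}{} // blocks once the cap is reached
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			active++
			if active > peak {
				peak = active
			}
			mu.Unlock()
			// ... handle the stream (push request) ...
			mu.Lock()
			active--
			mu.Unlock()
			<-sem // release the slot
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// With a cap of 500, peak concurrency never exceeds 500 regardless
	// of how many requests are pushed in.
	fmt.Println(serveStreams(10_000, 500) <= 500)
}
```

This matches the observation: with the cap at 100,000 the ingester's own inflight limit rejected requests only after gRPC had already spawned thousands of goroutines, whereas lowering the cap to 500 bounded the goroutine count itself.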

Successfully merging this pull request may close these issues.

Ingesters should not grow without bound
7 participants