
Ingester bounds #3992

Merged
merged 27 commits on Apr 9, 2021
Conversation

pstibrany
Contributor

@pstibrany pstibrany commented Mar 22, 2021

What this PR does: This PR implements various global (per-ingester, not per-tenant) limits for use by the ingester:

  • Max number of series in memory
  • Max number of tenants in memory
  • Max number of inflight requests
  • Max ingestion rate.

All of these limits are disabled by default. They can be changed via config file or CLI parameters, and via runtime configuration (to avoid redeploying ingesters). Current limits are exported as the cortex_ingester_global_limit metric with a limit label distinguishing each one. If the ingester finds that a push request would exceed one of these limits, it returns a 500 error.
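As a hedged sketch of the inflight-requests limit described above (this is not the PR's actual code; `instanceLimiter`, `startPush`, and the zero-means-disabled convention are illustrative assumptions, though the error message matches the one observed later in this thread):

```go
// Illustrative sketch of a per-ingester inflight-push-request limit using
// an atomic counter. A limit of 0 means "disabled", matching the PR's
// "disabled by default" behaviour.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errTooManyInflight = errors.New("cannot push: too many inflight push requests in ingester")

type instanceLimiter struct {
	maxInflightPushRequests int64 // 0 = disabled
	inflight                int64
}

// startPush reserves an inflight slot, or fails if the limit is reached.
// The caller must invoke the returned finish() when the push completes.
func (l *instanceLimiter) startPush() (finish func(), err error) {
	n := atomic.AddInt64(&l.inflight, 1)
	if l.maxInflightPushRequests > 0 && n > l.maxInflightPushRequests {
		atomic.AddInt64(&l.inflight, -1) // roll back the reservation
		return nil, errTooManyInflight
	}
	return func() { atomic.AddInt64(&l.inflight, -1) }, nil
}

func main() {
	l := &instanceLimiter{maxInflightPushRequests: 2}
	f1, _ := l.startPush()
	f2, _ := l.startPush()
	_, err := l.startPush() // third concurrent push exceeds the limit
	fmt.Println(err)
	f1()
	f2()
}
```

Optimistically incrementing and rolling back on failure keeps the check to a single atomic add on the happy path, which matters on an ingester's hot push path.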

Which issue(s) this PR fixes:
Fixes #665.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Contributor

@56quarters 56quarters left a comment


Looks good, small nit about atomic alignment.

It seems like more settings are being added to runtime config due to a desire to change them without restarts. Obviously not something to be solved here, but it might be time to start thinking about how we could enable more configuration to be changed at runtime without needing to add it to the runtimeConfigValues struct.

pkg/ingester/ingester.go
Contributor

@pracucci pracucci left a comment


Solid work! 👏 The overall logic makes sense to me. I've some concerns about the "global limits" naming (because we already have "global" limits) and some high level comments. I will take a deeper look at tiny details during the 2nd pass review.

docs/configuration/config-file-reference.md (outdated)
docs/configuration/config-file-reference.md (outdated)
pkg/ingester/ingester_v2.go (outdated)
pkg/ingester/global_limits.go (outdated)
pkg/ingester/ingester.go (outdated)
pkg/ingester/ingester_v2.go (outdated)
pkg/ingester/ingester_v2.go
pkg/ingester/ingester_v2.go (outdated)
pkg/ingester/ingester_v2.go (outdated)
pkg/ingester/global_limits.go (outdated)
@pstibrany pstibrany force-pushed the ingester-limits branch 3 times, most recently from 1b8dc66 to d37db98 on March 31, 2021 at 07:46
Contributor

@ranton256 ranton256 left a comment


Looks good to me overall. Thanks for working on this.

pkg/ingester/ingester_v2.go (outdated)
docs/configuration/config-file-reference.md
@ranton256
Contributor

Looks good to me overall. Thanks for working on this.

I remembered after I sent this, I was also going to ask if you had any data on the performance difference with the new limits enabled.

@pstibrany
Contributor Author

pstibrany commented Mar 31, 2021

I remembered after I sent this, I was also going to ask if you had any data on the performance difference with the new limits enabled.

We haven't run this code in any of our environments yet. I expect that we will start testing it sometime after Easter. I have done some benchmarking in this comment, but it only compared master vs. this branch with nil limits. This PR also includes a benchmark which compares some failure scenarios with and without limits. Unfortunately I don't have numbers ready to post here at the moment.

Contributor

@ranton256 ranton256 left a comment


LGTM, and thanks for the info on the benchmarking status.

Contributor

@tomwilkie tomwilkie left a comment


Gave this a once-over and LGTM! Thanks Peter.

Contributor

@pracucci pracucci left a comment


Very good job, and super sorry for my late review! I left a few nits. No need for me to re-review once they're addressed. Thanks!

pkg/ingester/instance_limits.go (outdated)
pkg/ingester/instance_limits.go (outdated)
pkg/ingester/ingester_v2.go (outdated)
pstibrany added 12 commits April 9, 2021 15:24
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
pstibrany added 11 commits April 9, 2021 15:25
Rename max_users to max_tenants.
Removed extra parameter to `getOrCreateTSDB`
…that these limits only work when using blocks engine.
…that these limits only work when using blocks engine.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
pstibrany and others added 4 commits April 9, 2021 15:25
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
@pstibrany
Contributor Author

Thank you for all reviews!

@pstibrany pstibrany merged commit 7f85a26 into cortexproject:master Apr 9, 2021
@bboreham
Contributor

Also fixes #858

Comment on lines +262 to +263
inflightRequests: promauto.With(r).NewGaugeFunc(prometheus.GaugeOpts{
Name: "cortex_ingester_inflight_push_requests",
Contributor


This duplicates cortex_inflight_requests{route="/cortex.Ingester/Push"}.

Contributor Author


Yes, it does. But it exposes the value used for the limit check.

@bboreham
Contributor

bboreham commented Jul 5, 2021

I tried this out, with the limit set to 500 and many synthetic requests being pushed in from Avalanche.

It did, as hoped, cause a lot of errors "cannot push: too many inflight push requests in ingester".

I also expected the change to cut the number of goroutines, but there were still a lot, mostly parked in gRPC before the code with the limit is hit:

goroutine profile: total 5860
5274 @ 0x43b2c5 0x44c437 0xac6d05 0xac6bd1 0xac7c35 0x47f887 0xac7b72 0xac7b2f 0xb3ca23 0xb3d64d 0xb437f6 0xb47b8c 0xb5648b 0x472701
#	0xac6d04	google.golang.org/grpc/internal/transport.(*recvBufferReader).read+0xa4		/backend-enterprise/vendor/google.golang.org/grpc/internal/transport/transport.go:177
#	0xac6bd0	google.golang.org/grpc/internal/transport.(*recvBufferReader).Read+0x210	/backend-enterprise/vendor/google.golang.org/grpc/internal/transport/transport.go:171
#	0xac7c34	google.golang.org/grpc/internal/transport.(*transportReader).Read+0x54		/backend-enterprise/vendor/google.golang.org/grpc/internal/transport/transport.go:482
#	0x47f886	io.ReadAtLeast+0x86								/usr/local/go/src/io/io.go:328
#	0xac7b71	io.ReadFull+0xd1								/usr/local/go/src/io/io.go:347
#	0xac7b2e	google.golang.org/grpc/internal/transport.(*Stream).Read+0x8e			/backend-enterprise/vendor/google.golang.org/grpc/internal/transport/transport.go:466
#	0xb3ca22	google.golang.org/grpc.(*parser).recvMsg+0x62					/backend-enterprise/vendor/google.golang.org/grpc/rpc_util.go:557
#	0xb3d64c	google.golang.org/grpc.recvAndDecompress+0x4c					/backend-enterprise/vendor/google.golang.org/grpc/rpc_util.go:688
#	0xb437f5	google.golang.org/grpc.(*Server).processUnaryRPC+0x355				/backend-enterprise/vendor/google.golang.org/grpc/server.go:1176
#	0xb47b8b	google.golang.org/grpc.(*Server).handleStream+0xd0b				/backend-enterprise/vendor/google.golang.org/grpc/server.go:1533
#	0xb5648a	google.golang.org/grpc.(*Server).serveStreams.func1.2+0xaa			/backend-enterprise/vendor/google.golang.org/grpc/server.go:871

161 @ 0x43b2c5 0x44cf05 0x44ceee 0x41cb5e 0x41cb42 0x41fd58 0x40e1d6 0x40b699 0x9c765e 0x9c9254 0x9b3775 0x20844a7 0x20799dd 0x156b2e9 0x20db263 0xb9ab23 0xd0b2a4 0x20c6cb6 0xb9ab23 0xb9e7e2 0xb9ab23 0xd0eafa 0xb9ab23 0xd0b814 0xb9ab23 0xb9ad17 0x154fdb0 0xb439cb 0xb47b8c 0xb5648b 0x472701
#	0x9c765d	github.com/prometheus/prometheus/tsdb.(*Head).putSeriesBuffer+0x3d			/backend-enterprise/vendor/github.com/prometheus/prometheus/tsdb/head.go:1282
#	0x9c9253	github.com/prometheus/prometheus/tsdb.(*headAppender).Commit+0x633			/backend-enterprise/vendor/github.com/prometheus/prometheus/tsdb/head.go:1521
#	0x9b3774	github.com/prometheus/prometheus/tsdb.dbAppender.Commit+0x34				/backend-enterprise/vendor/github.com/prometheus/prometheus/tsdb/db.go:817
#	0x20844a6	github.com/cortexproject/cortex/pkg/ingester.(*Ingester).v2Push+0x1a06			/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ingester/ingester_v2.go:896
#	0x20799dc	github.com/cortexproject/cortex/pkg/ingester.(*Ingester).Push+0x8dc			/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ingester/ingester.go:475
#	0x156b2e8	github.com/cortexproject/cortex/pkg/ingester/client._Ingester_Push_Handler.func1+0x88	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ingester/client/ingester.pb.go:2565
#	0x20db262	github.com/cortexproject/cortex/pkg/cortex.ThanosTracerUnaryInterceptor+0xa2		/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/cortex/tracing.go:14
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xd0b2a3	github.com/weaveworks/common/middleware.ServerUserHeaderInterceptor+0xa3		/backend-enterprise/vendor/github.com/weaveworks/common/middleware/grpc_auth.go:38
#	0x20c6cb5	github.com/cortexproject/cortex/pkg/util/fakeauth.SetupAuthMiddleware.func1+0x115	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/fakeauth/fake_auth.go:27
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xb9e7e1	github.com/opentracing-contrib/go-grpc.OpenTracingServerInterceptor.func1+0x301		/backend-enterprise/vendor/github.com/opentracing-contrib/go-grpc/server.go:57
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xd0eaf9	github.com/weaveworks/common/middleware.UnaryServerInstrumentInterceptor.func1+0x99	/backend-enterprise/vendor/github.com/weaveworks/common/middleware/grpc_instrumentation.go:32
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xd0b813	github.com/weaveworks/common/middleware.GRPCServerLog.UnaryServerInterceptor+0x93	/backend-enterprise/vendor/github.com/weaveworks/common/middleware/grpc_logging.go:29
#	0xb9ab22	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x62		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25
#	0xb9ad16	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1+0xd6		/backend-enterprise/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34
#	0x154fdaf	github.com/cortexproject/cortex/pkg/ingester/client._Ingester_Push_Handler+0x14f	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ingester/client/ingester.pb.go:2567
#	0xb439ca	google.golang.org/grpc.(*Server).processUnaryRPC+0x52a					/backend-enterprise/vendor/google.golang.org/grpc/server.go:1210
#	0xb47b8b	google.golang.org/grpc.(*Server).handleStream+0xd0b					/backend-enterprise/vendor/google.golang.org/grpc/server.go:1533
#	0xb5648a	google.golang.org/grpc.(*Server).serveStreams.func1.2+0xaa				/backend-enterprise/vendor/google.golang.org/grpc/server.go:871

This in turn was caused by me using the cortex-jsonnet config which sets -server.grpc-max-concurrent-streams to 100,000: https://github.com/grafana/cortex-jsonnet/blob/23d110a5f0450417a551102007167d146702513f/cortex/ingester.libsonnet#L34

Reducing -server.grpc-max-concurrent-streams to 500 capped the number of goroutines below 1000.
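The effect described above can be sketched without gRPC (a hedged model, not actual gRPC internals; `serveStreams` and its parameters are hypothetical names): the server runs one goroutine per concurrent stream, so the stream cap is effectively a goroutine cap. A buffered channel stands in for that cap here:

```go
// Model of -server.grpc-max-concurrent-streams: a buffered channel acts as
// a semaphore, so no matter how many requests arrive, at most
// maxConcurrentStreams handler goroutines run at once.
package main

import (
	"fmt"
	"sync"
)

// serveStreams simulates handling totalRequests streams with at most
// maxConcurrentStreams running concurrently, and reports peak concurrency.
func serveStreams(totalRequests, maxConcurrentStreams int) int {
	sem := make(chan struct{}, maxConcurrentStreams)
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		active int
		peak   int
	)
	for i := 0; i < totalRequests; i++ {
		sem <- struct{}{} // blocks once the cap is reached
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			active++
			if active > peak {
				peak = active
			}
			mu.Unlock()
			// ... handle the stream (push request) ...
			mu.Lock()
			active--
			mu.Unlock()
			<-sem // release the slot
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// With a cap of 500, peak concurrency never exceeds 500 regardless
	// of how many requests are pushed in.
	fmt.Println(serveStreams(10_000, 500) <= 500)
}
```

This matches the observation: with the cap at 100,000 the ingester's own inflight limit rejected requests only after gRPC had already spawned thousands of goroutines, whereas lowering the cap to 500 bounded the goroutine count itself.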

Successfully merging this pull request may close these issues.

Ingesters should not grow without bound
7 participants