-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Limit concurrent creation of healthcheck gRPC connections #15053
Limit concurrent creation of healthcheck gRPC connections #15053
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
On 2nd thought, |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #15053 +/- ##
==========================================
- Coverage 67.41% 65.44% -1.98%
==========================================
Files 1560 1562 +2
Lines 192752 193677 +925
==========================================
- Hits 129952 126751 -3201
- Misses 62800 66926 +4126 ☔ View full report in Codecov by Sentry. |
cb2638e
to
24142aa
Compare
should a VTGate be cell-bound then to connect to ~10k vttablets? |
@harshit-gangal the Wherever possible our |
39e99ba
to
223967b
Compare
} | ||
|
||
func (thc *tabletHealthCheck) connectionLocked() queryservice.QueryService { | ||
func healthCheckDialerFactory(hc *HealthCheckImpl) func(ctx context.Context, addr string) (net.Conn, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If other parts of the code could benefit from a similar limit, perhaps this should be moved to a helper/util library or baked into grpcclient? 🤔 cc @deepthi (and anyone else) for thoughts
66555dc
to
fdae21d
Compare
go/vt/grpcclient/client.go
Outdated
@@ -39,6 +40,7 @@ import ( | |||
) | |||
|
|||
var ( | |||
grpcOptionsMu sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added to fix data race:
WARNING: DATA RACE
Read at 0x000003e57f70 by goroutine 71:
vitess.io/vitess/go/vt/grpcclient.RegisterGRPCDialOptions()
/Users/tim/github/vitess/go/vt/grpcclient/client.go:89 +0xdd
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).connectionLocked()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:164 +0x68
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).Connection()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:144 +0xb5
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).stream()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:128 +0x44
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).checkConn()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:295 +0x6ee
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet.func2()
/Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0x44
Previous write at 0x000003e57f70 by goroutine 69:
vitess.io/vitess/go/vt/grpcclient.RegisterGRPCDialOptions()
/Users/tim/github/vitess/go/vt/grpcclient/client.go:89 +0x184
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).connectionLocked()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:164 +0x68
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).Connection()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:144 +0xb5
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).stream()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:128 +0x44
vitess.io/vitess/go/vt/discovery.(*tabletHealthCheck).checkConn()
/Users/tim/github/vitess/go/vt/discovery/tablet_health_check.go:295 +0x6ee
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet.func2()
/Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0x44
Goroutine 71 (running) created at:
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet()
/Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0xad5
vitess.io/vitess/go/vt/discovery.TestHealthCheckErrorOnPrimaryAfterExternalReparent()
/Users/tim/github/vitess/go/vt/discovery/healthcheck_test.go:358 +0x971
testing.tRunner()
/usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1595 +0x261
testing.(*T).Run.func1()
/usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1648 +0x44
Goroutine 69 (running) created at:
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).AddTablet()
/Users/tim/github/vitess/go/vt/discovery/healthcheck.go:426 +0xad5
vitess.io/vitess/go/vt/discovery.TestHealthCheckErrorOnPrimaryAfterExternalReparent()
/Users/tim/github/vitess/go/vt/discovery/healthcheck_test.go:351 +0x5cc
testing.tRunner()
/usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1595 +0x261
testing.(*T).Run.func1()
/usr/local/Cellar/go/1.21.7/libexec/src/testing/testing.go:1648 +0x44
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
4bda1f6
to
a5316d7
Compare
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
This PR is slowing down the parallel vttablet connections being opened but still has the 10k limit on the thread count. I found that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look fine to me. An issue and adding release notes would be good.
@harshit-gangal thanks! vitessio/website#1696 adds this to the 3 x binaries that changed |
@harshit-gangal we considered raising the max threads limit but tried to avoid that in case it allowed a different problem to spiral out of control, also it doesn't scale as easily with growth unless it's cranked to a very high value We've been using the Related, we've noticed a blip of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. There is actually no way in the GRPC options to limit the amount of concurrent dials so the custom semaphore seems fine.
@timvaillancourt I don't think you need to update the website directly: you're supposed to add this to the release notes in this repository, as part of the PR, and the website will get updated once we release the new Vitess version. |
Thanks @vmg, I'll add to the release notes in this PR 👍 I think the |
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@harshit-gangal / @vmg I believe the doc changes are ready 🙇 |
Thanks Tim! LGTM. I'm 99% sure the flags section in the website is auto-generated now. @frouioui will know for sure. Need +1 to merge @harshit-gangal |
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@timvaillancourt now it needs a conflict resolution. |
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
…5053) Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
…izations (#227) * Load `--grpc_auth_static_client_creds` file once (vitessio#15030) Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Filter by keyspace earlier in `tabletgateway`s `WaitForTablets(...)` (vitessio#15347) Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Limit concurrent creation of healthcheck gRPC connections (vitessio#15053) Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * go mod tidy Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Update MySQL apt package and GPG signature (vitessio#14785) Signed-off-by: Matt Lord <mattalord@gmail.com> * remove unrelated workflow files from v20 Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> --------- Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> Signed-off-by: Matt Lord <mattalord@gmail.com> Co-authored-by: Matt Lord <mattalord@gmail.com>
…5053) Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
* Make `Durabler` interface methods public (vitessio#15548) Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> Signed-off-by: Manan Gupta <manan@planetscale.com> Co-authored-by: Manan Gupta <manan@planetscale.com> * Load `--grpc_auth_static_client_creds` file once (vitessio#15030) Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Limit concurrent creation of healthcheck gRPC connections (vitessio#15053) Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Filter by keyspace earlier in `tabletgateway`s `WaitForTablets(...)` (vitessio#15347) Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Use slack-15.0 as previous release Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * empty commit Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * force ci to run * Update GH Action runners Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * test templates Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * set GH access token in build Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Fix reparent old tests Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Remove CIs we don't need Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Remove CIs we don't need again Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Add private repo setup to upgrade_downgrade_test_backups_e2e.yml Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Add private repo setup to more CI Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * remove CI skip logic for upstream stuff Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * CODEOWNERS Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * [release-19.0] Add timeout to all the contexts used for RPC calls in vtorc (vitessio#15991) (vitessio#16103) Signed-off-by: Manan Gupta <manan@planetscale.com> * `slack-vitess-r15.0.5`: forward-port consul topo limits PR #111 (#297) * `slack-vitess-r14.0.5`: allow conn overrides in consul topo (#111) * `slack-vitess-r14.0.5`: allow conn overrides in consul topo Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * fix e2e test Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> --------- Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Update flags tests that didn't exist in v14 Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> --------- Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * update vtcombo e2e Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Fix err with installing percona-xtrabackup-24 Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * `slack-vitess-r15.0.5`: fix races in `Unit Test (Race)` CI, fix "old" reparent CIs (#356) * update vtcombo e2e test Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * Fix bad merge conflict fix Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * go mod tidy Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * update vtcombo e2e test again Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * [release-19.0] Upgrade the Golang version to `go1.22.5` (vitessio#16322) Signed-off-by: GitHub <noreply@github.com> Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr> Co-authored-by: frouioui <frouioui@users.noreply.github.com> Co-authored-by: Florent Poinsard <florent.poinsard@outlook.fr> * merge conflict fixes Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> * make vtadmin_web_proto_types Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> --------- Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com> Signed-off-by: Manan Gupta <manan@planetscale.com> Signed-off-by: GitHub <noreply@github.com> Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr> Co-authored-by: Manan Gupta <manan@planetscale.com> Co-authored-by: Manan Gupta <35839558+GuptaManan100@users.noreply.github.com> Co-authored-by: vitess-bot <139342327+vitess-bot@users.noreply.github.com> Co-authored-by: frouioui <frouioui@users.noreply.github.com> Co-authored-by: Florent Poinsard <florent.poinsard@outlook.fr>
Description
This PR adds a concurrency limit for the number of gRPC connections created by the healthcheck handler of
vtgate
, in order to avoid hittingruntime: program exceeds 10000-thread limit
panicsRelated, in #15030 I was able to remove 1 x sycall/stream (for gRPC client creds), however this is not enough to avoid hitting panics - now we're seeing raw network syscalls as the main thread-usage offenders
Hot syscall # 1 (DNS due to new gRPC connection to tablet)
Hot syscall # 2 (same - gRPC connecting to tablet)
Hot syscall # 3
I lost this one unfortunately, but it was a
syscall.Open
for the client TCP connection openRelated Issue(s)
--grpc_auth_static_client_creds
file once #15030Checklist
Deployment Notes