Intermittent connection errors under high concurrency #3766
Comments
Initial thoughts when looking at this: you probably hit a max-concurrent-streams limit. The first error seems to come consistently when opening more than 50 streams. Can you confirm this?
Hey, I'm seeing similar issues when intercepting a web application and running it locally through https://vite.dev/.

diff --git a/pkg/tunnel/dialer.go b/pkg/tunnel/dialer.go
index ed1d72c0d..d7291eed8 100644
--- a/pkg/tunnel/dialer.go
+++ b/pkg/tunnel/dialer.go
@@ -181,7 +181,8 @@ func (h *dialer) connToStreamLoop(ctx context.Context, wg *sync.WaitGroup) {
endLevel := dlog.LogLevelTrace
id := h.stream.ID()
- outgoing := make(chan Message, 50)
+ const msgBufferSize = 1000
+ outgoing := make(chan Message, msgBufferSize)
defer func() {
if !h.ResetIdle() {
// Hard close of peer. We don't want any more data
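For anyone unfamiliar with the pattern being tweaked here: the 50 is just the capacity of a buffered Go channel, so raising it only delays the point at which a slow consumer makes the producer block. A minimal, self-contained sketch of that backpressure behavior (illustrative only, not Telepresence code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Like the `outgoing` channel in the diff above, but with a tiny capacity
	// so the effect is easy to see: sends only block once the buffer is full
	// and the consumer lags behind.
	msgs := make(chan int, 3)
	done := make(chan struct{})

	go func() {
		defer close(done)
		for m := range msgs {
			time.Sleep(100 * time.Millisecond) // slow consumer
			fmt.Println("consumed", m)
		}
	}()

	for i := 0; i < 10; i++ {
		start := time.Now()
		msgs <- i // blocks as soon as the buffer is full
		fmt.Printf("queued %d after %v\n", i, time.Since(start))
	}
	close(msgs)
	<-done
}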
Very interesting @petergardfjall, and it's intriguing that this number is hardcoded to 50, but I'm uncertain if increasing it actually addresses the problem in this ticket, because the error arrives after 50 creations of tunnels, not after sending 50 messages on one of them. Again, some kind of reproducer would be extremely helpful here, so that we can monitor what's really going on.
I can reliably reproduce the issue with the steps given above. There are two other places where a channel with a capacity of 50 appears; I tried changing them all but couldn't get my locally built images to work.
Ah, right. I was just triggered by seeing that hardcoded 50. I'm gonna run with this and see how it holds up.

diff --git cmd/traffic/cmd/agent/agent.go cmd/traffic/cmd/agent/agent.go
index 460c0b99e..9f8c4c24c 100644
--- cmd/traffic/cmd/agent/agent.go
+++ cmd/traffic/cmd/agent/agent.go
@@ -222,6 +222,7 @@ func TalkToManagerLoop(ctx context.Context, s State, info *rpc.AgentInfo) {
func StartServices(ctx context.Context, g *dgroup.Group, config Config, srv State) (*rpc.AgentInfo, error) {
var grpcOpts []grpc.ServerOption
ac := config.AgentConfig()
+ grpcOpts = append(grpcOpts, grpc.MaxConcurrentStreams(0))
grpcPortCh := make(chan uint16)
g.Go("tunneling", func(ctx context.Context) error {
diff --git cmd/traffic/cmd/manager/manager.go cmd/traffic/cmd/manager/manager.go
index 8c730713c..6a78dec50 100644
--- cmd/traffic/cmd/manager/manager.go
+++ cmd/traffic/cmd/manager/manager.go
@@ -256,6 +256,7 @@ func (s *service) serveHTTP(ctx context.Context) error {
if mz, ok := env.MaxReceiveSize.AsInt64(); ok {
opts = append(opts, grpc.MaxRecvMsgSize(int(mz)))
}
+ opts = append(opts, grpc.MaxConcurrentStreams(0))
grpcHandler := grpc.NewServer(opts...)
httpHandler := http.Handler(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
diff --git pkg/client/rootd/service.go pkg/client/rootd/service.go
index 9ea68746d..60bd1ab83 100644
--- pkg/client/rootd/service.go
+++ pkg/client/rootd/service.go
@@ -413,6 +413,7 @@ func (s *Service) serveGrpc(c context.Context, l net.Listener) error {
if mz := cfg.Grpc().MaxReceiveSize(); mz > 0 {
opts = append(opts, grpc.MaxRecvMsgSize(int(mz)))
}
+ opts = append(opts, grpc.MaxConcurrentStreams(0))
svc := grpc.NewServer(opts...)
rpc.RegisterDaemonServer(svc, s)
diff --git pkg/tunnel/bidipipe.go pkg/tunnel/bidipipe.go
index ab7ee0792..17f5b1a27 100644
--- pkg/tunnel/bidipipe.go
+++ pkg/tunnel/bidipipe.go
@@ -69,7 +69,8 @@ func (p *bidiPipe) doPipe(
readBytesProbe, writeBytesProbe *CounterProbe,
) {
defer wg.Done()
- wrCh := make(chan Message, 50)
+ const msgBufferSize = 1000
+ wrCh := make(chan Message, msgBufferSize)
defer close(wrCh)
wg.Add(1)
WriteLoop(ctx, b, wrCh, wg, writeBytesProbe)
diff --git pkg/tunnel/dialer.go pkg/tunnel/dialer.go
index ed1d72c0d..d7291eed8 100644
--- pkg/tunnel/dialer.go
+++ pkg/tunnel/dialer.go
@@ -181,7 +181,8 @@ func (h *dialer) connToStreamLoop(ctx context.Context, wg *sync.WaitGroup) {
endLevel := dlog.LogLevelTrace
id := h.stream.ID()
- outgoing := make(chan Message, 50)
+ const msgBufferSize = 1000
+ outgoing := make(chan Message, msgBufferSize)
defer func() {
if !h.ResetIdle() {
// Hard close of peer. We don't want any more data
diff --git pkg/tunnel/stream.go pkg/tunnel/stream.go
index c798e6331..597d37e48 100644
--- pkg/tunnel/stream.go
+++ pkg/tunnel/stream.go
@@ -74,7 +74,8 @@ type StreamCreator func(context.Context, ConnID) (Stream, error)
// ReadLoop reads from the Stream and dispatches messages and error to the give channels. There
// will be max one error since the error also terminates the loop.
func ReadLoop(ctx context.Context, s Stream, p *CounterProbe) (<-chan Message, <-chan error) {
- msgCh := make(chan Message, 50)
+ const msgBufferSize = 1000
+ msgCh := make(chan Message, msgBufferSize)
errCh := make(chan error, 1) // Max one message will be sent on this channel
dlog.Tracef(ctx, " %s %s, ReadLoop starting", s.Tag(), s.ID())
go func() {
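For reference, grpc.MaxConcurrentStreams as used in the diff is a standard grpc-go ServerOption that caps the number of concurrent HTTP/2 streams per connection. A minimal sketch of setting it explicitly on a plain server (the port and the value 1000 are arbitrary placeholders, not the Telepresence configuration):

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":8081")
	if err != nil {
		log.Fatal(err)
	}
	// Raise the per-connection stream limit explicitly instead of relying on
	// the server default; 1000 is just an example value.
	srv := grpc.NewServer(grpc.MaxConcurrentStreams(1000))
	// ... register services here ...
	log.Fatal(srv.Serve(lis))
}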
It's a bit of work, but I can share my procedure (there might be better approaches):

Hope this helps!
Thanks for writing out the procedure; it's pretty similar to what I tried yesterday. The key difference is that I'm using a local registry, because it was already set up in this cluster. Running it initially failed.
The secret does exist:

$ kubectl get secret -n ambassador
NAME                                    TYPE                 DATA   AGE
mutator-webhook-tls                     Opaque               3      2m47s
sh.helm.release.v1.traffic-manager.v1   helm.sh/release.v1   1      2m47s

I found I can work around this with agentInjector.certificate.accessMethod=mount. After this, I can install and connect:

$ ./build-output/bin/telepresence helm install --set agentInjector.certificate.accessMethod=mount
Traffic Manager installed successfully
$ ./build-output/bin/telepresence connect
Launching Telepresence User Daemon
Launching Telepresence Root Daemon
Connected to context founda-k3s-1, namespace apps (https://198.19.249.86:6443)
$ ./build-output/bin/telepresence list
No Workloads (Deployments, StatefulSets, ReplicaSets, or Rollouts)
$ ./build-output/bin/telepresence version
OSS Client : v2.22.0-dev
OSS Root Daemon : v2.22.0-dev
OSS User Daemon : v2.22.0-dev
OSS Traffic Manager: v2.22.0-dev
Traffic Agent : registry.founda.dev/tel2:2.22.0-dev
@phooijenga Running ab that way opens a new connection for every request. Establishing a new connection is a fairly heavyweight operation, and a browser that loads hundreds of JavaScript files will therefore always use "keep-alive" on the connections that it has. Try adding the -k (keep-alive) flag to your ab command. That said, during my testing I did discover a bug causing a leak of goroutines that I'll fix, and I also found some worthwhile optimizations to the network stack, so stay tuned for some improvements.
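To see what keep-alive buys you from a Go client (a sketch with a placeholder URL; not part of this repo), compare the default transport, which reuses connections, with one that disables reuse:

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func run(client *http.Client, label string) {
	start := time.Now()
	for i := 0; i < 200; i++ {
		resp, err := client.Get("http://127.0.0.1:8080/") // placeholder URL
		if err != nil {
			fmt.Println(label, "error:", err)
			return
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
		resp.Body.Close()
	}
	fmt.Printf("%s: %v\n", label, time.Since(start))
}

func main() {
	// The default transport keeps connections alive, so repeated requests
	// don't each open a new tunnel through the intercept.
	run(&http.Client{}, "keep-alive")

	// Disabling keep-alive forces a fresh connection per request, which is
	// the heavyweight pattern described above.
	run(&http.Client{Transport: &http.Transport{DisableKeepAlives: true}}, "no keep-alive")
}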
@petergardfjall @phooijenga I've created a 2.21.2-rc.0 release candidate. Please try it out to see if it improves the situation. The panic visible in the daemon.log should definitely not be present now.
Appreciate the swift improvements, @thallgren! I'm still seeing some occasional errors like this one in the logs. I'm starting to suspect, though, that in my case there might also be issues with the dev server (vitejs/vite#17499). I haven't yet grasped the full dynamics of what's going on here, but I'll keep you posted if I come to any revelations. Either way, I very much appreciate the work, and I hope I haven't distracted too much from the root issue described by @phooijenga.
I'm still seeing the "failed to send DialOK" errors, but haven't had the panic in a while.
We hit this problem too, with a frontend that serves a lot of JS files: random files get a 502 error, and it's different files every time.
Describe the bug
When I send a lot of requests to an intercepted service, some connections are reset or the intercept stops working completely.
I am intercepting a web application behind an Nginx ingress. The application does not bundle its resources when running in development mode, so when accessing it, the browser loads hundreds of small JavaScript files over HTTP/2. This results in many concurrent requests from Nginx to the application backend. Some of those connections are reset, which causes Nginx to return an error to the browser. Sometimes the intercept stops working completely, causing the connection to eventually time out.
nuxt/nuxt#28424 describes the same issue.
To Reproduce
In one terminal, create an echo service deployment and intercept it. I'm using another echo server running in Docker here, but I've had the same result with a simple 'Hello World' application in Go (a sketch of such a backend is shown below).
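A minimal 'Hello World' backend of the sort mentioned could look like this (a sketch only; the port is a placeholder, and the deployment/intercept commands are whatever you normally use):

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// A trivial response is enough; the point is to serve many small requests.
		fmt.Fprintf(w, "Hello World from %s\n", r.URL.Path)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}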
While the intercept is active, make many requests with high concurrency, e.g. with trusty old ApacheBench (or a small Go load generator like the one sketched below).
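If ab isn't handy, roughly the same load (think ab -n 5000 -c 100) can be generated with a small Go program; this is a sketch, and the URL and counts are placeholders:

package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"sync/atomic"
)

func main() {
	const (
		totalRequests = 5000
		concurrency   = 100
	)
	url := "http://127.0.0.1:8080/" // placeholder: point this at the intercepted service

	var failures atomic.Int64
	sem := make(chan struct{}, concurrency) // caps the number of in-flight requests
	var wg sync.WaitGroup

	for i := 0; i < totalRequests; i++ {
		wg.Add(1)
		sem <- struct{}{}
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			resp, err := http.Get(url)
			if err != nil {
				failures.Add(1)
				return
			}
			io.Copy(io.Discard, resp.Body) // drain the body so connections can be reused
			resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				failures.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Printf("%d/%d requests failed\n", failures.Load(), totalRequests)
}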
(Sometimes ApacheBench fails with apr_socket_recv: Connection reset by peer (104) instead.)

The traffic-agent log contains errors like this:
The daemon log contains errors like this:
Expected behavior
All requests should complete.
Versions (please complete the following information):
I've encountered the same problem on Ubuntu 24.10 with Kind, and on macOS Sequoia 15.2 with k3s in an OrbStack VM.
Logs
connector.log
daemon.log
echo-sc-7867967d69-lmrhm.apps.log
traffic-manager-588cdf459f-h7bg7.ambassador.log