
Intermittent connection errors under high concurrency #3766

Open
phooijenga opened this issue Jan 9, 2025 · 12 comments

Comments

@phooijenga
Contributor

Describe the bug

When I send a lot of requests to an intercepted service, some connections are reset or the intercept stops working completely.

I am intercepting a web application behind an Nginx ingress which does not bundle resources when running in development mode. When accessing it, the browser loads hundreds of small JavaScript files over HTTP/2. This results in many concurrent requests from Nginx to the application backend. Some of those connections are reset, which causes Nginx to return an error to the browser. Sometimes the intercept stops working completely, causing the connection to eventually time out.

nuxt/nuxt#28424 describes the same issue.

To Reproduce

In one terminal, create an echo service deployment and intercept it. I'm using another echo server running in Docker here, but I've had the same result with a simple 'Hello World' application in Go:

$ kubectl apply -f https://github.com/telepresenceio/telepresence/raw/refs/heads/release/v2/k8s/echo-sc.yaml
service/echo-sc created
deployment.apps/echo-sc created

$ telepresence intercept echo-sc --docker-run -- --rm -p 8080:8080 jmalloc/echo-server
Using Deployment echo-sc
   Intercept name         : echo-sc
   State                  : ACTIVE
   Workload kind          : Deployment
   Destination            : 127.0.0.1:8080
   Service Port Identifier: proxied/TCP
   Volume Mount Point     : /tmp/telfs-2245137005
   Intercepting           : all TCP connections
Echo server listening on port 8080.

While the intercept is active, make many requests with high concurrency, e.g. with trusty old ApacheBench:

$ ab -n 5000 -c 50 http://echo-sc/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking echo-sc (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 3292 requests completed

(Sometimes ApacheBench fails with apr_socket_recv: Connection reset by peer (104) instead.)

The traffic-agent log contains errors like this:

2025-01-09 10:30:06.6702 error   forward-echo-sc:8080 : !! SRV tcp 10.244.0.1:48681 -> 127.0.0.1:8080, Send failed: rpc error: code = Unavailable desc = transport is closing : lis=":9900"
2025-01-09 10:30:06.6704 error   forward-echo-sc:8080 : !! SRV tcp 10.244.0.1:48681 -> 127.0.0.1:8080, Send of closeSend failed: send of closeSend message failed: rpc error: code = Unavailable desc = transport is closing : lis=":9900"

The daemon log contains errors like this:

2025-01-09 11:30:06.6701 error   daemon/session/agentPods : !! CLI tcp 10.244.0.1:48681 -> 127.0.0.1:8080, Send failed: EOF
2025-01-09 11:30:06.6701 error   daemon/session/agentPods : !! CONN tcp 10.244.0.1:48681 -> 127.0.0.1:8080, failed to send DialOK: EOF

Expected behavior

All requests should complete.

Versions (please complete the following information):

OSS Client         : v2.21.1
OSS Root Daemon    : v2.21.1
OSS User Daemon    : v2.21.1
OSS Traffic Manager: v2.21.1
Traffic Agent      : ghcr.io/telepresenceio/tel2:2.21.1

I've encountered the same problem on Ubuntu 24.10 with Kind and macOS Sequoia 15.2 with k3s in an OrbStack VM.

Logs

connector.log
daemon.log
echo-sc-7867967d69-lmrhm.apps.log
traffic-manager-588cdf459f-h7bg7.ambassador.log

@thallgren
Member

My initial thought when looking at this is that you probably hit a max-concurrent-streams limit. The first error seems to come consistently when opening more than 50 streams. Can you confirm this?
A reproducer would be very valuable. It would help identify the problem and could also become one of our regular integration tests once the problem has been properly addressed.
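
For reference, a minimal Go load generator along the lines of the ab run in the issue description might look something like the sketch below. This is only an illustration: the http://echo-sc/ URL and the 5000/50 numbers are taken from the repro steps above, and nothing here uses Telepresence APIs.

// Hypothetical reproducer sketch: fire many concurrent HTTP requests at the
// intercepted service and count the failures. Keep-alive is disabled to mimic
// ab's default of one TCP connection per request.
package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
    "sync/atomic"
)

func main() {
    const (
        url         = "http://echo-sc/" // intercepted echo service from the repro steps
        total       = 5000
        concurrency = 50
    )

    client := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}

    var failed atomic.Int64
    sem := make(chan struct{}, concurrency) // limits the number of in-flight requests
    var wg sync.WaitGroup

    for i := 0; i < total; i++ {
        wg.Add(1)
        sem <- struct{}{}
        go func() {
            defer wg.Done()
            defer func() { <-sem }()
            resp, err := client.Get(url)
            if err != nil {
                failed.Add(1) // connection reset, timeout, etc.
                return
            }
            io.Copy(io.Discard, resp.Body)
            resp.Body.Close()
            if resp.StatusCode != http.StatusOK {
                failed.Add(1)
            }
        }()
    }
    wg.Wait()
    fmt.Printf("%d of %d requests failed\n", failed.Load(), total)
}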

@petergardfjall
Contributor

petergardfjall commented Jan 16, 2025

Hey, I'm seeing similar issues when intercepting a web application and running it locally through https://vite.dev/.
Inspired by @thallgren's comment, I built telepresence locally after applying a small patch like the one below, and that at least seems to make it work a lot better.

diff --git a/pkg/tunnel/dialer.go b/pkg/tunnel/dialer.go
index ed1d72c0d..d7291eed8 100644
--- a/pkg/tunnel/dialer.go
+++ b/pkg/tunnel/dialer.go
@@ -181,7 +181,8 @@ func (h *dialer) connToStreamLoop(ctx context.Context, wg *sync.WaitGroup) {
        endLevel := dlog.LogLevelTrace
        id := h.stream.ID()
 
-       outgoing := make(chan Message, 50)
+       const msgBufferSize = 1000
+       outgoing := make(chan Message, msgBufferSize)
        defer func() {
                if !h.ResetIdle() {
                        // Hard close of peer. We don't want any more data

@thallgren
Member

Very interesting, @petergardfjall, and it's intriguing that this number is hardcoded to 50. However, I'm uncertain whether increasing it actually addresses the problem in this ticket, because the error arrives after 50 tunnel creations, not after sending 50 messages on one of them.

Again, some kind of reproducer would be extremely helpful here, so that we can monitor what's really going on.

@phooijenga
Contributor Author

phooijenga commented Jan 16, 2025 via email

@petergardfjall
Contributor

Very interesting, @petergardfjall, and it's intriguing that this number is hardcoded to 50. However, I'm uncertain whether increasing it actually addresses the problem in this ticket, because the error arrives after 50 tunnel creations, not after sending 50 messages on one of them.

Ah, right. I was just triggered by the 50 and found it hardcoded along the code path I was looking at.
@phooijenga is right that the hardcoded 50 appears in other Message channels as well.
I modified the patch to bump those numbers and also increase the maximum number of concurrent gRPC streams in the servers. (Note that I patched version v2.21.0, since that is what I'm running locally.)

I'm gonna run with this and see how it holds up.

diff --git cmd/traffic/cmd/agent/agent.go cmd/traffic/cmd/agent/agent.go
index 460c0b99e..9f8c4c24c 100644
--- cmd/traffic/cmd/agent/agent.go
+++ cmd/traffic/cmd/agent/agent.go
@@ -222,6 +222,7 @@ func TalkToManagerLoop(ctx context.Context, s State, info *rpc.AgentInfo) {
 func StartServices(ctx context.Context, g *dgroup.Group, config Config, srv State) (*rpc.AgentInfo, error) {
 	var grpcOpts []grpc.ServerOption
 	ac := config.AgentConfig()
+	grpcOpts = append(grpcOpts, grpc.MaxConcurrentStreams(0))
 
 	grpcPortCh := make(chan uint16)
 	g.Go("tunneling", func(ctx context.Context) error {
diff --git cmd/traffic/cmd/manager/manager.go cmd/traffic/cmd/manager/manager.go
index 8c730713c..6a78dec50 100644
--- cmd/traffic/cmd/manager/manager.go
+++ cmd/traffic/cmd/manager/manager.go
@@ -256,6 +256,7 @@ func (s *service) serveHTTP(ctx context.Context) error {
 	if mz, ok := env.MaxReceiveSize.AsInt64(); ok {
 		opts = append(opts, grpc.MaxRecvMsgSize(int(mz)))
 	}
+	opts = append(opts, grpc.MaxConcurrentStreams(0))
 
 	grpcHandler := grpc.NewServer(opts...)
 	httpHandler := http.Handler(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
diff --git pkg/client/rootd/service.go pkg/client/rootd/service.go
index 9ea68746d..60bd1ab83 100644
--- pkg/client/rootd/service.go
+++ pkg/client/rootd/service.go
@@ -413,6 +413,7 @@ func (s *Service) serveGrpc(c context.Context, l net.Listener) error {
 	if mz := cfg.Grpc().MaxReceiveSize(); mz > 0 {
 		opts = append(opts, grpc.MaxRecvMsgSize(int(mz)))
 	}
+	opts = append(opts, grpc.MaxConcurrentStreams(0))
 	svc := grpc.NewServer(opts...)
 	rpc.RegisterDaemonServer(svc, s)
 
diff --git pkg/tunnel/bidipipe.go pkg/tunnel/bidipipe.go
index ab7ee0792..17f5b1a27 100644
--- pkg/tunnel/bidipipe.go
+++ pkg/tunnel/bidipipe.go
@@ -69,7 +69,8 @@ func (p *bidiPipe) doPipe(
 	readBytesProbe, writeBytesProbe *CounterProbe,
 ) {
 	defer wg.Done()
-	wrCh := make(chan Message, 50)
+	const msgBufferSize = 1000
+	wrCh := make(chan Message, msgBufferSize)
 	defer close(wrCh)
 	wg.Add(1)
 	WriteLoop(ctx, b, wrCh, wg, writeBytesProbe)
diff --git pkg/tunnel/dialer.go pkg/tunnel/dialer.go
index ed1d72c0d..d7291eed8 100644
--- pkg/tunnel/dialer.go
+++ pkg/tunnel/dialer.go
@@ -181,7 +181,8 @@ func (h *dialer) connToStreamLoop(ctx context.Context, wg *sync.WaitGroup) {
 	endLevel := dlog.LogLevelTrace
 	id := h.stream.ID()
 
-	outgoing := make(chan Message, 50)
+	const msgBufferSize = 1000
+	outgoing := make(chan Message, msgBufferSize)
 	defer func() {
 		if !h.ResetIdle() {
 			// Hard close of peer. We don't want any more data
diff --git pkg/tunnel/stream.go pkg/tunnel/stream.go
index c798e6331..597d37e48 100644
--- pkg/tunnel/stream.go
+++ pkg/tunnel/stream.go
@@ -74,7 +74,8 @@ type StreamCreator func(context.Context, ConnID) (Stream, error)
 // ReadLoop reads from the Stream and dispatches messages and error to the give channels. There
 // will be max one error since the error also terminates the loop.
 func ReadLoop(ctx context.Context, s Stream, p *CounterProbe) (<-chan Message, <-chan error) {
-	msgCh := make(chan Message, 50)
+	const msgBufferSize = 1000
+	msgCh := make(chan Message, msgBufferSize)
 	errCh := make(chan error, 1) // Max one message will be sent on this channel
 	dlog.Tracef(ctx, "   %s %s, ReadLoop starting", s.Tag(), s.ID())
 	go func() {
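
For context, the pattern these constants tune is a producer handing Message values to a buffered channel while a separate write loop drains them toward the gRPC stream; once the buffer is full, the producer blocks until the consumer catches up. The stripped-down, standard-library-only sketch below (an assumed illustration, not Telepresence's actual types) shows that backpressure, and why enlarging the buffer from 50 to 1000 mainly moves the point at which a fast producer stalls behind a slow consumer rather than removing it.

// Minimal backpressure sketch (assumption for illustration, not Telepresence code).
// A producer feeding a fixed-size channel blocks as soon as the consumer falls
// behind by more than the buffer size; a larger buffer delays that point but
// does not remove it.
package main

import (
    "fmt"
    "time"
)

func main() {
    const bufferSize = 50 // the value hardcoded in the original code paths

    outgoing := make(chan []byte, bufferSize)

    // Slow consumer, standing in for the write loop that forwards to the gRPC stream.
    go func() {
        for range outgoing {
            time.Sleep(time.Millisecond) // pretend each Send takes a while
        }
    }()

    // Fast producer, standing in for reads from the local TCP connection.
    for i := 0; i < 200; i++ {
        start := time.Now()
        outgoing <- []byte("payload")
        if d := time.Since(start); d > 100*time.Microsecond {
            fmt.Printf("message %d blocked %v waiting for buffer space\n", i, d)
        }
    }
    close(outgoing)
}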

@petergardfjall
Contributor

petergardfjall commented Jan 17, 2025

[...] but couldn't get my locally built images to work.

@phooijenga

It's a bit of work, but I can share my procedure (there might be better ones):

  1. Make sure you have a Docker Hub account (or similar).

  2. Make sure you are logged in: docker login -u <dockeruser>

  3. Build:

    export TELEPRESENCE_REGISTRY=docker.io/<dockeruser>
    export TELEPRESENCE_VERSION=v2.22.0-dev # or some other made up version
    
    TELEPRESENCE_REGISTRY=docker.io/<dockeruser> make protoc build push-images
    

    This:

    • Builds the telepresence binary under ./build-output/bin/telepresence.
    • Pushes the ${TELEPRESENCE_REGISTRY}/tel2 Docker image.
    • Pushes the ${TELEPRESENCE_REGISTRY}/telepresence Docker image.
  4. When this is done, make sure to "clear" any existing telepresence running locally and in your cluster.

    telepresence uninstall --all-agents
    telepresence quit -s
    telepresence helm uninstall
    
  5. Now you should be in a position to start using your built telepresence version:
    First, in your ~/.config/telepresence/config.yml set images.registry to
    match TELEPRESENCE_REGISTRY. This is the registry from which
    traffic-manager and traffic-agent images will be pulled.

    images:
      registry: docker.io/<dockeruser>
  6. Install your traffic-manager into the cluster.

    bindir="${PWD}/build-output/bin"
    ${bindir}/telepresence helm install --set image.pullPolicy=Always
  7. Intercept your workload. ${bindir}/telepresence intercept ...

Hope this helps!

@phooijenga
Contributor Author

Thanks for writing out the procedure; it's pretty similar to what I tried yesterday. The key difference is that I'm using a local registry because it was already set up in this cluster.


Running ./build-output/bin/telepresence helm install results in a Helm timeout. The newly installed traffic-manager fails to start:

agent-injector : goroutine "/agent-injector" exited with error: secret "mutator-webhook-tls" not found

The secret does exist:

$ kubectl get secret -n ambassador
NAME                                    TYPE                 DATA   AGE
mutator-webhook-tls                     Opaque               3      2m47s
sh.helm.release.v1.traffic-manager.v1   helm.sh/release.v1   1      2m47s

I found I can work around this with --set agentInjector.certificate.accessMethod=mount.

After this, I can connect, but list shows no workloads:

$ ./build-output/bin/telepresence helm install --set agentInjector.certificate.accessMethod=mount
Traffic Manager installed successfully

$ ./build-output/bin/telepresence connect
Launching Telepresence User Daemon
Launching Telepresence Root Daemon
Connected to context founda-k3s-1, namespace apps (https://198.19.249.86:6443)

$ ./build-output/bin/telepresence list
No Workloads (Deployments, StatefulSets, ReplicaSets, or Rollouts)

$ ./build-output/bin/telepresence version
OSS Client         : v2.22.0-dev
OSS Root Daemon    : v2.22.0-dev
OSS User Daemon    : v2.22.0-dev
OSS Traffic Manager: v2.22.0-dev
Traffic Agent      : registry.founda.dev/tel2:2.22.0-dev

@thallgren
Member

thallgren commented Jan 19, 2025

@phooijenga Running $ ab -n 5000 -c 50 http://echo-sc/ creates several thousand connections that remain open, which leads to the client TCP stack pushing back on new connect attempts. This is not normal, and I don't believe it can be fixed. I tested the same approach with some gVisor samples (i.e. just the TCP stack, no telepresence), and I get the same results.

Establishing a new connection is a fairly heavyweight operation, and a browser that loads hundreds of JavaScript files will therefore always use "keep-alive" on the connections that it has. Try adding the -k flag when running ab, and you'll see a significant difference in throughput.

That said, during my testing, I did discover a bug causing a leak of goroutines that I'll fix, and I also found some worthwhile optimizations to the network stack, so stay tuned for some improvements.
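
To put the keep-alive point in Go terms, the rough sketch below (an assumed illustration, reusing the hypothetical http://echo-sc/ URL from the repro steps) compares net/http's default connection reuse, which is roughly what ab -k enables, with forcing a fresh TCP connection per request:

// Rough comparison sketch (not from the issue). net/http's default Transport
// keeps connections alive and reuses them; DisableKeepAlives forces a new TCP
// connection for every request, similar to running ab without -k.
package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

func run(name string, client *http.Client, n int) {
    start := time.Now()
    errs := 0
    for i := 0; i < n; i++ {
        resp, err := client.Get("http://echo-sc/") // hypothetical intercepted service
        if err != nil {
            errs++
            continue
        }
        io.Copy(io.Discard, resp.Body)
        resp.Body.Close()
    }
    fmt.Printf("%s: %d requests, %d errors, took %v\n", name, n, errs, time.Since(start))
}

func main() {
    const n = 500
    run("keep-alive (default)", &http.Client{}, n)
    run("no keep-alive", &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}, n)
}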

@thallgren
Member

thallgren commented Jan 20, 2025

@petergardfjall @phooijenga I've created a 2.21.2-rc.0 release candidate. Please try it out to see if it improves the situation. The panic visible in the daemon.log should definitely not be present now.

@petergardfjall
Contributor

Appreciate the swift improvements, @thallgren!
Having only had time to test things out for a short while, I do get the impression that things might be working a little better now.

I'm still seeing some occasional errors like this one in the traffic-manager sidecar: Send of closeSend failed: send of closeSend message failed: rpc error: code = Unavailable desc = transport is closing : lis=":9900", but they appear to be less frequent.

I'm starting to suspect, though, that in my case there might also be issues with the dev server (vitejs/vite#17499). I haven't yet grasped the full dynamics of what's going on here, but I will keep you posted if I come to any revelations.

Either way, very much appreciate the work and I hope I haven't distracted too much from the root issue described by @phooijenga.

@phooijenga
Contributor Author

I'm still seeing the failed to send DialOK errors, but haven't had the panic in a while.

@abstract-entity

We have this problem too with a front end delivering a lot of JS files: randomly, some JS files get a 502 error, and every time it's different files.
If we request the files manually, it works.
