
Intermittent connection errors under high concurrency #3766

Open
phooijenga opened this issue Jan 9, 2025 · 12 comments

Comments

@phooijenga
Contributor

Describe the bug

When I send a lot of requests to an intercepted service, some connections are reset or the intercept stops working completely.

I am intercepting a web application behind an Nginx ingress which does not bundle resources when running in development mode. When accessing it, the browser loads hundreds of small JavaScript files over HTTP/2. This results in many concurrent requests from Nginx to the application backend. Some of those connections are reset, which causes Nginx to return an error to the browser. Sometimes the intercept stops working completely, causing the connection to eventually time out.

nuxt/nuxt#28424 describes the same issue.

To Reproduce

In one terminal, create an echo service deployment and intercept it. I'm using another echo server running in Docker here, but I've had the same result with a simple 'Hello World' application in Go:

$ kubectl apply -f https://github.com/telepresenceio/telepresence/raw/refs/heads/release/v2/k8s/echo-sc.yaml
service/echo-sc created
deployment.apps/echo-sc created

$ telepresence intercept echo-sc --docker-run -- --rm -p 8080:8080 jmalloc/echo-server
Using Deployment echo-sc
   Intercept name         : echo-sc
   State                  : ACTIVE
   Workload kind          : Deployment
   Destination            : 127.0.0.1:8080
   Service Port Identifier: proxied/TCP
   Volume Mount Point     : /tmp/telfs-2245137005
   Intercepting           : all TCP connections
Echo server listening on port 8080.

While the intercept is active, make many requests with high concurrency, e.g. with trusty old ApacheBench:

$ ab -n 5000 -c 50 http://echo-sc/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking echo-sc (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
apr_pollset_poll: The timeout specified has expired (70007)
Total of 3292 requests completed

(Sometimes ApacheBench fails with apr_socket_recv: Connection reset by peer (104) instead.)

The traffic-agent log contains errors like this:

2025-01-09 10:30:06.6702 error   forward-echo-sc:8080 : !! SRV tcp 10.244.0.1:48681 -> 127.0.0.1:8080, Send failed: rpc error: code = Unavailable desc = transport is closing : lis=":9900"
2025-01-09 10:30:06.6704 error   forward-echo-sc:8080 : !! SRV tcp 10.244.0.1:48681 -> 127.0.0.1:8080, Send of closeSend failed: send of closeSend message failed: rpc error: code = Unavailable desc = transport is closing : lis=":9900"

The daemon log contains errors like this:

2025-01-09 11:30:06.6701 error   daemon/session/agentPods : !! CLI tcp 10.244.0.1:48681 -> 127.0.0.1:8080, Send failed: EOF
2025-01-09 11:30:06.6701 error   daemon/session/agentPods : !! CONN tcp 10.244.0.1:48681 -> 127.0.0.1:8080, failed to send DialOK: EOF

Expected behavior

All requests should complete.

Versions (please complete the following information):

OSS Client         : v2.21.1
OSS Root Daemon    : v2.21.1
OSS User Daemon    : v2.21.1
OSS Traffic Manager: v2.21.1
Traffic Agent      : ghcr.io/telepresenceio/tel2:2.21.1

I've encountered the same problem on Ubuntu 24.10 with Kind and macOS Sequoia 15.2 with k3s in an OrbStack VM.

Logs

connector.log
daemon.log
echo-sc-7867967d69-lmrhm.apps.log
traffic-manager-588cdf459f-h7bg7.ambassador.log

@thallgren
Member

My initial thought when looking at this is that you probably hit a max-concurrent-streams limit. The first error seems to come consistently when opening more than 50 streams. Can you confirm this?
A reproducer would be very valuable. It would help identify the problem and could also become one of our regular integration tests once the problem has been properly addressed.
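
For reference, a minimal Go load generator along the lines of the ab run in the issue description might look something like the sketch below. This is only an illustration: the http://echo-sc/ URL and the 5000/50 numbers are taken from the repro steps above, and nothing here uses Telepresence APIs.

// Hypothetical reproducer sketch: fire many concurrent HTTP requests at the
// intercepted service and count the failures. Keep-alive is disabled to mimic
// ab's default of one TCP connection per request.
package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
    "sync/atomic"
)

func main() {
    const (
        url         = "http://echo-sc/" // intercepted echo service from the repro steps
        total       = 5000
        concurrency = 50
    )

    client := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}

    var failed atomic.Int64
    sem := make(chan struct{}, concurrency) // limits the number of in-flight requests
    var wg sync.WaitGroup

    for i := 0; i < total; i++ {
        wg.Add(1)
        sem <- struct{}{}
        go func() {
            defer wg.Done()
            defer func() { <-sem }()
            resp, err := client.Get(url)
            if err != nil {
                failed.Add(1) // connection reset, timeout, etc.
                return
            }
            io.Copy(io.Discard, resp.Body)
            resp.Body.Close()
            if resp.StatusCode != http.StatusOK {
                failed.Add(1)
            }
        }()
    }
    wg.Wait()
    fmt.Printf("%d of %d requests failed\n", failed.Load(), total)
}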

@petergardfjall
Contributor

petergardfjall commented Jan 16, 2025

Hey, I'm seeing similar issues when intercepting a web application and running it locally through https://vite.dev/.
Inspired by @thallgren's comment, I built telepresence locally after applying a small patch like the one below, and that at least seems to make it work a lot better.

diff --git a/pkg/tunnel/dialer.go b/pkg/tunnel/dialer.go
index ed1d72c0d..d7291eed8 100644
--- a/pkg/tunnel/dialer.go
+++ b/pkg/tunnel/dialer.go
@@ -181,7 +181,8 @@ func (h *dialer) connToStreamLoop(ctx context.Context, wg *sync.WaitGroup) {
        endLevel := dlog.LogLevelTrace
        id := h.stream.ID()
 
-       outgoing := make(chan Message, 50)
+       const msgBufferSize = 1000
+       outgoing := make(chan Message, msgBufferSize)
        defer func() {
                if !h.ResetIdle() {
                        // Hard close of peer. We don't want any more data

@thallgren
Member

Very interesting, @petergardfjall, and it's intriguing that this number is hardcoded to 50. However, I'm uncertain whether increasing it actually addresses the problem in this ticket, because the error arrives after 50 tunnel creations, not after sending 50 messages on one of them.

Again, some kind of reproducer would be extremely helpful here, so that we can monitor what's really going on.

@phooijenga
Contributor Author

phooijenga commented Jan 16, 2025 via email

@petergardfjall
Contributor

Very interesting, @petergardfjall, and it's intriguing that this number is hardcoded to 50. However, I'm uncertain whether increasing it actually addresses the problem in this ticket, because the error arrives after 50 tunnel creations, not after sending 50 messages on one of them.

Ah, right. I was just triggered by the 50 and found it hardcoded along the code path I was looking at.
@phooijenga is right that the hardcoded 50 appears in other Message channels as well.
I modified the patch to bump those numbers and also increase the maximum number of concurrent gRPC streams in the servers. (Note that I patched version v2.21.0, since that is what I'm running locally.)

I'm gonna run with this and see how it holds up.

diff --git cmd/traffic/cmd/agent/agent.go cmd/traffic/cmd/agent/agent.go
index 460c0b99e..9f8c4c24c 100644
--- cmd/traffic/cmd/agent/agent.go
+++ cmd/traffic/cmd/agent/agent.go
@@ -222,6 +222,7 @@ func TalkToManagerLoop(ctx context.Context, s State, info *rpc.AgentInfo) {
 func StartServices(ctx context.Context, g *dgroup.Group, config Config, srv State) (*rpc.AgentInfo, error) {
 	var grpcOpts []grpc.ServerOption
 	ac := config.AgentConfig()
+	grpcOpts = append(grpcOpts, grpc.MaxConcurrentStreams(0))
 
 	grpcPortCh := make(chan uint16)
 	g.Go("tunneling", func(ctx context.Context) error {
diff --git cmd/traffic/cmd/manager/manager.go cmd/traffic/cmd/manager/manager.go
index 8c730713c..6a78dec50 100644
--- cmd/traffic/cmd/manager/manager.go
+++ cmd/traffic/cmd/manager/manager.go
@@ -256,6 +256,7 @@ func (s *service) serveHTTP(ctx context.Context) error {
 	if mz, ok := env.MaxReceiveSize.AsInt64(); ok {
 		opts = append(opts, grpc.MaxRecvMsgSize(int(mz)))
 	}
+	opts = append(opts, grpc.MaxConcurrentStreams(0))
 
 	grpcHandler := grpc.NewServer(opts...)
 	httpHandler := http.Handler(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
diff --git pkg/client/rootd/service.go pkg/client/rootd/service.go
index 9ea68746d..60bd1ab83 100644
--- pkg/client/rootd/service.go
+++ pkg/client/rootd/service.go
@@ -413,6 +413,7 @@ func (s *Service) serveGrpc(c context.Context, l net.Listener) error {
 	if mz := cfg.Grpc().MaxReceiveSize(); mz > 0 {
 		opts = append(opts, grpc.MaxRecvMsgSize(int(mz)))
 	}
+	opts = append(opts, grpc.MaxConcurrentStreams(0))
 	svc := grpc.NewServer(opts...)
 	rpc.RegisterDaemonServer(svc, s)
 
diff --git pkg/tunnel/bidipipe.go pkg/tunnel/bidipipe.go
index ab7ee0792..17f5b1a27 100644
--- pkg/tunnel/bidipipe.go
+++ pkg/tunnel/bidipipe.go
@@ -69,7 +69,8 @@ func (p *bidiPipe) doPipe(
 	readBytesProbe, writeBytesProbe *CounterProbe,
 ) {
 	defer wg.Done()
-	wrCh := make(chan Message, 50)
+	const msgBufferSize = 1000
+	wrCh := make(chan Message, msgBufferSize)
 	defer close(wrCh)
 	wg.Add(1)
 	WriteLoop(ctx, b, wrCh, wg, writeBytesProbe)
diff --git pkg/tunnel/dialer.go pkg/tunnel/dialer.go
index ed1d72c0d..d7291eed8 100644
--- pkg/tunnel/dialer.go
+++ pkg/tunnel/dialer.go
@@ -181,7 +181,8 @@ func (h *dialer) connToStreamLoop(ctx context.Context, wg *sync.WaitGroup) {
 	endLevel := dlog.LogLevelTrace
 	id := h.stream.ID()
 
-	outgoing := make(chan Message, 50)
+	const msgBufferSize = 1000
+	outgoing := make(chan Message, msgBufferSize)
 	defer func() {
 		if !h.ResetIdle() {
 			// Hard close of peer. We don't want any more data
diff --git pkg/tunnel/stream.go pkg/tunnel/stream.go
index c798e6331..597d37e48 100644
--- pkg/tunnel/stream.go
+++ pkg/tunnel/stream.go
@@ -74,7 +74,8 @@ type StreamCreator func(context.Context, ConnID) (Stream, error)
 // ReadLoop reads from the Stream and dispatches messages and error to the give channels. There
 // will be max one error since the error also terminates the loop.
 func ReadLoop(ctx context.Context, s Stream, p *CounterProbe) (<-chan Message, <-chan error) {
-	msgCh := make(chan Message, 50)
+	const msgBufferSize = 1000
+	msgCh := make(chan Message, msgBufferSize)
 	errCh := make(chan error, 1) // Max one message will be sent on this channel
 	dlog.Tracef(ctx, "   %s %s, ReadLoop starting", s.Tag(), s.ID())
 	go func() {
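
For context, the pattern these constants tune is a producer handing Message values to a buffered channel while a separate write loop drains them toward the gRPC stream; once the buffer is full, the producer blocks until the consumer catches up. The stripped-down, standard-library-only sketch below (an assumed illustration, not Telepresence's actual types) shows that backpressure, and why enlarging the buffer from 50 to 1000 mainly moves the point at which a fast producer stalls behind a slow consumer rather than removing it.

// Minimal backpressure sketch (assumption for illustration, not Telepresence code).
// A producer feeding a fixed-size channel blocks as soon as the consumer falls
// behind by more than the buffer size; a larger buffer delays that point but
// does not remove it.
package main

import (
    "fmt"
    "time"
)

func main() {
    const bufferSize = 50 // the value hardcoded in the original code paths

    outgoing := make(chan []byte, bufferSize)

    // Slow consumer, standing in for the write loop that forwards to the gRPC stream.
    go func() {
        for range outgoing {
            time.Sleep(time.Millisecond) // pretend each Send takes a while
        }
    }()

    // Fast producer, standing in for reads from the local TCP connection.
    for i := 0; i < 200; i++ {
        start := time.Now()
        outgoing <- []byte("payload")
        if d := time.Since(start); d > 100*time.Microsecond {
            fmt.Printf("message %d blocked %v waiting for buffer space\n", i, d)
        }
    }
    close(outgoing)
}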

@petergardfjall
Contributor

petergardfjall commented Jan 17, 2025

[...] but couldn't get my locally built images to work.

@phooijenga

It's a bit of work, but I can share my procedure (there might be better ones):

  1. Make sure you have a Docker Hub account (or similar).

  2. Make sure you are logged in: docker login -u <dockeruser>

  3. Build:

    export TELEPRESENCE_REGISTRY=docker.io/<dockeruser>
    export TELEPRESENCE_VERSION=v2.22.0-dev # or some other made up version
    
    TELEPRESENCE_REGISTRY=docker.io/<dockeruser> make protoc build push-images
    

    This:

    • Builds the telepresence binary under ./build-output/bin/telepresence.
    • Pushes the ${TELEPRESENCE_REGISTRY}/tel2 Docker image.
    • Pushes the ${TELEPRESENCE_REGISTRY}/telepresence Docker image.
  4. When this is done, make sure to "clear" any existing telepresence running locally and in your cluster.

    telepresence uninstall --all-agents
    telepresence quit -s
    telepresence helm uninstall
    
  5. Now you should be in a position to start using your built telepresence version:
    First, in your ~/.config/telepresence/config.yml set images.registry to
    match TELEPRESENCE_REGISTRY. This is the registry from which
    traffic-manager and traffic-agent images will be pulled.

    images:
      registry: docker.io/<dockeruser>
  6. Install your traffic-manager into the cluster.

    bindir="${PWD}/build-output/bin"
    ${bindir}/telepresence helm install --set image.pullPolicy=Always
  7. Intercept your workload. ${bindir}/telepresence intercept ...

Hope this helps!

@phooijenga
Contributor Author

Thanks for writing out the procedure; it's pretty similar to what I tried yesterday. The key difference is that I'm using a local registry because it was already set up in this cluster.


Running ./build-output/bin/telepresence helm install results in a Helm timeout. The newly installed traffic-manager fails to start:

agent-injector : goroutine "/agent-injector" exited with error: secret "mutator-webhook-tls" not found

The secret does exist:

$ kubectl get secret -n ambassador
NAME                                    TYPE                 DATA   AGE
mutator-webhook-tls                     Opaque               3      2m47s
sh.helm.release.v1.traffic-manager.v1   helm.sh/release.v1   1      2m47s

I found I can work around this with --set agentInjector.certificate.accessMethod=mount.

After this, I can connect, but list shows no workloads:

$ ./build-output/bin/telepresence helm install --set agentInjector.certificate.accessMethod=mount
Traffic Manager installed successfully

$ ./build-output/bin/telepresence connect
Launching Telepresence User Daemon
Launching Telepresence Root Daemon
Connected to context founda-k3s-1, namespace apps (https://198.19.249.86:6443)

$ ./build-output/bin/telepresence list
No Workloads (Deployments, StatefulSets, ReplicaSets, or Rollouts)

$ ./build-output/bin/telepresence version
OSS Client         : v2.22.0-dev
OSS Root Daemon    : v2.22.0-dev
OSS User Daemon    : v2.22.0-dev
OSS Traffic Manager: v2.22.0-dev
Traffic Agent      : registry.founda.dev/tel2:2.22.0-dev

@thallgren
Member

thallgren commented Jan 19, 2025

@phooijenga Running $ ab -n 5000 -c 50 http://echo-sc/ creates several thousand connections that remain open, which leads to the client TCP stack pushing back on new connect attempts. This is not normal, and I don't believe it can be fixed. I tested the same approach with some gVisor samples (i.e. just the TCP stack, no telepresence), and I get the same results.

Establishing a new connection is a fairly heavyweight operation, and a browser that loads hundreds of JavaScript files will therefore always use "keep-alive" on the connections that it has. Try adding the -k flag when running ab, and you'll see a significant difference in throughput.

That said, during my testing, I did discover a bug causing a leak of goroutines that I'll fix, and I also found some worthwhile optimizations to the network stack, so stay tuned for some improvements.
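
To put the keep-alive point in Go terms, the rough sketch below (an assumed illustration, reusing the hypothetical http://echo-sc/ URL from the repro steps) compares net/http's default connection reuse, which is roughly what ab -k enables, with forcing a fresh TCP connection per request:

// Rough comparison sketch (not from the issue). net/http's default Transport
// keeps connections alive and reuses them; DisableKeepAlives forces a new TCP
// connection for every request, similar to running ab without -k.
package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

func run(name string, client *http.Client, n int) {
    start := time.Now()
    errs := 0
    for i := 0; i < n; i++ {
        resp, err := client.Get("http://echo-sc/") // hypothetical intercepted service
        if err != nil {
            errs++
            continue
        }
        io.Copy(io.Discard, resp.Body)
        resp.Body.Close()
    }
    fmt.Printf("%s: %d requests, %d errors, took %v\n", name, n, errs, time.Since(start))
}

func main() {
    const n = 500
    run("keep-alive (default)", &http.Client{}, n)
    run("no keep-alive", &http.Client{Transport: &http.Transport{DisableKeepAlives: true}}, n)
}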

@thallgren
Member

thallgren commented Jan 20, 2025

@petergardfjall @phooijenga I've created a 2.21.2-rc.0 release candidate. Please try it out to see if it improves the situation. The panic visible in the daemon.log should definitely not be present now.

@petergardfjall
Contributor

Appreciate the swift improvements, @thallgren!
Having only had time to test things out for a short while, I do get the impression that things might be working a little better now.

I'm still seeing some occasional errors like this one in the traffic-manager sidecar: Send of closeSend failed: send of closeSend message failed: rpc error: code = Unavailable desc = transport is closing : lis=":9900", but they appear to be less frequent.

I'm starting to suspect, though, that in my case there might also be issues with the dev server (vitejs/vite#17499). I haven't yet grasped the full dynamics of what's going on here, but I will keep you posted if I come to any revelations.

Either way, very much appreciate the work and I hope I haven't distracted too much from the root issue described by @phooijenga.

@phooijenga
Contributor Author

I'm still seeing the failed to send DialOK errors, but haven't had the panic in a while.

@abstract-entity

We have this problem too with a front end delivering a lot of JS files: randomly, some JS files get a 502 error, and every time it's different files.
If we request the files manually, it works.
