What is the possible cause of "Connection reset by peer" when watching ListNamespacedPodWithHttpMessagesAsync #773

Closed
AntonPetrov83 opened this issue Feb 4, 2022 · 16 comments

Comments

@AntonPetrov83

Hi!

I have an AKS (Azure Kubernetes Service) cluster, and recently, after deploying my app, I started receiving this exception:

System.IO.IOException: Unable to read data from the transport connection: Connection reset by peer.
 ---> System.Net.Sockets.SocketException (104): Connection reset by peer
   --- End of inner exception stack trace ---
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
   at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](TIOAdapter adapter, Memory`1 buffer)
   at System.Net.Http.HttpConnection.FillAsync(Boolean async)
   at System.Net.Http.HttpConnection.ChunkedEncodingReadStream.ReadAsyncCore(Memory`1 buffer, CancellationToken cancellationToken)
   at k8s.WatcherDelegatingHandler.CancelableStream.ReadAsync(Byte[] buffer, Int32 offset, Int32 count, CancellationToken cancellationToken)
   at System.IO.StreamReader.ReadBufferAsync(CancellationToken cancellationToken)
   at System.IO.StreamReader.ReadLineAsyncInternal()
   at k8s.Watcher`1.WatcherLoop(CancellationToken cancellationToken)

   ...

Later the exceptions stopped. Is this a transient error, or should I investigate?

P.S. My app is based on Microsoft.Orleans, and there is an Orleans.Hosting.Kubernetes extension, which uses the ListNamespacedPodWithHttpMessagesAsync API.

@tg123
Member

tg123 commented Feb 4, 2022

See #533.
This can be mitigated by enabling TCP keepalive or by using HTTP/2, which was included in 6.x+.

@AntonPetrov83
Author

AntonPetrov83 commented Feb 4, 2022

The strange thing is that this is the first time I have encountered such exceptions, and we have been running like this for half a year already. Another strange thing is that the exceptions stopped after some time. If you look into the code, you will see that Orleans handles the exception in a loop, so I guess it has now stabilized and works as expected...

@tg123
Member

tg123 commented Feb 4, 2022

"Connection reset by peer" can be caused by any routing node on the path to the destination endpoint.
It is an underlying network issue.

In the case of #533, we found that it was caused by the load balancer dropping idle connections. Retrying may be the solution if the cause is in the network layer.
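
As an illustration of the retry idea, here is a minimal sketch of a reconnect loop with exponential backoff. It is not the Orleans or KubernetesClient implementation; `WatchRetry`, `RunAsync`, and the `watchOnce` delegate are hypothetical names, with `watchOnce` standing in for whatever code opens the watch and reads it until the stream ends.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class WatchRetry
{
    // Runs `watchOnce` repeatedly and restarts it after transient I/O failures.
    // `watchOnce` is a hypothetical delegate: it opens a watch and returns
    // when the stream ends, or throws IOException when the connection breaks.
    public static async Task RunAsync(
        Func<CancellationToken, Task> watchOnce,
        TimeSpan initialDelay,
        TimeSpan maxDelay,
        CancellationToken cancellationToken)
    {
        var delay = initialDelay;
        while (!cancellationToken.IsCancellationRequested)
        {
            try
            {
                await watchOnce(cancellationToken).ConfigureAwait(false);
                delay = initialDelay; // clean end of stream: reset the backoff
            }
            catch (IOException) when (!cancellationToken.IsCancellationRequested)
            {
                // For example "Connection reset by peer": wait, then reconnect.
                // Cancellation during the wait surfaces as OperationCanceledException.
                await Task.Delay(delay, cancellationToken).ConfigureAwait(false);
                delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks));
            }
        }
    }
}
```

Only `IOException` is treated as transient here because that is the type in the stack trace above; a real loop would probably also need to decide how to handle HTTP-level failures.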

@zhiweiv
Contributor

zhiweiv commented Feb 7, 2022

> See #533. This can be mitigated by enabling TCP keepalive or by using HTTP/2, which was included in 6.x+.

@tg123
For client 7.x on .NET 6.0, can you give more details about HTTP/2? Can it send TCP keepalives automatically, without an OS-level change?
The workaround in #533 (comment) is too complex.

@tg123
Member

tg123 commented Feb 7, 2022

> @tg123 For client 7.x on .NET 6.0, can you give more details about HTTP/2? Can it send TCP keepalives automatically, without an OS-level change? The workaround in #533 (comment) is too complex.

HTTP/2 is used by default if your server supports it (over HTTPS).
If it does not, you have to set OS-level keepalive.

@zhiweiv
Contributor

zhiweiv commented Feb 8, 2022

I found that the workaround in #533 (comment) has been merged, and there is a KubernetesClientConfiguration.TcpKeepAlive property to control it, but it only works on .NET 5. I opened PR #777 to support .NET 6; please take a look.
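
For readers unfamiliar with the technique, the following is a rough sketch of per-socket TCP keepalive set from application code, using only .NET 5+ base class library APIs (`SocketsHttpHandler.ConnectCallback` and `Socket.SetSocketOption`). It illustrates the general approach under my own assumptions and is not necessarily how the merged workaround is implemented; the 60 s / 30 s / 3 probe values are illustrative only.

```csharp
using System;
using System.Net.Http;
using System.Net.Sockets;

var handler = new SocketsHttpHandler
{
    // Replace the default connect logic so the socket can be configured
    // before the connection is handed to the HTTP stack.
    ConnectCallback = async (context, cancellationToken) =>
    {
        var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        try
        {
            // Enable keepalive on this socket only; system defaults are untouched.
            socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
            // Illustrative values: first probe after 60 s idle, then every 30 s, 3 attempts.
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, 60);
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, 30);
            socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveRetryCount, 3);

            await socket.ConnectAsync(context.DnsEndPoint, cancellationToken).ConfigureAwait(false);
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }
};

using var httpClient = new HttpClient(handler);
```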

@zhiweiv
Contributor

zhiweiv commented Feb 8, 2022

There is no good way to change TCP keepalive at the container level; it seems to be a Linux kernel-level setting. See https://stackoverflow.com/questions/69302681/setting-tcp-keepalive-on-a-container.

I have to keep the application-level monitoring logic, which resets the watch after N minutes without new data. I have to do this because I am running the workload in AKS with a Basic Load Balancer (BLB). The BLB silently drops idle connections, whereas the Standard Load Balancer (SLB) sends an RST to both client and server (hence "Connection reset by peer"), so with the BLB the watch hangs forever after the connection is dropped.
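
The monitoring logic described above could look roughly like the sketch below. `IdleWatchdog` and `Touch` are hypothetical names, not part of KubernetesClient: the class cancels a token when no activity has been reported for the idle timeout, and the caller is assumed to tear down and recreate the watch when that happens.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Cancels its token when no activity has been reported for `idleTimeout`.
public sealed class IdleWatchdog : IDisposable
{
    private readonly CancellationTokenSource _cts;
    private readonly TimeSpan _idleTimeout;

    public IdleWatchdog(TimeSpan idleTimeout, CancellationToken outer = default)
    {
        _idleTimeout = idleTimeout;
        // Linked source: cancelling `outer` also cancels the watchdog token.
        _cts = CancellationTokenSource.CreateLinkedTokenSource(outer);
        _cts.CancelAfter(idleTimeout);
    }

    // Pass this token to the watch call so a silent stall ends the read.
    public CancellationToken Token => _cts.Token;

    // Call on every watch event; restarts the countdown if not yet cancelled.
    public void Touch() => _cts.CancelAfter(_idleTimeout);

    public void Dispose() => _cts.Dispose();
}
```

A caller would create one watchdog per watch, call `Touch()` from the event callback, and restart the watch whenever `Token` is cancelled while the outer token is not.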

@zhiweiv
Contributor

zhiweiv commented Feb 8, 2022

I wonder whether SocketsHttpHandler.KeepAlivePingDelay in HTTP/2 can send keepalives correctly without changing the kernel setting. The documentation does not say: https://docs.microsoft.com/en-us/dotnet/api/system.net.http.socketshttphandler.keepalivepingdelay?view=net-6.0.

@tg123
Member

tg123 commented Feb 8, 2022

@zhiweiv see #590 and #592.
SocketsHttpHandler should be the default handler once autorest2 is removed.
I believe HTTP/2 was tested with #590, and #590 should get merged.

@zhiweiv
Contributor

zhiweiv commented Feb 9, 2022

Is it OK to update the code at https://github.com/kubernetes-client/csharp/blob/master/src/KubernetesClient/Kubernetes.ConfigInit.cs#L195 to the following now?
This enables HTTP/2 keepalive pings when KubernetesClientConfiguration.TcpKeepAlive is true. I can test whether it works in my AKS cluster once it is merged.

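var sh = new SocketsHttpHandler
{
    // Send HTTP/2 PING frames only while requests (such as a watch) are active.
    KeepAlivePingPolicy = HttpKeepAlivePingPolicy.WithActiveRequests,
    // Send a ping after 3 minutes without traffic on the connection.
    KeepAlivePingDelay = new TimeSpan(0, 3, 0),
    // Close the connection if the ping is not acknowledged within 30 seconds.
    KeepAlivePingTimeout = new TimeSpan(0, 0, 30)
};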

@tg123
Member

tg123 commented Feb 9, 2022

@zhiweiv thanks.
We forgot to merge #590; let's do it.

cc @brendanburns

@zhiweiv
Contributor

zhiweiv commented Feb 9, 2022

As for KeepAlivePingDelay, 3 minutes is a bit long: the idle timeout of the AKS Basic Load Balancer is 4 minutes, and I am afraid the margin is not enough in that case. Can you recommend a value for KeepAlivePingDelay?

@tg123
Member

tg123 commented Feb 11, 2022

> As for KeepAlivePingDelay, 3 minutes is a bit long: the idle timeout of the AKS Basic Load Balancer is 4 minutes, and I am afraid the margin is not enough in that case. Can you recommend a value for KeepAlivePingDelay?

Let's test 3 minutes first.

@zhiweiv
Contributor

zhiweiv commented Feb 14, 2022

PR submitted, please take a look.

@zhiweiv
Contributor

zhiweiv commented Feb 17, 2022

Version 7.0.13 fixed "Connection reset by peer" in my test. @AntonPetrov83, you can give it a try. The .NET version should be 5.0+.

@tg123
Member

tg123 commented Feb 17, 2022

> Version 7.0.13 fixed "Connection reset by peer" in my test. @AntonPetrov83, you can give it a try. The .NET version should be 5.0+.

Thanks for the PR and the verification.
