
Watch Method stops watching #533

Closed
pbalexlear opened this issue Dec 6, 2020 · 49 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@pbalexlear

When using a watch method, it appears to stop watching after a certain time period. I'm not sure if this is a task deadlock or a stream timeout, but after around 30 minutes new events are no longer triggered even if changes are made on the cluster. The watch object also says it's still watching, but when debugging and looking at the tasks, they are awaiting the StreamReader's ReadLine call.

@pbalexlear
Author

Having spent a lot of time looking into this, I believe it's due to a bug in the SslStream, potentially a deadlock, but essentially the await calls on the read stream method just hang indefinitely and never get called back.

@tg123
Member

tg123 commented Dec 7, 2020

Did your server close the connection?

@pbalexlear
Author

Not that I can see; either way, were that the case I would expect an exception to be thrown, which doesn't happen.

@brendandburns
Contributor

It's also quite possible that something in the network path timed out. In many cases you need to send keep-alives to keep the TCP stream open (or you need to time out the watch at a lower number, like 5 minutes).
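A minimal sketch of that second suggestion (a short server-side watch timeout plus a reconnect loop), assuming the generated ListNamespacedPodWithHttpMessagesAsync overload exposes the standard timeoutSeconds list parameter:

using System;
using System.Threading;
using System.Threading.Tasks;
using k8s;
using k8s.Models;

internal static class ShortTimeoutWatch
{
    // Re-create the watch every few minutes so the connection is torn down and
    // rebuilt before an idle middlebox can silently drop it.
    public static async Task WatchForeverAsync(IKubernetes client, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var closed = new TaskCompletionSource<bool>();

            // timeoutSeconds asks the API server to end the watch after ~5 minutes.
            var podlistResp = client.ListNamespacedPodWithHttpMessagesAsync(
                "default", watch: true, timeoutSeconds: 300, cancellationToken: token);

            using (podlistResp.Watch<V1Pod, V1PodList>(
                (type, item) => Console.WriteLine($"{type} {item.Metadata.Name}"),
                onError: e => closed.TrySetResult(true),
                onClosed: () => closed.TrySetResult(true)))
            {
                await closed.Task; // wait for the server-side close, then loop and re-watch
            }
        }
    }
}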

@pbalexlear
Author

Maybe, but I'm using the default implementation of this client, and by default it isn't working; it doesn't throw an exception, it just seems to get stuck in an await cycle.
I did try changing the HttpClient inside this library to have an infinite timeout, but this didn't seem to change the behaviour.
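For reference, the infinite-timeout change mentioned above presumably amounts to something like the following (a hypothetical snippet; Timeout is the standard HttpClient property, not a setting this library exposes):

// Hypothetical: lift HttpClient's default 100-second request timeout.
// This only removes the client-side request timeout; it does nothing to keep an
// idle TCP connection alive, consistent with it not changing the behaviour.
client.HttpClient.Timeout = System.Threading.Timeout.InfiniteTimeSpan;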

@eskaufel

eskaufel commented Dec 9, 2020

@pbalexlear I think I have the same issue. Just out of curiosity, are you using Windows or Linux?

@pbalexlear
Author

@eskaufel

I was doing this on a Mac initially, but as the task debugger is better in Visual Studio on Windows, I tried on Windows with the same results and was able to see in more detail what was happening. I've since deleted the Windows VM, so I no longer have that information, but given this happens to me every time after maybe 5 minutes of no events, I would have thought it would be fairly easy to reproduce.

It does look like it's potentially a bug in SSL streams, as this behaviour doesn't happen when you do a kube proxy and watch locally over the HTTP address.

@pbalexlear
Author

Ideally there would be some exception thrown and then I could restart the watcher, but as there isn't, I cannot capture the event. And as the 'Watching' indicator is set in the same method that's stuck awaiting a task result, it never gets changed either, so I cannot tell whether a watcher truly is watching or not.

@tg123
Member

tg123 commented Dec 9, 2020

Could you please create a test case in the e2e project?

Then I will try to understand and fix it.

@pbalexlear
Author

@tg123 Could you explain a bit more what you mean by creating a test case in the e2e project? I'm happy to assist in any way I can, although I think the best way for you to understand this issue is to recreate it.

@tg123
Member

tg123 commented Dec 9, 2020

Create a test case for your bug
so I can easily repro it in my dev env to help you.

public void SimpleTest()

@adoprog

adoprog commented Dec 9, 2020

I have the same issue when using https://github.com/falox/csharp-operator-sdk, which is based on this project. It works for 5-10 minutes, then hangs - no errors thrown.

@pbalexlear
Author

@tg123 I have created a test here:
pbalexlear@42392ab
I couldn't use the mock API server, as it doesn't listen on HTTPS and this issue only occurs when watching an HTTPS endpoint; I think it's due to an issue in the SslStream class.

@pbalexlear
Author

Also, please be aware that in order to run this test there needs to be a namespace called 'testns' in the k8s cluster of your current config.

@tg123
Member

tg123 commented Dec 14, 2020

@pbalexlear I am trying to understand your case.

Here is what I got after cherry-picking your test case:
[screenshot of the test run]

Should it fail?

@pbalexlear
Author

Hello, it's not possible for the test to run in 1.7 minutes, as it has a 5-minute Task.Delay, which is when the issue occurs.

The scenario is supposed to be (a rough sketch of this as a test follows below):

  • connect to the API server
  • watch for events
  • wait for any existing events to come through
  • create a new pod
  • wait for the new pod event
  • wait for 5 minutes where no events come through
  • create a pod
    - a new event should come through but doesn't; this is the bug
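A rough sketch of that scenario as an xUnit test (illustrative only; the names and structure differ from the actual committed test, and it assumes the 'testns' namespace already exists):

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using k8s;
using k8s.Models;
using Xunit;

public class WatchIdleTests
{
    [Fact]
    public async Task WatchKeepsDeliveringEventsAfterLongIdle()
    {
        var config = KubernetesClientConfiguration.BuildConfigFromConfigFile();
        IKubernetes client = new Kubernetes(config);

        var eventReceived = new SemaphoreSlim(0);
        var watchTask = client.ListNamespacedPodWithHttpMessagesAsync("testns", watch: true);
        using var watcher = watchTask.Watch<V1Pod, V1PodList>((type, pod) => eventReceived.Release());

        // create a pod and confirm the watch sees it
        await CreatePod(client, "watch-test-1");
        Assert.True(await eventReceived.WaitAsync(TimeSpan.FromSeconds(30)));

        // idle long enough for the connection to be silently dropped
        await Task.Delay(TimeSpan.FromMinutes(5));

        // create another pod; with the bug present, no event ever arrives here
        await CreatePod(client, "watch-test-2");
        Assert.True(await eventReceived.WaitAsync(TimeSpan.FromSeconds(30)),
            "no watch event received after the idle period");
    }

    private static Task<V1Pod> CreatePod(IKubernetes client, string name) =>
        client.CreateNamespacedPodAsync(new V1Pod
        {
            Metadata = new V1ObjectMeta { Name = name },
            Spec = new V1PodSpec
            {
                Containers = new List<V1Container> { new V1Container { Name = "nginx", Image = "nginx" } },
            },
        }, "testns");
}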

@eskaufel

I have spent some time on this today. First, I cannot find any specific reason for this to happen, but I did find a similar case in the Java SDK: kubernetes-client/java#1370.
The solution there was to configure a keep-alive ping for the HttpClient. Unfortunately, this does not exist in .NET Core, only in .NET 5: dotnet/runtime#31198.

I am not sure this will be the solution, as I have not had time to confirm it.
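For what it's worth, the .NET 5 API referenced above looks roughly like this when wired up by hand (a sketch, not something this library exposed at the time; these are HTTP/2 pings, so they only help when the API server is reached over HTTP/2):

using System;
using System.Net.Http;

// HTTP/2 keep-alive pings, available on SocketsHttpHandler starting with .NET 5.
var handler = new SocketsHttpHandler
{
    KeepAlivePingDelay = TimeSpan.FromSeconds(30),    // ping after 30s with no traffic
    KeepAlivePingTimeout = TimeSpan.FromSeconds(10),  // tear the connection down if the ping is not answered
    KeepAlivePingPolicy = HttpKeepAlivePingPolicy.WithActiveRequests,
};
var httpClient = new HttpClient(handler);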

@pbalexlear
Author

Yes, I think there is a bug in .NET Core, particularly as this only happens over SSL streams.

@tg123
Member

tg123 commented Dec 16, 2020

Allow me some time;
I will take a deeper look at the weekend.

@tg123
Member

tg123 commented Dec 20, 2020

I still cannot repro this.
Could you please create an easy repro test tool based on mine for me to investigate?

Here is how I test it (latest master 0e3cb94):

$ cat /etc/issue
Ubuntu 20.04.1 LTS

$ dotnet --version
5.0.101

Server

$ minikube start
😄  minikube v1.15.1 on Ubuntu 20.04
✨  Using the docker driver based on existing profile

🧯  The requested memory allocation of 1921MiB does not leave room for system overhead (total system  memory: 2407MiB). You may face stability issues.
💡  Suggestion: Start minikube with less memory allocated: 'minikube start --memory=2200mb'

👍  Starting control plane node minikube in cluster minikube
🔄  Restarting existing docker container for "minikube" ...
🐳  Preparing Kubernetes v1.20.0 on Docker 19.03.8 ...
🔎  Verifying Kubernetes components...
🌟  Enabled addons: storage-provisioner, default-storageclass
🏄  Done! kubectl is now configured to use "minikube" cluster and "" namespace by default

Client code (from watch example)

   private static void Main(string[] args)
        {
            var config = KubernetesClientConfiguration.BuildConfigFromConfigFile();

            IKubernetes client = new Kubernetes(config);

            var podlistResp = client.ListNamespacedPodWithHttpMessagesAsync("default", watch: true);
            using (podlistResp.Watch<V1Pod, V1PodList>((type, item) =>
            {
                Console.WriteLine(DateTime.Now +   " == on watch event==");
                Console.WriteLine(type);
                Console.WriteLine(item.Metadata.Name);
                Console.WriteLine(DateTime.Now + ", ==on watch event==");
            }, onError: (exception =>
            {
                Console.WriteLine(exception);
            })))
            {
                Console.WriteLine("press ctrl + c to stop watching");

                var ctrlc = new ManualResetEventSlim(false);
                Console.CancelKeyPress += (sender, eventArgs) => ctrlc.Set();
                ctrlc.Wait();
            }
        }

create pod periodically (random delay)

kubectl run nginx-$RANDOM --image=nginx

Here is the log output:

press ctrl + c to stop watching
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-10583
12/20/2020 11:18:34 AM, ==on watch event==
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-22296
12/20/2020 11:18:34 AM, ==on watch event==
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-23595
12/20/2020 11:18:34 AM, ==on watch event==
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-12364
12/20/2020 11:18:34 AM, ==on watch event==
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-6853
12/20/2020 11:18:34 AM, ==on watch event==
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-3122
12/20/2020 11:18:34 AM, ==on watch event==
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-27359
12/20/2020 11:18:34 AM, ==on watch event==

12/20/2020 11:19:06 AM == on watch event==
Modified
nginx-27359
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Deleted
nginx-27359
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Modified
nginx-6853
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Deleted
nginx-6853
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Modified
nginx-3122
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Deleted
nginx-3122
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Modified
nginx-23595
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Deleted
nginx-23595
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Modified
nginx-12364
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Deleted
nginx-12364
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:07 AM == on watch event==
Modified
nginx-10583
12/20/2020 11:19:07 AM, ==on watch event==
12/20/2020 11:19:07 AM == on watch event==
Deleted
nginx-10583
12/20/2020 11:19:07 AM, ==on watch event==
12/20/2020 11:19:07 AM == on watch event==
Modified
nginx-22296
12/20/2020 11:19:07 AM, ==on watch event==
12/20/2020 11:19:07 AM == on watch event==
Deleted
nginx-22296
12/20/2020 11:19:07 AM, ==on watch event==
12/20/2020 11:19:29 AM == on watch event==
Added
nginx-8242
12/20/2020 11:19:29 AM, ==on watch event==
12/20/2020 11:19:29 AM == on watch event==
Modified
nginx-8242
12/20/2020 11:19:29 AM, ==on watch event==
12/20/2020 11:19:29 AM == on watch event==
Modified
nginx-8242
12/20/2020 11:19:29 AM, ==on watch event==
12/20/2020 11:19:32 AM == on watch event==
Modified
nginx-8242
12/20/2020 11:19:32 AM, ==on watch event==

12/20/2020 11:21:02 AM == on watch event==
Added
nginx-19397
12/20/2020 11:21:02 AM, ==on watch event==
12/20/2020 11:21:02 AM == on watch event==
Modified
nginx-19397
12/20/2020 11:21:02 AM, ==on watch event==
12/20/2020 11:21:02 AM == on watch event==
Modified
nginx-19397
12/20/2020 11:21:02 AM, ==on watch event==
12/20/2020 11:21:05 AM == on watch event==
Modified
nginx-19397
12/20/2020 11:21:05 AM, ==on watch event==
12/20/2020 11:27:22 AM == on watch event==
Added
nginx-11549
12/20/2020 11:27:22 AM, ==on watch event==
12/20/2020 11:27:22 AM == on watch event==
Modified
nginx-11549
12/20/2020 11:27:22 AM, ==on watch event==
12/20/2020 11:27:22 AM == on watch event==
Modified
nginx-11549
12/20/2020 11:27:22 AM, ==on watch event==
12/20/2020 11:27:25 AM == on watch event==
Modified
nginx-11549
12/20/2020 11:27:25 AM, ==on watch event==
12/20/2020 11:35:48 AM == on watch event==
Added
nginx-31585
12/20/2020 11:35:48 AM, ==on watch event==
12/20/2020 11:35:48 AM == on watch event==
Modified
nginx-31585
12/20/2020 11:35:48 AM, ==on watch event==
12/20/2020 11:35:48 AM == on watch event==
Modified
nginx-31585
12/20/2020 11:35:48 AM, ==on watch event==
12/20/2020 11:35:52 AM == on watch event==
Modified
nginx-31585
12/20/2020 11:35:52 AM, ==on watch event==
12/20/2020 11:38:45 AM == on watch event==
Modified
nginx-11549
12/20/2020 11:38:45 AM, ==on watch event==
12/20/2020 11:38:45 AM == on watch event==
Modified
nginx-19397
12/20/2020 11:38:45 AM, ==on watch event==
12/20/2020 11:38:45 AM == on watch event==
Modified
nginx-31585
12/20/2020 11:38:45 AM, ==on watch event==
12/20/2020 11:38:45 AM == on watch event==
Modified
nginx-8242
12/20/2020 11:38:45 AM, ==on watch event==
12/20/2020 11:38:46 AM == on watch event==
Modified
nginx-19397
12/20/2020 11:38:46 AM, ==on watch event==
12/20/2020 11:38:46 AM == on watch event==
Modified
nginx-11549
12/20/2020 11:38:46 AM, ==on watch event==
12/20/2020 11:38:46 AM == on watch event==
Modified
nginx-31585
12/20/2020 11:38:46 AM, ==on watch event==
12/20/2020 11:38:46 AM == on watch event==
Modified
nginx-8242
12/20/2020 11:38:46 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Modified
nginx-31585
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Deleted
nginx-31585
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Modified
nginx-8242
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Deleted
nginx-8242
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Modified
nginx-19397
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Deleted
nginx-19397
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Modified
nginx-11549
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Deleted
nginx-11549
12/20/2020 11:38:56 AM, ==on watch event==

I did the same thing on Windows; can't repro there either.

dotnet --version
5.0.101

Microsoft Windows [Version 10.0.19042.685]

@eskaufel

Hi, strange that you cannot reproduce it. I reproduced it using the same example. My app is running inside an AKS service, but it also fails on Windows. I am pretty confident that the trigger is a long duration without any changes (about 5 min). I will publish an example later today or tomorrow.

@tg123
Member

tg123 commented Dec 20, 2020

I did a tcpdump on the connection to monitor what happened.

After ~30 min the client shut down the connection;
however, the onClosed event was triggered. Could you please check whether onClosed works on your side?
I think we can re-establish the connection when onClosed fires.

Please also run a tcpdump when you see the connection get stuck.

195	2189.998901	172.18.0.1	172.18.0.2	TCP	66	43106 → 8443 [FIN, ACK] Seq=1542 Ack=27282 Win=64128 Len=0 TSval=787227117 TSecr=2749129859
196	2189.999444	172.18.0.2	172.18.0.1	TLSv1.3	90	Application Data
197	2189.999534	172.18.0.1	172.18.0.2	TCP	54	43106 → 8443 [RST] Seq=1543 Win=0 Len=0

@eskaufel

eskaufel commented Dec 21, 2020

When onClosed is triggered, everything is good.

So I copied @tg123's approach with a minikube installation, and it worked just fine. Then I booted an AKS cluster, and it did not.

I expected nothing from the tcpdump, but I got a reset without onClosed being triggered.

07:20:21.453280 IP zim.40204 > 111.222.111.222.https: Flags [F.], seq 2544590575, ack 778394570, win 781, options [nop,nop,TS val 992367345 ecr 3598969883], length 0
07:20:22.466308 IP 111.222.111.222.https > zim.42448: Flags [S.], seq 3156301740, ack 922143489, win 65160, options [mss 1412,sackOK,TS val 3599329751 ecr 992368332,nop,wscale 7], length 0
.
.
.
07:21:30.276370 IP zim.42448 > 111.222.111.222.https: Flags [.], ack 59924, win 645, options [nop,nop,TS val 992436168 ecr 3599397561], length 0
07:25:41.074788 IP 111.222.111.222.https > zim.42448: Flags [R], seq 3156361353, win 0, length 0

Nothing more in the tcpdump after this point

I logged the status of watcher.Watching every minute

07:20 Is watching: True
07:20:Watcher_OnEvent: app1-forward-566d848454-2w282 - Added
07:21 Is watching: True
07:21:Watcher_OnEvent: app1-snowball-7f5c6bfcff-vq9g7 - Modified
07:22 Is watching: True
07:23 Is watching: True
07:24 Is watching: True
07:25 Is watching: True
07:26 Is watching: True
07:27 Is watching: True
.
.

@tg123
Member

tg123 commented Dec 21, 2020

See my later reply and workaround.

There are 2 issues now:

1. The AWS LB (maybe) sent an RST to close the connection after 5 min idle.

echo 100 > /proc/sys/net/ipv4/tcp_keepalive_time

Let's do this and see if it mitigates the issue.
You should see Flags [P.] in the tcpdump after turning on TCP keepalive.

2. The underlying connection did not honor the RST.

I do not have an AWS K8S cluster handy (but Azure, lol).
Please take a connection status snapshot using netstat when you see the RST.

I will try to mirror the behavior of the RST and update my findings here.

@pbalexlear
Author

Hello, I have only experienced this issue when using AKS and an HTTPS connection. If you use the same setup I would expect you to be able to replicate it, as it happens every time for me with AKS and an HTTPS connection.

Thanks

Alex

@tg123
Member

tg123 commented Dec 22, 2020

I created an AKS cluster and tested.
This issue is due to the LB sending an RST on the connection after 5 min idle while the client side still shows it as ESTABLISHED.

We can set TCP keep-alive to stop the remote side from kicking us off.

Set the TCP keep-alive params:

$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
10

$ cat /proc/sys/net/ipv4/tcp_keepalive_time
10

Here is my workaround, .NET 5 only due to SocketsHttpHandler.
I am looking at whether we can find a solution to set TCP keep-alive in netstandard;
however, I do recommend you update to .NET 5.

using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;
using System.Reflection;
using System.Threading;
using k8s;
using k8s.Models;


namespace watch
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            var config = KubernetesClientConfiguration.BuildConfigFromConfigFile();

            var client = new Kubernetes(config);

            {
                // Use reflection to reach the HttpClientHandler buried inside the
                // generated client's delegating handler chain.
                var hf = typeof(HttpMessageInvoker).GetField("_handler", BindingFlags.Instance | BindingFlags.NonPublic);
                var h = hf.GetValue(client.HttpClient);

                while (!(h is HttpClientHandler))
                {
                    h = ((DelegatingHandler) h).InnerHandler;
                }

                // Swap in a SocketsHttpHandler whose ConnectCallback opens the socket
                // itself, with TCP keep-alive enabled.
                var sh = new SocketsHttpHandler();
                sh.ConnectCallback = async (context, token) =>
                {
                    var socket = new Socket(SocketType.Stream, ProtocolType.Tcp)
                    {
                        NoDelay = true,
                    };

                    socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);

                    var ip = Dns.GetHostEntry(context.DnsEndPoint.Host).AddressList.First();
                    var ep = new IPEndPoint(ip, context.DnsEndPoint.Port);

                    // Console.WriteLine(ep);
                    await socket.ConnectAsync(ep, token).ConfigureAwait(false);


                    return new NetworkStream(socket, ownsSocket: true);
                };

                // Replace the HttpClientHandler's private underlying handler with the
                // keep-alive-enabled SocketsHttpHandler built above.
                var p = h.GetType().GetField("_underlyingHandler", BindingFlags.NonPublic | BindingFlags.Instance);
                p.SetValue(h, (sh));
            }

            {
                // Re-run the client's private initialization against the same config.
                var m = client.GetType().GetMethod("InitializeFromConfig", BindingFlags.NonPublic | BindingFlags.Instance);
                m.Invoke(client, new []{config});
            }


           var podlistResp = client.ListNamespacedPodWithHttpMessagesAsync("default", watch: true);
            using (podlistResp.Watch<V1Pod, V1PodList>((type, item) =>
            {
                Console.WriteLine(DateTime.Now + " == on watch event==");
                Console.WriteLine(type);
                Console.WriteLine(item.Metadata.Name);
                Console.WriteLine(DateTime.Now + ", ==on watch event==");
            }, onError: (exception =>
            {
                Console.WriteLine(exception);
            }), onClosed: () =>
            {
                Console.WriteLine("close");
            }))
            {
                Console.WriteLine("press ctrl + c to stop watching");

                var ctrlc = new ManualResetEventSlim(false);
                Console.CancelKeyPress += (sender, eventArgs) => ctrlc.Set();
                ctrlc.Wait();
            }
        }
    }
}

Log update after 10 min idle:

...
nginx-4176
12/22/2020 8:26:35 AM, ==on watch event==
12/22/2020 8:26:35 AM == on watch event==
Added
nginx-11719
12/22/2020 8:26:35 AM, ==on watch event==
12/22/2020 8:35:56 AM == on watch event==
Added
nginx-22531
12/22/2020 8:35:56 AM, ==on watch event==
12/22/2020 8:35:56 AM == on watch event==
...

@chengaofeng

kube-apiserver has an option --min-request-timeout (int, default: 1800); to spread out load, the API server actively times out these connections.

@pbalexlear
Author

Thanks @tg123, I am happy to upgrade to .NET 5. Will this fix be added to a release?

@eskaufel

@tg123 great work; sorry for being so quiet, but Christmas...
I will test your patch in a few days. I am still confused as to why the client thinks the connection is still active.

@tg123
Member

tg123 commented Jan 31, 2021

@roblapp did you try my workaround?

@eskaufel

eskaufel commented Feb 3, 2021

@roblapp I followed the changes suggested by @tg123 and got it running stably. There might still be an issue when running it in a hosted service. In our case, the root cause was the Azure LB, not AWS. To make this bomb-proof, I added a timer: when the duration between events passes the TTL limit, we recreate the watch (a sketch of this follows below). This only kicks in in stale environments, but it provides another layer of stability, as it will resolve any network issue causing a half-open connection. It is not beautiful :)
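A minimal sketch of that watchdog idea (illustrative only; the TTL value is arbitrary and RestartWatch() is a placeholder for whatever disposes the stuck watcher and creates a new one):

using System;
using System.Threading;
using k8s;
using k8s.Models;

// Restart the watch whenever no event has been seen for longer than the TTL.
var ttl = TimeSpan.FromMinutes(4);   // shorter than the ~5 min idle that kills the connection
var lastEvent = DateTime.UtcNow;

// Pass this as the onEvent callback when creating the watch.
void OnWatchEvent(WatchEventType type, V1Pod pod)
{
    lastEvent = DateTime.UtcNow;
    // ... handle the event ...
}

var watchdog = new Timer(_ =>
{
    if (DateTime.UtcNow - lastEvent > ttl)
    {
        lastEvent = DateTime.UtcNow;
        RestartWatch();              // hypothetical: dispose the old watcher and start a new one
    }
}, null, TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1));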

@roblappcoupa

roblappcoupa commented Feb 9, 2021

@eskaufel Are you passing the resource version in when establishing the watch? I added support for it, but then I wasn't sure how you were supposed to obtain it initially. I'd be very curious to see your solution, even if it's in pseudocode; I am interested in seeing how you are doing it. My watch functionality was kicked off from an IHostedService in ASP.NET Core like so:

while (!cancellationToken.IsCancellationRequested)
{
    try
    {
        this.logger.LogInformation("[Monitor] Creating Kubernetes Client");

        using (var kubernetesClient = this.kubernetesClientFactory.CreateKubernetesClient())
        {
            // See https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
            this.logger.LogInformation("[Monitor] Watching namespace {Ns}", this.options.Value.Kubernetes.ExecutionNamespace);
            await this.TryWatchJobs(kubernetesClient, null, cancellationToken);
            this.logger.LogInformation("[Monitor] Stopped watching namespace {Ns}", this.options.Value.Kubernetes.ExecutionNamespace);
        }
    }
    catch (Exception exception)
    {
        this.logger.LogError(exception, "[Monitor] An exception was thrown during a monitor loop iteration");
    }
    finally
    {
        // If cancellation has been set, do not bother waiting
        if (!cancellationToken.IsCancellationRequested)
        {
            await Task.Delay(this.options.Value.Monitoring.MonitorLoopReconnectDelay, cancellationToken);
        }
    }
}

This code calls this method:

private Task TryWatchJobs(IKubernetes kubernetesClient, string resourceVersion, CancellationToken cancellationToken)
{
	try
	{
		var connectionClosedResetEvent = new ManualResetEvent(false);

		this.logger.LogInformation("Getting job Task for Watch");

		// Note: this task gets awaited within the Watch<T, K>(...)
		var jobs = kubernetesClient.ListNamespacedJobWithHttpMessagesAsync(
			namespaceParameter: this.options.Value.Kubernetes.ExecutionNamespace,
			watch: true,
			resourceVersion: resourceVersion,
			cancellationToken: cancellationToken);

		this.logger.LogInformation("Starting Kubernetes Watch");

		using (jobs.Watch<V1Job, V1JobList>(
			(watchEventType, v1Job) =>
			{
				this.OnWatchEvent(watchEventType, v1Job);
			},
			exception =>
			{
				this.OnWatchException(exception);
			},
			() =>
			{
				this.OnWatchClosed();
				connectionClosedResetEvent.Set();
			}))
		{
			this.logger.LogInformation("Invoking WaitHandle.WaitAny");
			
			var indexOfOperationThatChangedState = WaitHandle.WaitAny(new[] { cancellationToken.WaitHandle, connectionClosedResetEvent });

			switch (indexOfOperationThatChangedState)
			{
				case 0:
					this.logger.LogInformation("WaitHandle.WaitAny returned. Reason: cancellation was signaled");
					break;
				case 1:
					this.logger.LogInformation("WaitHandle.WaitAny returned. Reason: watch HTTP connection was closed");
					break;
				default:
					this.logger.LogInformation("WaitHandle.WaitAny returned. Reason: could not be determined");
					break;
			}
		}
	}
	catch (Exception exception)
	{
		this.logger.LogWarning(
			exception,
			"An Exception was thrown on the Kubernetes Watch");
	}
	finally
	{
		this.logger.LogInformation("Completed Kubernetes Watch");
	}

	return Task.CompletedTask;
}

If anyone sees something wrong with this, please let me know and I will be eternally grateful. I ended up moving to a polling-based solution, which works for my use case, but I would much rather get the watch working correctly.

P.S. It might be worth noting that my cluster is in EKS. I am curious to see if that is going to be a commonality for those experiencing this issue.

@roblappcoupa

roblappcoupa commented Feb 9, 2021

@tg123 I did not try that solution yet. I may give it a shot within the next week or so. Thanks!

P.S. It might be worth noting that my cluster is in EKS. I am curious to see if that is going to be a commonality for those experiencing this issue.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 9, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@paulsavides

Hello, I am encountering something very similar to this issue using version 5.0.13 of the KubernetesClient on .NET 5.

Could somebody clarify whether this issue is expected to be fixed or not? If not, I can try to dig deeper and determine whether it is the same issue.

@roblappcoupa

roblappcoupa commented Jul 27, 2021

@paulsavides Unfortunately, I kept running into this issue. The watch would disconnect and there was nothing to indicate what happened. After days of not being able to resolve it, I abandoned the watch functionality completely in favor of polling. I posted the code that I was using to perform the watch here. In my new solution, I have a background process that polls the V1Job API on a 15-second interval. For my use case this is good enough, as I don't need to capture events instantly. For testing, I simulated thousands of requests per minute against the API server and saw no noticeable impact on the cluster or the Kubernetes API server.

Unfortunately, I don't think a solution was found, nor was I able to figure it out, which is why I changed my approach to polling.
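For completeness, the polling fallback described above boils down to something like this (a sketch; the real service presumably diffs the returned jobs against previously seen state):

using System;
using System.Threading;
using System.Threading.Tasks;
using k8s;
using k8s.Models;

internal static class JobPoller
{
    // Poll the Jobs API on a fixed interval instead of holding a long-lived watch.
    public static async Task PollJobsAsync(IKubernetes client, string ns, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            V1JobList jobs = await client.ListNamespacedJobAsync(ns, cancellationToken: token);
            foreach (var job in jobs.Items)
            {
                // compare against previously seen state and react to changes here
                Console.WriteLine($"{job.Metadata.Name}: active={job.Status?.Active ?? 0}");
            }

            await Task.Delay(TimeSpan.FromSeconds(15), token);
        }
    }
}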

@karok2m

karok2m commented Aug 1, 2022

@tg123, is the modification to the underlying message handler still required (referring to #533 (comment)) to enable TCP keep-alive messages? Or is it enabled by default as part of creating the Kubernetes client?

@tg123
Member

tg123 commented Aug 1, 2022

@karok2m it is enabled by default on net5+.
Please make sure your API server supports HTTP/2 (just check that it is https://).

@samfullman

I have not resolved this issue myself, but if it helps anyone, this code is a workaround when using the CLI/kubectl:

slm-watch(){
    if [[ -z $1 ]]; then
        echo "\nCurrent context has these CRDs:"
        i=-1
        kubectl get crd -A | awk '{ print $1 }' | while read line; do
            i=$((i + 1))
            if [[ $i -eq 0 ]]; then
                continue
            fi
            CHOICES[$i]=$line
            echo $i. $line;
        done
        echo
        read "CHOICE?Select a number between 1 and $i: "
        CHOICE=$CHOICES[$CHOICE]
    else
        CHOICE=$1
    fi
    echo OK, getting $CHOICE objects ...
    while true; do
        kubectl get $CHOICE -A -w | while read line; do
            echo "$line      $(date +%T)"
        done
        echo "\nDamn, lost the watcher! What happened? ..."
        sleep 2
    done
}
# example: `slm-watch storagesnapshots`

Quite obviously a hack, and you lose the declarativeness because events "rebuild" on re-connection, but it works for me.

This is on a MacBook running minikube

@lknite
Contributor

lknite commented May 13, 2023

Is there a good solution for this now? (I'm currently seeing timeouts watching for changes to instances of a CRD.) In developing a controller, I want to perform activities in response to a watch event. I'm checking whether it's possible to have an infinite --min-request-timeout, but it occurs to me there may be a race condition. I would hope the API could reconnect as necessary and "catch up" on any missed events that occurred while reconnecting. If not, what's the algorithm for dealing with the watch functionality coming and going? (I am using C# on Linux in a pod.)

Does the API do this (reconnect with the last resource version)?
[screenshot attachment]
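Nothing in this thread suggests the client reconnects on its own, so the usual controller pattern (not specific to this library) is to track the last seen resourceVersion yourself, restart the watch from it, and fall back to a fresh list when the stored version is too old (the server reports 410 Gone). A rough sketch, shown for pods since a CRD's generated types follow the same shape:

using System;
using System.Threading;
using System.Threading.Tasks;
using k8s;
using k8s.Models;

internal class ResumableWatch
{
    private string resourceVersion; // last seen version; null means "start from a fresh list"

    public async Task RunAsync(IKubernetes client, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var closed = new TaskCompletionSource<bool>();

            var resp = client.ListNamespacedPodWithHttpMessagesAsync(
                "default", watch: true, resourceVersion: resourceVersion, cancellationToken: token);

            using (resp.Watch<V1Pod, V1PodList>(
                (type, item) => resourceVersion = item.Metadata.ResourceVersion,
                onError: e => { resourceVersion = null; closed.TrySetResult(true); },  // includes "410 Gone"
                onClosed: () => closed.TrySetResult(true)))
            {
                await closed.Task; // then loop: resume from resourceVersion, or re-list after an error
            }
        }
    }
}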
