
Watch Method stops watching #533

Closed
pbalexlear opened this issue Dec 6, 2020 · 49 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@pbalexlear

When using a watch method, it appears to stop watching after a certain time period. I'm not sure if this is a task deadlock or a stream timeout, but after around 30 minutes new events are no longer triggered even if changes are made on the cluster. The watch object also says it's still watching, but when debugging and looking at the tasks, they are awaiting the StreamReader's ReadLine call.

@pbalexlear
Author

Having spent a lot of time looking into this, I believe it's due to a bug in the SslStream, potentially a deadlock, but essentially the await calls on the read stream method just hang indefinitely and never get called back.

@tg123
Member

tg123 commented Dec 7, 2020

Did your server close the connection?

@pbalexlear
Author

Not that I can see; either way, were that the case I would expect an exception to be thrown, which doesn't happen.

@brendandburns
Contributor

It's also quite possible that something in the network path timed out. In many cases you need to send keep-alives to keep the TCP stream open (or you need to time out the watch at a lower number, like 5 minutes).
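A minimal sketch of that second suggestion (a short server-side watch timeout plus a reconnect loop), assuming the generated ListNamespacedPodWithHttpMessagesAsync overload exposes the standard timeoutSeconds list parameter:

using System;
using System.Threading;
using System.Threading.Tasks;
using k8s;
using k8s.Models;

internal static class ShortTimeoutWatch
{
    // Re-create the watch every few minutes so the connection is torn down and
    // rebuilt before an idle middlebox can silently drop it.
    public static async Task WatchForeverAsync(IKubernetes client, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var closed = new TaskCompletionSource<bool>();

            // timeoutSeconds asks the API server to end the watch after ~5 minutes.
            var podlistResp = client.ListNamespacedPodWithHttpMessagesAsync(
                "default", watch: true, timeoutSeconds: 300, cancellationToken: token);

            using (podlistResp.Watch<V1Pod, V1PodList>(
                (type, item) => Console.WriteLine($"{type} {item.Metadata.Name}"),
                onError: e => closed.TrySetResult(true),
                onClosed: () => closed.TrySetResult(true)))
            {
                await closed.Task; // wait for the server-side close, then loop and re-watch
            }
        }
    }
}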

@pbalexlear
Author

Maybe, but I'm using the default implementation of this client, and by default it isn't working; it doesn't throw an exception, it just seems to get stuck in an await cycle.
I did try changing the HttpClient inside this library to have an infinite timeout, but this didn't seem to change the behaviour.
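For reference, the infinite-timeout change mentioned above presumably amounts to something like the following (a hypothetical snippet; Timeout is the standard HttpClient property, not a setting this library exposes):

// Hypothetical: lift HttpClient's default 100-second request timeout.
// This only removes the client-side request timeout; it does nothing to keep an
// idle TCP connection alive, consistent with it not changing the behaviour.
client.HttpClient.Timeout = System.Threading.Timeout.InfiniteTimeSpan;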

@eskaufel

eskaufel commented Dec 9, 2020

@pbalexlear I think I have the same issue. Just out of curiosity, are you using Windows or Linux?

@pbalexlear
Author

@eskaufel

I was doing this on a Mac initially, but as the task debugger is better in Visual Studio on Windows, I tried on Windows with the same results and was able to see in more detail what was happening. I've since deleted the Windows VM, so I no longer have that information, but given this happens to me every time after maybe 5 minutes of no events, I would have thought it would be fairly easy to reproduce.

It does look like it's potentially a bug in SSL streams, as this behaviour doesn't happen when you do a kube proxy and watch locally over the HTTP address.

@pbalexlear
Author

Ideally there would be some exception thrown and then I could restart the watcher, but as there isn't, I cannot capture the event. And as the 'Watching' indicator is set in the same method that's stuck awaiting a task result, it never gets changed either, so I cannot tell whether a watcher truly is watching or not.

@tg123
Member

tg123 commented Dec 9, 2020

Could you please create a test case in the e2e project?

Then I will try to understand and fix it.

@pbalexlear
Author

@tg123 Could you explain a bit more what you mean by creating a test case in the e2e project? I'm happy to assist in any way I can, although I think the best way for you to understand this issue is to recreate it.

@tg123
Member

tg123 commented Dec 9, 2020

Create a test case for your bug
so I can easily repro it in my dev env to help you.

public void SimpleTest()

@adoprog

adoprog commented Dec 9, 2020

I have the same issue when using https://github.com/falox/csharp-operator-sdk, which is based on this project. It works for 5-10 minutes, then hangs - no errors thrown.

@pbalexlear
Author

@tg123 I have created a test here:
pbalexlear@42392ab
I couldn't use the mock API server, as it doesn't listen on HTTPS and this issue only occurs when watching an HTTPS endpoint; I think it's due to an issue in the SslStream class.

@pbalexlear
Author

Also, please be aware that in order to run this test there needs to be a namespace called 'testns' in the k8s cluster of your current config.

@tg123
Member

tg123 commented Dec 14, 2020

@pbalexlear I am trying to understand your case.

Here is what I got after cherry-picking your test case:
[screenshot of the test run]

Should it fail?

@pbalexlear
Author

Hello, it's not possible for the test to run in 1.7 minutes, as it has a 5-minute Task.Delay, which is when the issue occurs.

The scenario is supposed to be (a rough sketch of this as a test follows below):

  • connect to the API server
  • watch for events
  • wait for any existing events to come through
  • create a new pod
  • wait for the new pod event
  • wait for 5 minutes where no events come through
  • create a pod
    - a new event should come through but doesn't; this is the bug
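A rough sketch of that scenario as an xUnit test (illustrative only; the names and structure differ from the actual committed test, and it assumes the 'testns' namespace already exists):

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using k8s;
using k8s.Models;
using Xunit;

public class WatchIdleTests
{
    [Fact]
    public async Task WatchKeepsDeliveringEventsAfterLongIdle()
    {
        var config = KubernetesClientConfiguration.BuildConfigFromConfigFile();
        IKubernetes client = new Kubernetes(config);

        var eventReceived = new SemaphoreSlim(0);
        var watchTask = client.ListNamespacedPodWithHttpMessagesAsync("testns", watch: true);
        using var watcher = watchTask.Watch<V1Pod, V1PodList>((type, pod) => eventReceived.Release());

        // create a pod and confirm the watch sees it
        await CreatePod(client, "watch-test-1");
        Assert.True(await eventReceived.WaitAsync(TimeSpan.FromSeconds(30)));

        // idle long enough for the connection to be silently dropped
        await Task.Delay(TimeSpan.FromMinutes(5));

        // create another pod; with the bug present, no event ever arrives here
        await CreatePod(client, "watch-test-2");
        Assert.True(await eventReceived.WaitAsync(TimeSpan.FromSeconds(30)),
            "no watch event received after the idle period");
    }

    private static Task<V1Pod> CreatePod(IKubernetes client, string name) =>
        client.CreateNamespacedPodAsync(new V1Pod
        {
            Metadata = new V1ObjectMeta { Name = name },
            Spec = new V1PodSpec
            {
                Containers = new List<V1Container> { new V1Container { Name = "nginx", Image = "nginx" } },
            },
        }, "testns");
}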

@eskaufel

I have spent some time on this today. First, I cannot find any specific reason for this to happen, but I did find a similar case in the Java SDK: kubernetes-client/java#1370.
The solution there was to configure a keep-alive ping for the HttpClient. Unfortunately, this does not exist in .NET Core, only in .NET 5: dotnet/runtime#31198.

I am not sure this will be the solution, as I have not had time to confirm it.
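For what it's worth, the .NET 5 API referenced above looks roughly like this when wired up by hand (a sketch, not something this library exposed at the time; these are HTTP/2 pings, so they only help when the API server is reached over HTTP/2):

using System;
using System.Net.Http;

// HTTP/2 keep-alive pings, available on SocketsHttpHandler starting with .NET 5.
var handler = new SocketsHttpHandler
{
    KeepAlivePingDelay = TimeSpan.FromSeconds(30),    // ping after 30s with no traffic
    KeepAlivePingTimeout = TimeSpan.FromSeconds(10),  // tear the connection down if the ping is not answered
    KeepAlivePingPolicy = HttpKeepAlivePingPolicy.WithActiveRequests,
};
var httpClient = new HttpClient(handler);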

@pbalexlear
Author

Yes, I think there is a bug in .NET Core, particularly as this only happens over SSL streams.

@tg123
Member

tg123 commented Dec 16, 2020

Allow me some time;
I will take a deeper look at the weekend.

@tg123
Member

tg123 commented Dec 20, 2020

I still cannot repro this.
Could you please create an easy repro test tool based on mine for me to investigate?

Here is how I test it (latest master 0e3cb94):

$ cat /etc/issue
Ubuntu 20.04.1 LTS

$ dotnet --version
5.0.101

Server

$ minikube start
😄  minikube v1.15.1 on Ubuntu 20.04
✨  Using the docker driver based on existing profile

🧯  The requested memory allocation of 1921MiB does not leave room for system overhead (total system  memory: 2407MiB). You may face stability issues.
💡  Suggestion: Start minikube with less memory allocated: 'minikube start --memory=2200mb'

👍  Starting control plane node minikube in cluster minikube
🔄  Restarting existing docker container for "minikube" ...
🐳  Preparing Kubernetes v1.20.0 on Docker 19.03.8 ...
🔎  Verifying Kubernetes components...
🌟  Enabled addons: storage-provisioner, default-storageclass
🏄  Done! kubectl is now configured to use "minikube" cluster and "" namespace by default

Client code (from watch example)

   private static void Main(string[] args)
        {
            var config = KubernetesClientConfiguration.BuildConfigFromConfigFile();

            IKubernetes client = new Kubernetes(config);

            var podlistResp = client.ListNamespacedPodWithHttpMessagesAsync("default", watch: true);
            using (podlistResp.Watch<V1Pod, V1PodList>((type, item) =>
            {
                Console.WriteLine(DateTime.Now +   " == on watch event==");
                Console.WriteLine(type);
                Console.WriteLine(item.Metadata.Name);
                Console.WriteLine(DateTime.Now + ", ==on watch event==");
            }, onError: (exception =>
            {
                Console.WriteLine(exception);
            })))
            {
                Console.WriteLine("press ctrl + c to stop watching");

                var ctrlc = new ManualResetEventSlim(false);
                Console.CancelKeyPress += (sender, eventArgs) => ctrlc.Set();
                ctrlc.Wait();
            }
        }

create pod periodically (random delay)

kubectl run nginx-$RANDOM --image=nginx

Here is the log output:

press ctrl + c to stop watching
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-10583
12/20/2020 11:18:34 AM, ==on watch event==
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-22296
12/20/2020 11:18:34 AM, ==on watch event==
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-23595
12/20/2020 11:18:34 AM, ==on watch event==
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-12364
12/20/2020 11:18:34 AM, ==on watch event==
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-6853
12/20/2020 11:18:34 AM, ==on watch event==
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-3122
12/20/2020 11:18:34 AM, ==on watch event==
12/20/2020 11:18:34 AM == on watch event==
Added
nginx-27359
12/20/2020 11:18:34 AM, ==on watch event==

12/20/2020 11:19:06 AM == on watch event==
Modified
nginx-27359
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Deleted
nginx-27359
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Modified
nginx-6853
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Deleted
nginx-6853
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Modified
nginx-3122
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Deleted
nginx-3122
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Modified
nginx-23595
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Deleted
nginx-23595
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Modified
nginx-12364
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:06 AM == on watch event==
Deleted
nginx-12364
12/20/2020 11:19:06 AM, ==on watch event==
12/20/2020 11:19:07 AM == on watch event==
Modified
nginx-10583
12/20/2020 11:19:07 AM, ==on watch event==
12/20/2020 11:19:07 AM == on watch event==
Deleted
nginx-10583
12/20/2020 11:19:07 AM, ==on watch event==
12/20/2020 11:19:07 AM == on watch event==
Modified
nginx-22296
12/20/2020 11:19:07 AM, ==on watch event==
12/20/2020 11:19:07 AM == on watch event==
Deleted
nginx-22296
12/20/2020 11:19:07 AM, ==on watch event==
12/20/2020 11:19:29 AM == on watch event==
Added
nginx-8242
12/20/2020 11:19:29 AM, ==on watch event==
12/20/2020 11:19:29 AM == on watch event==
Modified
nginx-8242
12/20/2020 11:19:29 AM, ==on watch event==
12/20/2020 11:19:29 AM == on watch event==
Modified
nginx-8242
12/20/2020 11:19:29 AM, ==on watch event==
12/20/2020 11:19:32 AM == on watch event==
Modified
nginx-8242
12/20/2020 11:19:32 AM, ==on watch event==

12/20/2020 11:21:02 AM == on watch event==
Added
nginx-19397
12/20/2020 11:21:02 AM, ==on watch event==
12/20/2020 11:21:02 AM == on watch event==
Modified
nginx-19397
12/20/2020 11:21:02 AM, ==on watch event==
12/20/2020 11:21:02 AM == on watch event==
Modified
nginx-19397
12/20/2020 11:21:02 AM, ==on watch event==
12/20/2020 11:21:05 AM == on watch event==
Modified
nginx-19397
12/20/2020 11:21:05 AM, ==on watch event==
12/20/2020 11:27:22 AM == on watch event==
Added
nginx-11549
12/20/2020 11:27:22 AM, ==on watch event==
12/20/2020 11:27:22 AM == on watch event==
Modified
nginx-11549
12/20/2020 11:27:22 AM, ==on watch event==
12/20/2020 11:27:22 AM == on watch event==
Modified
nginx-11549
12/20/2020 11:27:22 AM, ==on watch event==
12/20/2020 11:27:25 AM == on watch event==
Modified
nginx-11549
12/20/2020 11:27:25 AM, ==on watch event==
12/20/2020 11:35:48 AM == on watch event==
Added
nginx-31585
12/20/2020 11:35:48 AM, ==on watch event==
12/20/2020 11:35:48 AM == on watch event==
Modified
nginx-31585
12/20/2020 11:35:48 AM, ==on watch event==
12/20/2020 11:35:48 AM == on watch event==
Modified
nginx-31585
12/20/2020 11:35:48 AM, ==on watch event==
12/20/2020 11:35:52 AM == on watch event==
Modified
nginx-31585
12/20/2020 11:35:52 AM, ==on watch event==
12/20/2020 11:38:45 AM == on watch event==
Modified
nginx-11549
12/20/2020 11:38:45 AM, ==on watch event==
12/20/2020 11:38:45 AM == on watch event==
Modified
nginx-19397
12/20/2020 11:38:45 AM, ==on watch event==
12/20/2020 11:38:45 AM == on watch event==
Modified
nginx-31585
12/20/2020 11:38:45 AM, ==on watch event==
12/20/2020 11:38:45 AM == on watch event==
Modified
nginx-8242
12/20/2020 11:38:45 AM, ==on watch event==
12/20/2020 11:38:46 AM == on watch event==
Modified
nginx-19397
12/20/2020 11:38:46 AM, ==on watch event==
12/20/2020 11:38:46 AM == on watch event==
Modified
nginx-11549
12/20/2020 11:38:46 AM, ==on watch event==
12/20/2020 11:38:46 AM == on watch event==
Modified
nginx-31585
12/20/2020 11:38:46 AM, ==on watch event==
12/20/2020 11:38:46 AM == on watch event==
Modified
nginx-8242
12/20/2020 11:38:46 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Modified
nginx-31585
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Deleted
nginx-31585
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Modified
nginx-8242
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Deleted
nginx-8242
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Modified
nginx-19397
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Deleted
nginx-19397
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Modified
nginx-11549
12/20/2020 11:38:56 AM, ==on watch event==
12/20/2020 11:38:56 AM == on watch event==
Deleted
nginx-11549
12/20/2020 11:38:56 AM, ==on watch event==

I did the same thing on Windows; can't repro there either.

dotnet --version
5.0.101

Microsoft Windows [Version 10.0.19042.685]

@eskaufel

Hi, strange that you cannot reproduce it. I reproduced it using the same example. My app is running inside an AKS service, but it also fails on Windows. I am pretty confident that the trigger is a long duration without any changes (about 5 min). I will publish an example later today or tomorrow.

@tg123
Member

tg123 commented Dec 20, 2020

I did a tcpdump on the connection to monitor what happened.

After ~30 min the client shut down the connection;
however, the onClosed event was triggered. Could you please check whether onClosed works on your side?
I think we can re-establish the connection when onClosed fires.

Please also run a tcpdump when you see the connection get stuck.

195	2189.998901	172.18.0.1	172.18.0.2	TCP	66	43106 → 8443 [FIN, ACK] Seq=1542 Ack=27282 Win=64128 Len=0 TSval=787227117 TSecr=2749129859
196	2189.999444	172.18.0.2	172.18.0.1	TLSv1.3	90	Application Data
197	2189.999534	172.18.0.1	172.18.0.2	TCP	54	43106 → 8443 [RST] Seq=1543 Win=0 Len=0

@eskaufel

eskaufel commented Dec 21, 2020

When onClosed is triggered, everything is good.

So I copied @tg123's approach with a minikube installation, and it worked just fine. Then I booted an AKS cluster, and it did not.

I expected nothing from the tcpdump, but I got a reset without onClosed being triggered.

07:20:21.453280 IP zim.40204 > 111.222.111.222.https: Flags [F.], seq 2544590575, ack 778394570, win 781, options [nop,nop,TS val 992367345 ecr 3598969883], length 0
07:20:22.466308 IP 111.222.111.222.https > zim.42448: Flags [S.], seq 3156301740, ack 922143489, win 65160, options [mss 1412,sackOK,TS val 3599329751 ecr 992368332,nop,wscale 7], length 0
.
.
.
07:21:30.276370 IP zim.42448 > 111.222.111.222.https: Flags [.], ack 59924, win 645, options [nop,nop,TS val 992436168 ecr 3599397561], length 0
07:25:41.074788 IP 111.222.111.222.https > zim.42448: Flags [R], seq 3156361353, win 0, length 0

Nothing more in the tcpdump after this point

I logged the status of watcher.Watching every minute

07:20 Is watching: True
07:20:Watcher_OnEvent: app1-forward-566d848454-2w282 - Added
07:21 Is watching: True
07:21:Watcher_OnEvent: app1-snowball-7f5c6bfcff-vq9g7 - Modified
07:22 Is watching: True
07:23 Is watching: True
07:24 Is watching: True
07:25 Is watching: True
07:26 Is watching: True
07:27 Is watching: True
.
.

@tg123
Member

tg123 commented Dec 21, 2020

See my later reply and workaround.

There are 2 issues now:

1. The AWS LB (maybe) sent an RST to close the connection after 5 min idle.

echo 100 > /proc/sys/net/ipv4/tcp_keepalive_time

Let's do this and see if it mitigates the issue.
You should see Flags [P.] in the tcpdump after turning on TCP keepalive.

2. The underlying connection did not honor the RST.

I do not have an AWS K8S cluster handy (but Azure, lol).
Please take a connection status snapshot using netstat when you see the RST.

I will try to mirror the behavior of the RST and update my findings here.

@pbalexlear
Author

Hello, I have only experienced this issue when using AKS and an HTTPS connection. If you use the same setup I would expect you to be able to replicate it, as it happens every time for me with AKS and an HTTPS connection.

Thanks

Alex

@tg123
Member

tg123 commented Dec 22, 2020

I created an AKS cluster and tested.
This issue is due to the LB sending an RST on the connection after 5 min idle while the client side still shows it as ESTABLISHED.

We can set TCP keep-alive to stop the remote side from kicking us off.

Set the TCP keep-alive params:

$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
10

$ cat /proc/sys/net/ipv4/tcp_keepalive_time
10

Here is my workaround, .NET 5 only due to SocketsHttpHandler.
I am looking at whether we can find a solution to set TCP keep-alive in netstandard;
however, I do recommend you update to .NET 5.

using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;
using System.Reflection;
using System.Threading;
using k8s;
using k8s.Models;


namespace watch
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            var config = KubernetesClientConfiguration.BuildConfigFromConfigFile();

            var client = new Kubernetes(config);

            {
                // Use reflection to reach the HttpClientHandler buried inside the
                // generated client's delegating handler chain.
                var hf = typeof(HttpMessageInvoker).GetField("_handler", BindingFlags.Instance | BindingFlags.NonPublic);
                var h = hf.GetValue(client.HttpClient);

                while (!(h is HttpClientHandler))
                {
                    h = ((DelegatingHandler) h).InnerHandler;
                }

                // Swap in a SocketsHttpHandler whose ConnectCallback opens the socket
                // itself, with TCP keep-alive enabled.
                var sh = new SocketsHttpHandler();
                sh.ConnectCallback = async (context, token) =>
                {
                    var socket = new Socket(SocketType.Stream, ProtocolType.Tcp)
                    {
                        NoDelay = true,
                    };

                    socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);

                    var ip = Dns.GetHostEntry(context.DnsEndPoint.Host).AddressList.First();
                    var ep = new IPEndPoint(ip, context.DnsEndPoint.Port);

                    // Console.WriteLine(ep);
                    await socket.ConnectAsync(ep, token).ConfigureAwait(false);


                    return new NetworkStream(socket, ownsSocket: true);
                };

                // Replace the HttpClientHandler's private underlying handler with the
                // keep-alive-enabled SocketsHttpHandler built above.
                var p = h.GetType().GetField("_underlyingHandler", BindingFlags.NonPublic | BindingFlags.Instance);
                p.SetValue(h, (sh));
            }

            {
                // Re-run the client's private initialization against the same config.
                var m = client.GetType().GetMethod("InitializeFromConfig", BindingFlags.NonPublic | BindingFlags.Instance);
                m.Invoke(client, new []{config});
            }


           var podlistResp = client.ListNamespacedPodWithHttpMessagesAsync("default", watch: true);
            using (podlistResp.Watch<V1Pod, V1PodList>((type, item) =>
            {
                Console.WriteLine(DateTime.Now + " == on watch event==");
                Console.WriteLine(type);
                Console.WriteLine(item.Metadata.Name);
                Console.WriteLine(DateTime.Now + ", ==on watch event==");
            }, onError: (exception =>
            {
                Console.WriteLine(exception);
            }), onClosed: () =>
            {
                Console.WriteLine("close");
            }))
            {
                Console.WriteLine("press ctrl + c to stop watching");

                var ctrlc = new ManualResetEventSlim(false);
                Console.CancelKeyPress += (sender, eventArgs) => ctrlc.Set();
                ctrlc.Wait();
            }
        }
    }
}

Log update after 10 min idle:

...
nginx-4176
12/22/2020 8:26:35 AM, ==on watch event==
12/22/2020 8:26:35 AM == on watch event==
Added
nginx-11719
12/22/2020 8:26:35 AM, ==on watch event==
12/22/2020 8:35:56 AM == on watch event==
Added
nginx-22531
12/22/2020 8:35:56 AM, ==on watch event==
12/22/2020 8:35:56 AM == on watch event==
...

@chengaofeng

kube-apiserver has an option --min-request-timeout (int, default: 1800); to spread out load, the API server actively times out these connections.

@pbalexlear
Author

Thanks @tg123, I am happy to upgrade to .NET 5. Will this fix be added to a release?

@eskaufel

@tg123 great work; sorry for being so quiet, but Christmas...
I will test your patch in a few days. I am still confused as to why the client thinks the connection is still active.

@tg123
Member

tg123 commented Jan 31, 2021

@roblapp did you try my workaround?

@eskaufel

eskaufel commented Feb 3, 2021

@roblapp I followed the changes suggested by @tg123 and got it running stably. There might still be an issue when running it in a hosted service. In our case, the root cause was the Azure LB, not AWS. To make this bomb-proof, I added a timer: when the duration between events passes the TTL limit, we recreate the watch (a sketch of this follows below). This only kicks in in stale environments, but it provides another layer of stability, as it will resolve any network issue causing a half-open connection. It is not beautiful :)
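A minimal sketch of that watchdog idea (illustrative only; the TTL value is arbitrary and RestartWatch() is a placeholder for whatever disposes the stuck watcher and creates a new one):

using System;
using System.Threading;
using k8s;
using k8s.Models;

// Restart the watch whenever no event has been seen for longer than the TTL.
var ttl = TimeSpan.FromMinutes(4);   // shorter than the ~5 min idle that kills the connection
var lastEvent = DateTime.UtcNow;

// Pass this as the onEvent callback when creating the watch.
void OnWatchEvent(WatchEventType type, V1Pod pod)
{
    lastEvent = DateTime.UtcNow;
    // ... handle the event ...
}

var watchdog = new Timer(_ =>
{
    if (DateTime.UtcNow - lastEvent > ttl)
    {
        lastEvent = DateTime.UtcNow;
        RestartWatch();              // hypothetical: dispose the old watcher and start a new one
    }
}, null, TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1));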

@roblappcoupa

roblappcoupa commented Feb 9, 2021

@eskaufel Are you passing the resource version in when establishing the watch? I added support for it, but then I wasn't sure how you were supposed to obtain it initially. I'd be very curious to see your solution, even if it's in pseudocode; I am interested in seeing how you are doing it. My watch functionality was kicked off from an IHostedService in ASP.NET Core like so:

while (!cancellationToken.IsCancellationRequested)
{
    try
    {
        this.logger.LogInformation("[Monitor] Creating Kubernetes Client");

        using (var kubernetesClient = this.kubernetesClientFactory.CreateKubernetesClient())
        {
            // See https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
            this.logger.LogInformation("[Monitor] Watching namespace {Ns}", this.options.Value.Kubernetes.ExecutionNamespace);
            await this.TryWatchJobs(kubernetesClient, null, cancellationToken);
            this.logger.LogInformation("[Monitor] Stopped watching namespace {Ns}", this.options.Value.Kubernetes.ExecutionNamespace);
        }
    }
    catch (Exception exception)
    {
        this.logger.LogError(exception, "[Monitor] An exception was thrown during a monitor loop iteration");
    }
    finally
    {
        // If cancellation has been set, do not bother waiting
        if (!cancellationToken.IsCancellationRequested)
        {
            await Task.Delay(this.options.Value.Monitoring.MonitorLoopReconnectDelay, cancellationToken);
        }
    }
}

This code calls this method:

private Task TryWatchJobs(IKubernetes kubernetesClient, string resourceVersion, CancellationToken cancellationToken)
{
	try
	{
		var connectionClosedResetEvent = new ManualResetEvent(false);

		this.logger.LogInformation("Getting job Task for Watch");

		// Note: this task gets awaited within the Watch<T, K>(...)
		var jobs = kubernetesClient.ListNamespacedJobWithHttpMessagesAsync(
			namespaceParameter: this.options.Value.Kubernetes.ExecutionNamespace,
			watch: true,
			resourceVersion: resourceVersion,
			cancellationToken: cancellationToken);

		this.logger.LogInformation("Starting Kubernetes Watch");

		using (jobs.Watch<V1Job, V1JobList>(
			(watchEventType, v1Job) =>
			{
				this.OnWatchEvent(watchEventType, v1Job);
			},
			exception =>
			{
				this.OnWatchException(exception);
			},
			() =>
			{
				this.OnWatchClosed();
				connectionClosedResetEvent.Set();
			}))
		{
			this.logger.LogInformation("Invoking WaitHandle.WaitAny");
			
			var indexOfOperationThatChangedState = WaitHandle.WaitAny(new[] { cancellationToken.WaitHandle, connectionClosedResetEvent });

			switch (indexOfOperationThatChangedState)
			{
				case 0:
					this.logger.LogInformation("WaitHandle.WaitAny returned. Reason: cancellation was signaled");
					break;
				case 1:
					this.logger.LogInformation("WaitHandle.WaitAny returned. Reason: watch HTTP connection was closed");
					break;
				default:
					this.logger.LogInformation("WaitHandle.WaitAny returned. Reason: could not be determined");
					break;
			}
		}
	}
	catch (Exception exception)
	{
		this.logger.LogWarning(
			exception,
			"An Exception was thrown on the Kubernetes Watch");
	}
	finally
	{
		this.logger.LogInformation("Completed Kubernetes Watch");
	}

	return Task.CompletedTask;
}

If anyone sees something wrong with this, please let me know and I will be eternally grateful. I ended up moving to a polling-based solution, which works for my use case, but I would much rather get the watch working correctly.

P.S. It might be worth noting that my cluster is in EKS. I am curious to see if that is going to be a commonality for those experiencing this issue.

@roblappcoupa

roblappcoupa commented Feb 9, 2021

@tg123 I did not try that solution yet. I may give it a shot within the next week or so. Thanks!

P.S. It might be worth noting that my cluster is in EKS. I am curious to see if that is going to be a commonality for those experiencing this issue.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 9, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@paulsavides

Hello, I am encountering something very similar to this issue using version 5.0.13 of the KubernetesClient on .NET 5.

Could somebody clarify whether this issue is expected to be fixed or not? If not, I can try to dig deeper and determine whether it is the same issue.

@roblappcoupa

roblappcoupa commented Jul 27, 2021

@paulsavides Unfortunately, I kept running into this issue. The watch would disconnect and there was nothing to indicate what happened. After days of not being able to resolve it, I abandoned the watch functionality completely in favor of polling. I posted the code that I was using to perform the watch here. In my new solution, I have a background process that polls the V1Job API on a 15-second interval. For my use case this is good enough, as I don't need to capture events instantly. For testing, I simulated thousands of requests per minute against the API server and saw no noticeable impact on the cluster or the Kubernetes API server.

Unfortunately, I don't think a solution was found, nor was I able to figure it out, which is why I changed my approach to polling.
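For completeness, the polling fallback described above boils down to something like this (a sketch; the real service presumably diffs the returned jobs against previously seen state):

using System;
using System.Threading;
using System.Threading.Tasks;
using k8s;
using k8s.Models;

internal static class JobPoller
{
    // Poll the Jobs API on a fixed interval instead of holding a long-lived watch.
    public static async Task PollJobsAsync(IKubernetes client, string ns, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            V1JobList jobs = await client.ListNamespacedJobAsync(ns, cancellationToken: token);
            foreach (var job in jobs.Items)
            {
                // compare against previously seen state and react to changes here
                Console.WriteLine($"{job.Metadata.Name}: active={job.Status?.Active ?? 0}");
            }

            await Task.Delay(TimeSpan.FromSeconds(15), token);
        }
    }
}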

@karok2m

karok2m commented Aug 1, 2022

@tg123, is the modification to the underlying message handler still required (referring to #533 (comment)) to enable TCP keep-alive messages? Or is it enabled by default as part of creating the Kubernetes client?

@tg123
Member

tg123 commented Aug 1, 2022

@karok2m it is enabled by default on net5+.
Please make sure your API server supports HTTP/2 (just check that it is https://).

@samfullman

I have not resolved this issue myself, but if it helps anyone, this code is a workaround when using the CLI/kubectl:

slm-watch(){
    if [[ -z $1 ]]; then
        echo "\nCurrent context has these CRDs:"
        i=-1
        kubectl get crd -A | awk '{ print $1 }' | while read line; do
            i=$((i + 1))
            if [[ $i -eq 0 ]]; then
                continue
            fi
            CHOICES[$i]=$line
            echo $i. $line;
        done
        echo
        read "CHOICE?Select a number between 1 and $i: "
        CHOICE=$CHOICES[$CHOICE]
    else
        CHOICE=$1
    fi
    echo OK, getting $CHOICE objects ...
    while true; do
        kubectl get $CHOICE -A -w | while read line; do
            echo "$line      $(date +%T)"
        done
        echo "\nDamn, lost the watcher! What happened? ..."
        sleep 2
    done
}
# example: `slm-watch storagesnapshots`

Quite obviously a hack, and you lose the declarativeness because events "rebuild" on re-connection, but it works for me.

This is on a MacBook running minikube

@lknite
Contributor

lknite commented May 13, 2023

Is there a good solution for this now? (I'm currently seeing timeouts watching for changes to instances of a CRD.) In developing a controller, I want to perform activities in response to a watch event. I'm checking whether it's possible to have an infinite --min-request-timeout, but it occurs to me there may be a race condition. I would hope the API could reconnect as necessary and "catch up" on any missed events that occurred while reconnecting. If not, what's the algorithm for dealing with the watch functionality coming and going? (I am using C# on Linux in a pod.)

Does the API do this (reconnect with the last resource version)?
[screenshot attachment]
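Nothing in this thread suggests the client reconnects on its own, so the usual controller pattern (not specific to this library) is to track the last seen resourceVersion yourself, restart the watch from it, and fall back to a fresh list when the stored version is too old (the server reports 410 Gone). A rough sketch, shown for pods since a CRD's generated types follow the same shape:

using System;
using System.Threading;
using System.Threading.Tasks;
using k8s;
using k8s.Models;

internal class ResumableWatch
{
    private string resourceVersion; // last seen version; null means "start from a fresh list"

    public async Task RunAsync(IKubernetes client, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var closed = new TaskCompletionSource<bool>();

            var resp = client.ListNamespacedPodWithHttpMessagesAsync(
                "default", watch: true, resourceVersion: resourceVersion, cancellationToken: token);

            using (resp.Watch<V1Pod, V1PodList>(
                (type, item) => resourceVersion = item.Metadata.ResourceVersion,
                onError: e => { resourceVersion = null; closed.TrySetResult(true); },  // includes "410 Gone"
                onClosed: () => closed.TrySetResult(true)))
            {
                await closed.Task; // then loop: resume from resourceVersion, or re-list after an error
            }
        }
    }
}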
