watcher connection stops receiving events after some time #596
Let me correct myself. The keep-alives shown by tcpdump were not sent by the watcher; they are sent by the API server every 3 minutes:
Now comes the breaking point, roughly half an hour after the connection has been established. The watcher receives a couple of bytes from the server and immediately starts sending keep-alives every second. That's when the connection goes dead and stops receiving new events.
My keep-alive patch does not seem to change this behaviour. Either it's not the way to go or it is not complete and still missing something. For reference, this is the commit I have been testing: https://github.com/jkryl/javascript/commit/0fef4de1e7c1dacfa7da04f1e4f6e2047525c9f3#diff-8af1000c89b03ced37f439c61c5696c45e1e83a70cc07182feef6595123f0bad |
In general, I think relying on watches to stick around for greater than N minutes (where N is fairly small) is probably not the best bet. I think adding a timeout if an event hasn't been received in N minutes (you can tune N depending on your workload) is probably the right approach. The network is just too weird a place to expect that HTTP/TCP connections will stay alive for a long time. |
Yes, I tend to agree. It would be interesting to know what the golang watcher does, given that it's kind of the reference implementation for k8s watchers. I decided to work around the problem by doing exactly what you have suggested 🤞 |
Can you point out how to set the timeout? Would that be on the list function, e.g.
```js
listNamespacedPod(
  namespace,
  undefined, // pretty?: string,
  undefined, // allowWatchBookmarks?: boolean,
  undefined, // _continue?: string,
  undefined, // fieldSelector?: string,
  undefined, // labelSelector?: string,
  undefined, // limit?: number,
  undefined, // resourceVersion?: string,
  undefined, // resourceVersionMatch?: string,
  300,       // timeoutSeconds?: number,
  true,      // watch?: boolean
)
```
|
@DocX I would place a timer in your own code, e.g. something like:
```js
function maybeRestartInformer() {
  if (noUpdates) {
    restartInformer();
  }
  setTimeout(maybeRestartInformer, <timeout>);
}
setTimeout(maybeRestartInformer, <timeout>);
```
or something like that. |
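Fleshed out a bit, that pattern might look like the following (a minimal sketch in TypeScript, assuming the 0.x `makeInformer` API from `@kubernetes/client-node`; the 30-minute threshold, the pod path, and the helper names are illustrative choices, not part of the thread):
```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const coreApi = kc.makeApiClient(k8s.CoreV1Api);

const IDLE_TIMEOUT_MS = 30 * 60 * 1000; // arbitrary: restart after 30 idle minutes
let lastEventAt = Date.now();

function createInformer() {
  const informer = k8s.makeInformer(
    kc,
    '/api/v1/namespaces/default/pods',
    () => coreApi.listNamespacedPod('default'),
  );
  // Any event is proof that the watch connection is still delivering data.
  const touch = () => { lastEventAt = Date.now(); };
  informer.on('add', touch);
  informer.on('update', touch);
  informer.on('delete', touch);
  informer.start().catch((err) => console.error('informer failed to start', err));
  return informer;
}

let informer = createInformer();

function maybeRestartInformer(): void {
  if (Date.now() - lastEventAt > IDLE_TIMEOUT_MS) {
    // Note: at the time of this thread informers had no stop() (see #604);
    // newer client versions allow stopping the old informer before replacing it.
    informer = createInformer();
    lastEventAt = Date.now();
  }
  setTimeout(maybeRestartInformer, IDLE_TIMEOUT_MS);
}
setTimeout(maybeRestartInformer, IDLE_TIMEOUT_MS);
```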
@jkryl the relevant code starts here: https://github.com/kubernetes/client-go/blob/master/rest/request.go#L674 There's also a big discussion in kubernetes/kubernetes#65012 (comment). The consensus seems to be switching to HTTP/2 and sending Pings from the client. |
@brendandburns This might be slightly off the topic of this issue, but can you give any advice on how to properly restart an informer? Informers don't seem to have a stop method. |
@DocX: As @brendanburns suggests, I have special code in my watcher wrapper that restarts the watcher after it has been idle for too long. Snippet from my current code:
The full implementation, specific to my use case, is here: https://github.com/openebs/Mayastor/blob/develop/csi/moac/watcher.ts#L225 |
Yes, there is a need for a proper stop method 👍 |
@dominykas that's a fair request. I filed #604 to track. A PR would be welcome, or I will get to it eventually. |
Thanks 🙇 I'll see if I can catch a break to PR this soon enough. Guidelines on how you'd approach it are most welcome; I have my own ideas, but I'm not sure about the alternatives. |
I would add some sort of intercept/break in the timeout handler code path, e.g. `if (aborted) { return; }` or some such. |
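A sketch of what that guard could look like (hypothetical names and structure, just to illustrate the shape of the fix, not the client's actual internals):
```typescript
// Hypothetical sketch of an abortable reconnect loop.
class WatchLoop {
  private aborted = false;

  stop(): void {
    // Called by the consumer; prevents any pending timeout from reconnecting.
    this.aborted = true;
  }

  private scheduleReconnect(delayMs: number): void {
    setTimeout(() => {
      if (this.aborted) {
        return; // the intercept suggested above: bail out instead of reconnecting
      }
      this.connect();
    }, delayMs);
  }

  private connect(): void {
    // ... establish the watch and call scheduleReconnect() when it ends ...
  }
}
```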
Not sure if this is relevant, but just FYI: specifying the timeout seconds on the list function, as shared above, seems to have gotten rid of the problem for us. |
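For anyone driving the raw `Watch` class rather than the list function, a similar effect can be approximated by passing `timeoutSeconds` as a watch query parameter and restarting in the done callback. This is a sketch against the 0.x `Watch.watch(path, queryParams, callback, done)` signature; the pod path and the 300-second/1-second values are illustrative:
```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const watch = new k8s.Watch(kc);

async function watchPods(namespace: string): Promise<void> {
  await watch.watch(
    `/api/v1/namespaces/${namespace}/pods`,
    { timeoutSeconds: 300 }, // ask the API server to close the watch after 5 minutes
    (type, obj) => {
      console.log(`${type} ${obj?.metadata?.name}`);
    },
    (err) => {
      // The done callback fires when the server closes the connection or an
      // error occurs; start a fresh watch so events keep flowing.
      if (err) {
        console.error('watch ended with error:', err);
      }
      setTimeout(() => watchPods(namespace), 1000);
    },
  );
}

watchPods('default').catch((err) => console.error(err));
```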
@brendanburns We've seen a similar issue with the Informer API in AKS. After a short period of time, depending on how big the cluster was, the Informer stops receiving events. I've raised a ticket with Azure support, and they came back with some findings that might be helpful:
Indeed, there is no keepalive in Informer requests. Hope this helps :) |
Can you check and test whether #630 fixes your issue? |
@dkontorovskyy I pushed a 0.14.2 release which contains @bacongobbler's keep-alive fix. |
I got this error exactly after 10 minutes (v0.14.3, OpenShift version 4.5.35)
and also for the informer
Node v15.9.0 |
We are experiencing the same issue after about an hour (v0.14.3). |
We noticed a similar issue and verified it was solved by adding the |
We upgraded Node from version 12 to 15 and the issue seems to be gone. Not sure why, but maybe this helps :-) |
Interesting find @mariusziemke. Can anyone else confirm upgrading to node 15 works for them as well? |
We tried Node 16 and it also works... |
We use Node |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
/remove-lifecycle rotten |
Just as a data point, I have exactly that issue as well. Node |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
## Description
Under specific conditions the Kubernetes Watch will stop sending events to the client even though the connection seems healthy. This issue is well known. After researching, we found that tuning the resync interval is the best way to ensure the connection stays healthy. Note that this issue is very rare and is not likely to occur but if your environment has a lot of network pressure and constant churn it is a possibility.

## Related Issue
Fixes #765
Relates to kubernetes-client/javascript#596

## Type of change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging
- [x] Test, docs, adr added or updated as needed
- [x] [Contributor Guide Steps](https://docs.pepr.dev/main/contribute/contributor-guide/#submitting-a-pull-request) followed

---

Signed-off-by: Case Wylie <cmwylie19@defenseunicorns.com>
This is a known issue that some people have worked around by using a pull rather than a push method for receiving updates about changes. Another workaround is to restart the watcher every n minutes. As @brendandburns pointed out in my earlier PR576, the csharp k8s client suffers from the same problem: kubernetes-client/csharp#533.
My experience shows that it happens when the connection is idle for a long time. The connection is dropped without being closed, so the client keeps waiting for events and never receives any. I have seen it in Azure and Google Cloud with a managed k8s service.
The c# issue suggests that it happens because keepalives are not enabled on the underlying connection. And indeed, I found that this is the case for the JS k8s client too. It could be fixed by adding the keep-alive option to the "request" options, were it not for a bug in the request library. I have created a new ticket for it: request/request#3367. The request library has been deprecated and cannot be fixed, but I was able to work around the bug in the watcher's code. With my fix, the connections are kept alive: my experience shows that a TCP ACK is exchanged between client and server every three minutes. I would like the keep-alive to happen more often, to detect dead watcher connections in a more timely fashion, but it does not seem to be possible to tweak the keep-alive interval for a connection in nodejs: nodejs/node-v0.x-archive#4109.
The fix I have does not seem to solve the problem in all cases; that might be because a 3-minute keep-alive interval is not always sufficient. I will test the fix more thoroughly and update the ticket with the results.
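For illustration only (not the actual patch linked above), this is roughly how TCP keep-alive can be enabled on the socket behind a long-lived request in plain Node. The host, path, and token are placeholders, and only the delay before the first probe is configurable from JavaScript, not the probe interval itself:
```typescript
import * as https from 'https';

// Sketch: turn on TCP keep-alive for the socket backing a watch request.
// setKeepAlive(true, 30000) sends the first probe after 30s of idleness;
// the subsequent probe interval/count are OS-level settings Node cannot change.
const req = https.request({
  host: 'kubernetes.example.invalid',                  // placeholder API server
  path: '/api/v1/namespaces/default/pods?watch=true',  // example watch path
  headers: { Authorization: 'Bearer <token>' },        // placeholder credentials
});

req.on('socket', (socket) => {
  socket.setKeepAlive(true, 30 * 1000);
});

req.on('response', (res) => {
  // Each chunk is a JSON watch event streamed by the API server.
  res.on('data', (chunk) => process.stdout.write(chunk));
});

req.end();
```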