
watcher connection stops receiving events after some time #596

Closed
jkryl opened this issue Feb 25, 2021 · 37 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jkryl
Contributor

jkryl commented Feb 25, 2021

This is a known issue that some people have worked around by polling (pull) rather than relying on watch events (push). Another workaround is to restart the watcher every N minutes. As @brendandburns pointed out in my earlier PR #576, the C# Kubernetes client suffers from the same problem: kubernetes-client/csharp#533.

In my experience it happens when the connection has been idle for a long time. The connection is dropped without being closed, so the client keeps waiting for events that never arrive. I have seen this on Azure and Google Cloud with their managed Kubernetes services.

The C# issue suggests it happens because keepalives are not enabled on the underlying connection, and I found that the same is true for the JS Kubernetes client. That could be fixed by adding a keep-alive option to the "request" options, were it not for a bug in the request library, for which I have filed a new ticket: request/request#3367 . Since the request library has been deprecated and will not be fixed, I worked around the bug in the watcher's code, and with my fix the connections are kept alive. In my experience a TCP ACK is exchanged between client and server every three minutes. I would like keep-alives to happen more often so that dead watcher connections are detected in a more timely fashion, but it does not seem possible to tweak the keep-alive interval for a connection in Node.js: nodejs/node-v0.x-archive#4109 .

The fix does not seem to solve the problem in all cases, possibly because a 3-minute keep-alive interval is not sufficient everywhere. I will test it more thoroughly and update the ticket with the results.
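
For anyone who wants to experiment with the same idea, below is a minimal sketch of enabling TCP keep-alive on the socket underlying an outgoing Node.js request. It is not the patch referenced above; the API-server URL and the 30-second initial delay are placeholder values.

import * as https from 'https';

// Minimal sketch (not the actual patch): enable TCP keep-alive on the socket
// as soon as it is assigned to the request. The URL and the 30-second initial
// delay before the first probe are illustrative only; the probe interval
// itself is governed by OS-level TCP settings.
const req = https.request('https://127.0.0.1:6443/api/v1/pods?watch=true');
req.on('socket', (socket) => {
  socket.setKeepAlive(true, 30000);
});
req.on('response', (res) => {
  res.on('data', (chunk) => {
    // each chunk carries one or more watch events (newline-delimited JSON)
  });
});
req.end();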

@jkryl
Contributor Author

jkryl commented Feb 25, 2021

Let me correct myself: the keep-alives shown by tcpdump were not sent by the watcher; they are sent by the API server every 3 minutes:

...
11:32:59.239617 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [.], ack 1, win 501, options [nop,nop,TS val 2612916151 ecr 3766451319], length 0
11:32:59.239753 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19557, win 331, options [nop,nop,TS val 3766631543 ecr 2611652359], length 0
11:35:59.463662 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [.], ack 1, win 501, options [nop,nop,TS val 2613096375 ecr 3766631543], length 0
11:35:59.463762 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19557, win 331, options [nop,nop,TS val 3766811767 ecr 2611652359], length 0

The breaking point comes roughly half an hour after the connection has been established: the watcher receives a couple of bytes from the server and immediately starts sending keep-alives every second. That is when the connection goes dead and stops receiving new events.

11:38:09.258810 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [P.], seq 19557:19584, ack 1, win 501, options [nop,nop,TS val 2613226170 ecr 3766811767], length 27
11:38:09.258890 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19584, win 331, options [nop,nop,TS val 3766941562 ecr 2613226170], length 0
11:38:10.273184 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19584, win 331, options [nop,nop,TS val 3766942576 ecr 2613226170], length 0
11:38:10.273612 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [.], ack 1, win 501, options [nop,nop,TS val 2613227185 ecr 3766941562], length 0
11:38:11.297176 IP gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832 > 227.114.246.35.bc.googleusercontent.com.https: Flags [.], ack 19584, win 331, options [nop,nop,TS val 3766943600 ecr 2613227185], length 0
11:38:11.297533 IP 227.114.246.35.bc.googleusercontent.com.https > gke-jan-test2-default-pool-664e7812-gs7g.c.mayastor.internal.55832: Flags [.], ack 1, win 501, options [nop,nop,TS val 2613228209 ecr 3766941562], length 0
...

My keep-alive patch does not seem to change this behaviour; either it is not the right approach or it is incomplete and still missing something. For reference, this is the commit I have been testing: https://github.com/jkryl/javascript/commit/0fef4de1e7c1dacfa7da04f1e4f6e2047525c9f3#diff-8af1000c89b03ced37f439c61c5696c45e1e83a70cc07182feef6595123f0bad

@brendandburns
Contributor

In general, I think relying on watches to stick around for greater than N minutes (where N is fairly small) is probably not the best bet. I think adding a timeout if an event hasn't been received in N minutes (you can tune N depending on your workload) is probably the right approach.

The network is just too weird a place to expect that HTTP/TCP connections will stay alive for a long time.

@jkryl
Contributor Author

jkryl commented Mar 2, 2021

Yes, I tend to agree. It would be interesting to know what the Go watcher does, given that it is effectively the reference implementation for Kubernetes watchers. I decided to work around the problem by doing exactly what you suggested 🤞

@DocX
Contributor

DocX commented Mar 4, 2021

Can you point out how to set the timeout? Would that be on the list function, e.g.

listNamespacedPod(
  namespace, 
  undefined, // pretty?: string, 
  undefined, // allowWatchBookmarks?: boolean, 
  undefined, // _continue?: string, 
  undefined, // fieldSelector?: string,
  undefined, // labelSelector?: string, 
  undefined, // limit?: number, 
  undefined, // resourceVersion?: string, 
  undefined, // resourceVersionMatch?: string,
  300, // timeoutSeconds?: number, 
  true, // watch?: boolean
)

@brendandburns
Contributor

brendandburns commented Mar 6, 2021

@DocX I would place a timer in your own code, e.g. something like:

function maybeRestartInformer() {
    if (noUpdates) {
        restartInformer();
    }
    setTimeout(maybeRestartInformer, <timeout>);
}

setTimeout(maybeRestartInformer, <timeout>);

or something like that.
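
A slightly more concrete version of that idea, using an informer and an idle timer; the resource path, the 5-minute window, and the restart strategy are illustrative choices rather than recommendations:

import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const coreApi = kc.makeApiClient(k8s.CoreV1Api);

const listFn = () => coreApi.listNamespacedPod('default');
const informer = k8s.makeInformer(kc, '/api/v1/namespaces/default/pods', listFn);

// remember when the last event was seen
let lastEvent = Date.now();
const IDLE_TIMEOUT_MS = 5 * 60 * 1000;

const touch = () => {
  lastEvent = Date.now();
};
informer.on('add', touch);
informer.on('update', touch);
informer.on('delete', touch);

function maybeRestartInformer() {
  if (Date.now() - lastEvent > IDLE_TIMEOUT_MS) {
    // how best to restart is discussed further down in this thread;
    // calling start() again is only a placeholder for that step
    informer.start();
    lastEvent = Date.now();
  }
  setTimeout(maybeRestartInformer, IDLE_TIMEOUT_MS);
}

informer.start();
setTimeout(maybeRestartInformer, IDLE_TIMEOUT_MS);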

@brendandburns
Contributor

@jkryl the relevant code starts here:

https://github.com/kubernetes/client-go/blob/master/rest/request.go#L674

There's also a big discussion here:

kubernetes/client-go#374

and here

kubernetes/kubernetes#65012 (comment)

The consensus seems to be switching to HTTP/2 and sending Pings from the client.
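
For reference, client-side HTTP/2 pings in Node look roughly like the sketch below. This only illustrates the approach the linked discussions converge on; the JS client does not currently speak HTTP/2 to the API server, and the endpoint and 30-second interval are placeholders.

import * as http2 from 'http2';

// Sketch: periodic HTTP/2 PING frames as a liveness probe for a long-lived
// connection. Endpoint and interval are placeholders.
const session = http2.connect('https://kubernetes.default.svc:443');
const probe = setInterval(() => {
  session.ping((err, durationMs) => {
    if (err) {
      // no PING ACK: treat the connection as dead and re-establish the watch
      clearInterval(probe);
      session.destroy();
    } else {
      console.log(`API server round trip: ${durationMs} ms`);
    }
  });
}, 30_000);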

@jmickey

jmickey commented Mar 7, 2021

@brendandburns This might be slightly off the topic of this issue, but can you give any advice on how to properly restart an informer? Informers don't seem to have a stop() function, and I'm worried that just re-calling start() is going to cause some problems.

@jkryl
Contributor Author

jkryl commented Mar 8, 2021

@DocX: As @brendandburns suggests, I have code in my watcher wrapper that restarts the watcher after it has been idle for too long. A snippet from my current code:

_setIdleTimeout() {
  if (this.idleTimeout > 0) {
    this._clearTimer();
    this.timer = setTimeout(() => {
      this.stop();
      this.start();
    }, this.idleTimeout);
  }
}

_onEvent(event: string, cr: CustomResource) {
  ...
  this._setIdleTimeout();
  ...
}

Full implementation specific for my use case is here: https://github.com/openebs/Mayastor/blob/develop/csi/moac/watcher.ts#L225

@dominykas
Contributor

dominykas commented Mar 9, 2021

Informers don't seem to have a stop() function, and I'm worried that just re-calling start() is going to cause some problems.

makeInformer returns a ListWatch, which does have a stop() method, so it just works in the JS world? Quite possibly the Informer<T> interface should be updated to expose that in the TS world.

It seems, however, that stop() just makes the request abort, which in turn just restarts it. So yes, there is a need for a proper stop method 👍

@brendandburns
Contributor

@dominykas that's a fair request. I filed #604 to track.

A PR would be welcome, or I will get to it eventually.

@dominykas
Contributor

dominykas commented Mar 11, 2021

Thanks 🙇

I'll see if I can catch a break to PR this soon. Guidance on how you'd approach it is most welcome - I have my own ideas, but I'm not sure about alternatives.

@brendandburns
Contributor

I would add some sort of intercept/break in the timeout handler code path.

e.g.

if (aborted) { return; }

Or somesuch.

@DocX
Contributor

DocX commented Mar 12, 2021

Not sure if this is relevant, but just FYI: specifying the timeout seconds on the list function, as shared above, seems to have gotten rid of the problem for us.
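
For anyone using the Watch class rather than the list function, the equivalent is to pass timeoutSeconds through the query-parameters object. A hedged sketch; the pod path and the 300-second value are examples, and the done callback is where a restart would go:

import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();

// Sketch: ask the API server to end the watch after ~300 seconds so the
// client gets a definite signal (the done callback) instead of a silently
// dead connection. Path and timeout are example values.
const watch = new k8s.Watch(kc);
watch.watch(
  '/api/v1/namespaces/default/pods',
  { timeoutSeconds: 300, allowWatchBookmarks: true },
  (type, obj) => {
    console.log(`${type}: ${obj.metadata?.name}`);
  },
  (err) => {
    // the watch ended (timeout or error) -- re-establish it here
    console.log('watch ended', err);
  }
);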

@dkontorovskyy

@brendandburns We've seen a similar issue with the Informer API in AKS. After a short period of time, depending on how big the cluster is, the Informer stops receiving events, goes into an ECONNRESET loop, and eventually becomes unresponsive.

I've raised a ticket with Azure support. They came back with some findings that might be helpful:

  • From the traces, interestingly, we saw a lot of RSTs coming directly from the API server without even having any handshake.
  • It seems there is no keepalive? I.e. the client TCP socket should send a keepalive packet from time to time to confirm this is an active TCP stream; servers usually discard inactive TCP streams with an RST.

Indeed, there is no keepalive in Informer requests.

Hope this helps :)
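
If keep-alive ends up being added at the HTTP agent level, it looks roughly like this in Node (a sketch of the general mechanism, not necessarily what the eventual fix does; timings are placeholders):

import * as https from 'https';

// keepAlive: true makes the agent reuse sockets and enable TCP keep-alive on
// them; keepAliveMsecs is the initial delay before the first keep-alive probe.
// Whether and how a custom agent can be passed to the client library depends
// on the library version.
const agent = new https.Agent({
  keepAlive: true,
  keepAliveMsecs: 30000,
});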

@bacongobbler
Contributor

Can you check and test whether #630 fixes your issue?

@brendandburns
Contributor

@dkontorovskyy I pushed an 0.14.2 release which contains @bacongobbler 's keep-alive fix.

@huineng

huineng commented Apr 12, 2021

I got this error exactly after 10 minutes (v0.14.3, OpenShift version 4.5.35):

watch on resource pods.v1 failed: {"code":"ECONNRESET",
"name":"Error","message":"aborted","stack":"Error: aborted\n    
at connResetException (node:internal/errors:642:14)\n    
at TLSSocket.socketCloseListener (node:_http_client:424:27)\n    
at TLSSocket.emit (node:events:381:22)\n    at node:net:666:12\n    
at TCP.done (node:_tls_wrap:577:7)"

and also for the informer

Error: aborted
    at connResetException (node:internal/errors:642:14)
    at TLSSocket.socketCloseListener (node:_http_client:424:27)
    at TLSSocket.emit (node:events:381:22)
    at node:net:666:12
    at TCP.done (node:_tls_wrap:577:7) {
  code: 'ECONNRESET'
}

node v15.9.0

@mariusziemke

We are experiencing the same issue after about ~1 hour (v0.14.3). setKeepalive is probably not working because request/request#3367 has not been merged. Does anyone have a suitable workaround until we get HTTP/2 pinging working?

[Screenshot attached: 2021-04-30 13:37:08]

@hermansje

hermansje commented May 3, 2021

We noticed a similar issue and verified it was solved by adding the close handler to the request instead of the stream in Watch.watch:
https://github.com/kubernetes-client/javascript/blob/master/src/watch.ts#L120
Alternatively, changing close to end on that line also fixed it, which we tried because of this code in request.js.
As a cause for the request closing, we thought it could be the --min-request-timeout flag from kube-apiserver.
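
To illustrate the distinction being described (this is not the library's actual watch.ts code, just the two attachment points on a plain Node request):

import * as https from 'https';

const req = https.request('https://127.0.0.1:6443/api/v1/pods?watch=true', (res) => {
  res.on('data', (chunk) => {
    // watch events arrive here
  });
  // handler attached to the response stream
  res.on('end', () => console.log('response stream ended'));
});
// handler attached to the request object
req.on('close', () => console.log('request closed'));
req.end();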

@mariusziemke

We upgraded Node from version 12 to 15 and the issue seems to be gone. Not sure why, but maybe this helps :-)

@bacongobbler
Contributor

Interesting find @mariusziemke. Can anyone else confirm upgrading to node 15 works for them as well?

@ivanstanev

Interesting find @mariusziemke. Can anyone else confirm upgrading to node 15 works for them as well?

We tried Node 16 and it also works...

@DocX
Contributor

DocX commented May 17, 2021

We use Node 14.16.0 and have no issue with kubernetes-client 0.14.3 (also using watch, and with no additional timeout logic).

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 15, 2021
@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 14, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 12, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 14, 2022
@chengjianhua

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 31, 2022
@haraldschilly

Just as a data point: I have exactly this issue as well. Node v14.19.3, Kubernetes client 0.16.3, and GKE 1.23.5 and 1.23.6.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 12, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 12, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 11, 2022
@brendandburns brendandburns reopened this Jan 5, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 4, 2023
btlghrants pushed a commit to defenseunicorns/pepr that referenced this issue May 1, 2024
## Description

Under specific conditions, the Kubernetes Watch will stop sending events to the client even though the connection seems healthy. This issue is well known. After researching it, we found that tuning the resync interval is the best way to ensure the connection stays healthy.

Note that this issue is very rare and not likely to occur, but if your environment has a lot of network pressure and constant churn it is a possibility.

## Related Issue

Fixes #765
Relates to kubernetes-client/javascript#596

## Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [x] [Contributor Guide
Steps](https://docs.pepr.dev/main/contribute/contributor-guide/#submitting-a-pull-request)
followed

---------

Signed-off-by: Case Wylie <cmwylie19@defenseunicorns.com>