
cache: Fix connection leak. #576

Merged (1 commit, Jan 27, 2021)
Conversation

@jkryl commented Jan 11, 2021

Make sure that the done callback of a watcher isn't called more than once, and call abort() on the request object to close the connection when done.

It reimplements an earlier fix by @brendandburns for the problem of the watcher done callback being called more than once (PR #505) and adds the missing abort call on watcher connections (an alternative fix is currently under review in PR #575). It also undoes changes made by @jhagestedt in PR #526, which I don't understand (as I have commented in #526 (comment)).

Let me know what you think about it or what could be improved. Thanks!

The changes have been tested on a kubeadm cluster running in AWS and on Azure's managed Kubernetes service (AKS).
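The once-guard pattern described above can be sketched as follows. This is an illustrative sketch only; `wrapOnce` and its parameters are hypothetical names, not the actual code from this PR.

```typescript
type DoneCallback = (err?: Error) => void;

// Returns a callback that forwards only its first invocation and aborts the
// request, so the underlying watch connection is actually closed.
function wrapOnce(done: DoneCallback, abort: () => void): DoneCallback {
    let called = false;
    return (err?: Error) => {
        if (called) {
            return;
        }
        called = true;
        abort();   // close the connection
        done(err); // report success or failure exactly once
    };
}
```

Any later invocations (for example, an error event followed by a close event) are silently dropped, which is the behavior the PR description asks for.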

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 11, 2021
@k8s-ci-robot commented:
Welcome @jkryl!

It looks like this is your first PR to kubernetes-client/javascript 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-client/javascript has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 11, 2021
@jkryl commented Jan 12, 2021

If people would like to test the fixes that this PR delivers, they can use the client-node-fixed-watcher npm module, which is a drop-in replacement for @kubernetes/client-node. I would be interested to know if it brings any relief to people having problems with the connection leak.

this.path,
{ resourceVersion: list.metadata!.resourceVersion },
this.watchHandler.bind(this),
this.doneHandler.bind(this),
this.errorHandler.bind(this),
Contributor:

Please preserve the error handler here.

Author:

The done callback handles the error, as is conventional in Node.js code; there is no need to pass an error handler anymore. If the watcher ends with an error, the error will be the first argument of the done callback. IMHO, having a distinct callback just for errors makes usage of the watcher more complicated. Or is there a use case where an error callback is useful and cannot be replaced by a single done callback?

Contributor:

The web socket makes a distinction between the two cases:
OnError:
https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/onerror

OnClose:
https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/onclose

I'd like to preserve that distinction.

In particular, from the docs at least:

"A function or EventHandler which is executed whenever an error event occurs on the WebSocket connection."

An error can occur independent of a close event (which triggers done).

Author:

It is an abstraction that makes sense for the WebSocket where, as you have cited:

An error can occur independent of a close event (which triggers done)

In our case that cannot happen. If there is an error, it is always followed by close (done). By adopting the WebSocket model, we have to supply two callbacks instead of one and save state between the callbacks. Something like:

    private error: any;

    private async errorHandler(err: any): Promise<void> {
        this.error = err;
    }

    private async doneHandler(): Promise<void> {
        this.stop();
        if (this.error) {
            this.callbackCache[ERROR].forEach((elt: ObjectCallback<T>) => elt(this.error));
            return;
        }
        ...
    }

The current code calls callbackCache[ERROR] immediately in the error handler and introduces a new stopped variable, but the principle remains the same. If we compare that with a simple done callback taking an error argument, which is crystal clear and easy to understand, I wonder why we would deliberately introduce more complex code without it bringing us any advantage.
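For contrast, a self-contained sketch of the single-callback version being argued for here, using the same assumed names (ERROR, callbackCache) as the snippet above; this is illustrative, not the actual PR code. No saved error state or stopped flag is needed, because the error arrives as the first argument of done:

```typescript
type ObjectCallback<T> = (obj: T) => void;
const ERROR = 'error';

class MiniCache<T> {
    public callbackCache: { [key: string]: Array<ObjectCallback<any>> } = { [ERROR]: [] };
    public stopped = false;

    public stop(): void {
        this.stopped = true;
    }

    // The error, if any, simply arrives as the first argument of done.
    public async doneHandler(err?: any): Promise<void> {
        this.stop();
        if (err) {
            this.callbackCache[ERROR].forEach((elt: ObjectCallback<any>) => elt(err));
            return;
        }
        // ...restart the watch here...
    }
}
```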

Contributor:

Sorry, I misspoke; it's not a WebSocket in this case, it is a Node.js Stream:

https://github.com/kubernetes-client/javascript/blob/master/src/watch.ts#L82

but the idea is the same. The Stream has an error callback and a done callback, and they're distinct. I don't see much value in adding more code so that we can hide that (especially since Stream is a core Node.js class).

Author:

AFAIK the code as it is now does all that we need. As shown above, having an additional error callback implies more complexity in all consumers of the API. The watcher code itself remains as it is. It does either:

doneCb(err);

or with error callback

errorCb(err);
doneCb();

The error callback in the watcher does not make anything easier. We still need to handle error events from the underlying streams, and we still need to ensure that the error and done callbacks are called just once. Unless I overlooked something.

If the only reason is to make the watcher API more similar to the Node.js stream API, then even after adding an error callback they will be significantly different (events vs. callbacks; drain, finish, and pipe events which are not there; the number of methods that are missing). If we wanted to convert the watcher to a true stream module, that would be something else, and neat. Then we could do something like:

watcher.on('data', ...)
watcher.pipe(other-stream)
watcher.on('error', ...)

but adding an error callback without transforming the whole API to be stream-compliant seems like a meaningless step to me.

Contributor:

ok, I'm alright with this after further thought.

src/cache.ts (review comment outdated, resolved)
@brendandburns commented:

Thanks for the PR! I've got a couple of edits, but I think then we can merge it.

@jkryl commented Jan 12, 2021

@brendandburns thanks for looking at the changes! I have a few more comments.

I have changed the DefaultRequest implementation to return a stream rather than the request object directly, as I think it abstracts the code better from the request library. That said, I don't think it's easy to use with other implementations, because the parameter in the WebRequest interface is still an options object specific to the request library.

When I was running the patched library with these changes in Azure using AKS, I still had an issue with the watcher connection not receiving any events after 2 minutes or so. However, that is not the problem this PR is trying to fix.

@brendandburns commented:

Stream vs Request is fine with me.

For issues with Watch + AKS, please see the discussion here:
kubernetes-client/csharp#533

I think we need to send keep-alive pings.
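Keep-alive pings are not part of this PR, but the idea can be sketched with Node's built-in socket API. The function name and the 30-second interval below are illustrative assumptions, not the client's actual code.

```typescript
import * as net from 'net';

// Ask the OS to send TCP keep-alive probes on an otherwise idle watch
// connection, so that intermediate load balancers (as in AKS) do not drop
// the connection silently. The initial delay value is an arbitrary example.
function enableKeepAlive(socket: net.Socket, initialDelayMs: number = 30000): net.Socket {
    return socket.setKeepAlive(true, initialDelayMs);
}
```

In practice this would be attached when the request emits its socket, e.g. inside a 'socket' event handler on the outgoing request.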

if (err) {
error(err);
done();
} else if (response && response.statusCode !== 200) {
Contributor:

We need to preserve this code so that we correctly handle non-200 responses.

Author:

Yes. It is done now in the 'response' event handler. There we handle non-200 status codes and inject an error event into the stream if the status isn't OK. That abstracts the user (watcher) from handling different error types (error on the data stream, HTTP status error); all of them get handled the same way.
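A minimal sketch of that approach; PassThrough stands in for the real response stream, and the function name and error message are illustrative, not the actual PR code:

```typescript
import { PassThrough } from 'stream';

// On the 'response' event, a non-200 status is converted into a regular
// stream 'error' event, so consumers handle HTTP failures and network
// failures in exactly the same place.
function checkStatus(stream: PassThrough, statusCode: number): void {
    if (statusCode !== 200) {
        stream.emit('error', new Error(`watch request failed with status ${statusCode}`));
    }
}
```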

Contributor:

thanks, I see it now.


const req = this.requestImpl.webRequest(requestOptions, (err, response, body) => {
Contributor:

I don't think we want to eliminate this callback from the web request.

Author:

If the "response" event is used together with the error event on the stream, then you don't need the callback. The stream returned by webRequest should emit an error event if anything goes wrong (IIRC that's how got behaves: https://github.com/sindresorhus/got). So there is one place where the error is handled from the watcher's perspective, instead of handling errors in both the stream error event handler and the callback, which leads to ambiguity and makes it quite difficult to understand how the two error handlers relate to each other. More about proper error handling when using the request library is here: request/request#647. The conclusion of that very long discussion (if you scroll all the way down) is basically what I have implemented here.
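The consumer side of that single error path can be sketched like this (an assumed shape, not the actual watcher code): with errors emitted on the stream, the watcher only needs 'error' and 'close' listeners, guarded so that done fires exactly once whether the stream errors or simply closes.

```typescript
import { Readable, PassThrough } from 'stream';

// Observe a response stream and report its end through a single Node-style
// done callback; there is no separate request callback to reconcile.
function observe(stream: Readable, done: (err?: Error) => void): void {
    let finished = false;
    const finish = (err?: Error) => {
        if (!finished) {
            finished = true;
            done(err);
        }
    };
    stream.on('error', (err: Error) => finish(err));
    stream.on('close', () => finish());
}
```

Emitting 'error' followed by 'close' on the stream results in a single done(err) call.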

Contributor:

ack. thanks for clarifying.

@brendandburns commented:

Please address the stop() refactor comment, then I think this is ready to merge.

Make sure that done callback of a watcher isn't called more than
once and call abort() on request object to close the connection
when done.
@jkryl commented Jan 26, 2021

I have addressed the comment about stop(). I also added a few comments to make the code clearer. Thanks!

@brendandburns commented:

/lgtm
/approve

Thanks for the patience!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2021
@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brendandburns, jkryl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2021
@k8s-ci-robot k8s-ci-robot merged commit b872b08 into kubernetes-client:master Jan 27, 2021
@DocX commented Jan 28, 2021

Hi, does this also fix issue #494? If so, will it be released in a new version soon?

@jkryl commented Jan 29, 2021

I'm not sure, because I haven't tried to run the test case mentioned in that ticket, and I'm not using the informer in my environment (just cache + watcher). But I would say there is a high probability that it has been fixed. At least I don't see a way, at the watcher level, for the done callback to be called without the connection first being destroyed.

@DocX commented Feb 1, 2021

I will test it in our case from master.

The first thing that came up is that the RequestInterface.webRequest method now only accepts one argument, which breaks compatibility with our code. So that probably needs at least a minor version bump :)

@jkryl commented Feb 1, 2021

@DocX: yeah, normally that would be a major version change, unless the fact that the software is beta changes the normal rules. The idea is that webRequest() now returns a Node.js stream, and any error should be emitted via the error event on the stream, rather than mixing an error event with a callback for error handling as was done previously.

Since you hit this problem, it means you are using a non-default HTTP library (so not the request lib). I thought almost no one would do that, because the interface as it is now requires passing a request options object, and to do that you have to import types from the request library, or the HTTP lib you are using must have a compatible type. I find this a bit awkward. Just curious, what HTTP library do you use in your project?

@brendandburns commented:

We'll definitely release this as a minor release, along with the code for Kubernetes API 1.20. We will get it pushed soon-ish.

@DocX mentioned this pull request Feb 2, 2021
@DocX commented Feb 2, 2021

Just curious what http library do you use in your project?

We use request. But the issue with the type is actually minor. We had some code to debug issues with this library, and that code used the interface. It was not actual production code, so it was not really a problem.

I will test it in our case from master.

We ran it overnight yesterday, and it looks like the leak is gone. The network traffic is stable, and memory is also OK. Before, we used to get OOM-killed every 4 hours or so (depending on our memory limits); now it's all fine.

@dkontorovskyy commented:

👋 @brendandburns is there any plan on releasing this fix?

@brendandburns commented:

@jkryl @DocX @dkontorovskyy 0.14.0 including this fix was just pushed.

@dkontorovskyy commented:

@brendandburns highly appreciate it!
