
cache: Fix connection leak. #576

Merged (1 commit, Jan 27, 2021)
Conversation

@jkryl commented Jan 11, 2021

Make sure that the done callback of a watcher isn't called more than once, and call abort() on the request object to close the connection when done.

It reimplements an earlier fix by @brendandburns for the problem of the watcher done callback being called more than once (PR #505) and adds the missing abort call on watcher connections (an alternative fix is currently under review in PR #575). It also undoes changes made by @jhagestedt in PR #526, which I don't understand (as I have commented in #526 (comment)).

Let me know what you think about it or what could be improved. Thanks!

The changes have been tested on a kubeadm cluster running in AWS and on Azure's managed Kubernetes service (AKS).
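The once-guard pattern described above can be sketched as follows. This is an illustrative sketch only; `wrapOnce` and its parameters are hypothetical names, not the actual code from this PR.

```typescript
type DoneCallback = (err?: Error) => void;

// Returns a callback that forwards only its first invocation and aborts the
// request, so the underlying watch connection is actually closed.
function wrapOnce(done: DoneCallback, abort: () => void): DoneCallback {
    let called = false;
    return (err?: Error) => {
        if (called) {
            return;
        }
        called = true;
        abort();   // close the connection
        done(err); // report success or failure exactly once
    };
}
```

Any later invocations (for example, an error event followed by a close event) are silently dropped, which is the behavior the PR description asks for.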

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 11, 2021
@k8s-ci-robot commented:
Welcome @jkryl!

It looks like this is your first PR to kubernetes-client/javascript 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-client/javascript has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 11, 2021
@jkryl commented Jan 12, 2021

If people would like to test the fixes that this PR delivers, they can use the client-node-fixed-watcher npm module, which is a drop-in replacement for @kubernetes/client-node. I would be interested to know if it brings any relief to people having problems with the connection leak.

this.path,
{ resourceVersion: list.metadata!.resourceVersion },
this.watchHandler.bind(this),
this.doneHandler.bind(this),
this.errorHandler.bind(this),
Contributor:

Please preserve the error handler here.

Author:

The done callback handles the error, as is conventional in Node.js code; there is no need to pass an error handler anymore. If the watcher ends with an error, the error will be the first argument of the done callback. IMHO, having a distinct callback just for errors makes usage of the watcher more complicated. Or is there a use case where an error callback is useful and cannot be replaced by a single done callback?

Contributor:

The web socket makes a distinction between the two cases:
OnError:
https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/onerror

OnClose:
https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/onclose

I'd like to preserve that distinction.

In particular, from the docs at least:

"A function or EventHandler which is executed whenever an error event occurs on the WebSocket connection."

An error can occur independent of a close event (which triggers done).

Author:

It is an abstraction that makes sense for the WebSocket where, as you have cited:

An error can occur independent of a close event (which triggers done)

In our case that cannot happen. If there is an error, it is always followed by close (done). By adopting the WebSocket model, we have to supply two callbacks instead of one and save state between the callbacks. Something like:

    private error: any;

    private async errorHandler(err: any): Promise<void> {
        this.error = err;
    }

    private async doneHandler(): Promise<void> {
        this.stop();
        if (this.error) {
            this.callbackCache[ERROR].forEach((elt: ObjectCallback<T>) => elt(this.error));
            return;
        }
        ...
    }

The current code calls callbackCache[ERROR] immediately in the error handler and introduces a new stopped variable, but the principle remains the same. If we compare that with a simple done callback taking an error argument, which is crystal clear and easy to understand, I wonder why we would deliberately introduce more complex code without it bringing us any advantage.
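For contrast, a self-contained sketch of the single-callback version being argued for here, using the same assumed names (ERROR, callbackCache) as the snippet above; this is illustrative, not the actual PR code. No saved error state or stopped flag is needed, because the error arrives as the first argument of done:

```typescript
type ObjectCallback<T> = (obj: T) => void;
const ERROR = 'error';

class MiniCache<T> {
    public callbackCache: { [key: string]: Array<ObjectCallback<any>> } = { [ERROR]: [] };
    public stopped = false;

    public stop(): void {
        this.stopped = true;
    }

    // The error, if any, simply arrives as the first argument of done.
    public async doneHandler(err?: any): Promise<void> {
        this.stop();
        if (err) {
            this.callbackCache[ERROR].forEach((elt: ObjectCallback<any>) => elt(err));
            return;
        }
        // ...restart the watch here...
    }
}
```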

Contributor:

Sorry, I misspoke; it's not a WebSocket in this case, it is a Node.js Stream:

https://github.com/kubernetes-client/javascript/blob/master/src/watch.ts#L82

but the idea is the same. The Stream has an error callback and a done callback, and they're distinct. I don't see much value in adding more code so that we can hide that (especially since Stream is a core Node.js class).

Author:

AFAIK the code as it is now does all that we need. As shown above, having an additional error callback implies more complexity in all consumers of the API. The watcher code itself remains as it is. It does either:

doneCb(err);

or with error callback

errorCb(err);
doneCb();

The error callback in the watcher does not make anything easier. We still need to handle error events from the underlying streams, and we still need to ensure that the error and done callbacks are called just once. Unless I overlooked something.

If the only reason is to make the watcher API more similar to the Node.js stream API, then even after adding an error callback they will be significantly different (events vs. callbacks; drain, finish, and pipe events which are not there; the number of methods that are missing). If we wanted to convert the watcher to a true stream module, that would be something else, and neat. Then we could do something like:

watcher.on('data', ...)
watcher.pipe(other-stream)
watcher.on('error', ...)

but adding an error callback without transforming the whole API to be stream-compliant seems like a meaningless step to me.

Contributor:

ok, I'm alright with this after further thought.

src/cache.ts (review comment outdated, resolved)
@brendandburns commented:

Thanks for the PR! I've got a couple of edits, but I think then we can merge it.

@jkryl commented Jan 12, 2021

@brendandburns thanks for looking at the changes! I have a few more comments.

I have changed the DefaultRequest implementation to return a stream rather than the request object directly, as I think it abstracts the code better from the request library. That said, I don't think it's easy to use with other implementations, because the parameter in the WebRequest interface is still an options object specific to the request library.

When I was running the patched library with these changes in Azure using AKS, I still had an issue with the watcher connection not receiving any events after 2 minutes or so. However, that is not the problem this PR is trying to fix.

@brendandburns commented:

Stream vs Request is fine with me.

For issues with Watch + AKS, please see the discussion here:
kubernetes-client/csharp#533

I think we need to send keep-alive pings.
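Keep-alive pings are not part of this PR, but the idea can be sketched with Node's built-in socket API. The function name and the 30-second interval below are illustrative assumptions, not the client's actual code.

```typescript
import * as net from 'net';

// Ask the OS to send TCP keep-alive probes on an otherwise idle watch
// connection, so that intermediate load balancers (as in AKS) do not drop
// the connection silently. The initial delay value is an arbitrary example.
function enableKeepAlive(socket: net.Socket, initialDelayMs: number = 30000): net.Socket {
    return socket.setKeepAlive(true, initialDelayMs);
}
```

In practice this would be attached when the request emits its socket, e.g. inside a 'socket' event handler on the outgoing request.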

if (err) {
error(err);
done();
} else if (response && response.statusCode !== 200) {
Contributor:

We need to preserve this code so that we correctly handle non-200 responses.

Author:

Yes. It is done now in the 'response' event handler. There we handle non-200 status codes and inject an error event into the stream if the status isn't OK. That abstracts the user (watcher) from handling different error types (error on the data stream, HTTP status error); all of them get handled the same way.
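A minimal sketch of that approach; PassThrough stands in for the real response stream, and the function name and error message are illustrative, not the actual PR code:

```typescript
import { PassThrough } from 'stream';

// On the 'response' event, a non-200 status is converted into a regular
// stream 'error' event, so consumers handle HTTP failures and network
// failures in exactly the same place.
function checkStatus(stream: PassThrough, statusCode: number): void {
    if (statusCode !== 200) {
        stream.emit('error', new Error(`watch request failed with status ${statusCode}`));
    }
}
```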

Contributor:

thanks, I see it now.


const req = this.requestImpl.webRequest(requestOptions, (err, response, body) => {
Contributor:

I don't think we want to eliminate this callback from the web request.

Author:

If the "response" event is used together with the error event on the stream, then you don't need the callback. The stream returned by webRequest should emit an error event if anything goes wrong (IIRC that's how got behaves: https://github.com/sindresorhus/got). So there is one place where the error is handled from the watcher's perspective, instead of handling errors in both the stream error event handler and the callback, which leads to ambiguity and makes it quite difficult to understand how the two error handlers relate to each other. More about proper error handling when using the request library is here: request/request#647. The conclusion of that very long discussion (if you scroll all the way down) is basically what I have implemented here.
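The consumer side of that single error path can be sketched like this (an assumed shape, not the actual watcher code): with errors emitted on the stream, the watcher only needs 'error' and 'close' listeners, guarded so that done fires exactly once whether the stream errors or simply closes.

```typescript
import { Readable, PassThrough } from 'stream';

// Observe a response stream and report its end through a single Node-style
// done callback; there is no separate request callback to reconcile.
function observe(stream: Readable, done: (err?: Error) => void): void {
    let finished = false;
    const finish = (err?: Error) => {
        if (!finished) {
            finished = true;
            done(err);
        }
    };
    stream.on('error', (err: Error) => finish(err));
    stream.on('close', () => finish());
}
```

Emitting 'error' followed by 'close' on the stream results in a single done(err) call.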

Contributor:

ack. thanks for clarifying.

@brendandburns commented:

Please address the stop() refactor comment, then I think this is ready to merge.

Make sure that done callback of a watcher isn't called more than
once and call abort() on request object to close the connection
when done.
@jkryl commented Jan 26, 2021

I have addressed the comment about stop(). I also added a few comments to make the code clearer. Thanks!

@brendandburns commented:

/lgtm
/approve

Thanks for the patience!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2021
@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brendandburns, jkryl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2021
@k8s-ci-robot k8s-ci-robot merged commit b872b08 into kubernetes-client:master Jan 27, 2021
@DocX commented Jan 28, 2021

Hi, does this also fix issue #494? If so, will it be released in a new version soon?

@jkryl commented Jan 29, 2021

I'm not sure, because I haven't tried to run the test case mentioned in that ticket, and I'm not using the informer in my environment (just cache + watcher). But I would say there is a high probability that it has been fixed. At least I don't see a way, at the watcher level, for the done callback to be called without the connection first being destroyed.

@DocX commented Feb 1, 2021

I will test it in our case from master.

The first thing that came up is that the RequestInterface.webRequest method now only accepts one argument, which breaks compatibility with our code. So that probably needs at least a minor version bump :)

@jkryl commented Feb 1, 2021

@DocX: yeah, normally that would be a major version change, unless the fact that the software is beta changes the normal rules. The idea is that webRequest() now returns a Node.js stream, and any error should be emitted via the error event on the stream, rather than mixing an error event with a callback for error handling as was done previously.

Since you hit this problem, it means you are using a non-default HTTP library (so not the request lib). I thought almost no one would do that, because the interface as it is now requires passing a request options object, and to do that you have to import types from the request library, or the HTTP lib you are using must have a compatible type. I find this a bit awkward. Just curious, what HTTP library do you use in your project?

@brendandburns commented:

We'll definitely release this as a minor release, along with the code for Kubernetes API 1.20. We will get it pushed soon-ish.

@DocX mentioned this pull request Feb 2, 2021
@DocX commented Feb 2, 2021

Just curious what http library do you use in your project?

We use request. But the issue with the type is actually minor. We had some code to debug issues with this library, and that code used the interface. It was not actual production code, so it was not really a problem.

I will test it in our case from master.

We ran it overnight yesterday, and it looks like the leak is gone. The network traffic is stable, and memory is also OK. Before, we used to get OOM-killed every 4 hours or so (depending on our memory limits); now it's all fine.

@dkontorovskyy commented:

👋 @brendandburns is there any plan on releasing this fix?

@brendandburns commented:

@jkryl @DocX @dkontorovskyy 0.14.0 including this fix was just pushed.

@dkontorovskyy commented:

@brendandburns highly appreciate it!
