Connection leak in informer? #494
Here's a self-contained example that demonstrates the leak:

const k8s = require('@kubernetes/client-node')
const util = require('util')
const _ = require('lodash')
const { // initialize from environment variables
CASSANDRA_ENDPOINT_LABELS = 'app.kubernetes.io/component=database',
POD_NAMESPACE = 'cassandra'
} = process.env
const kc = new k8s.KubeConfig()
kc.loadFromDefault()
const k8sApi = kc.makeApiClient(k8s.CoreV1Api)
function startEndpointWatcher (labelSelector = '', namespace = POD_NAMESPACE) {
const informer = k8s.makeInformer(kc, `/api/v1/namespaces/${namespace}/endpoints?labelSelector=${labelSelector}`,
() => k8sApi.listNamespacedEndpoints(namespace, undefined, undefined, undefined, undefined, labelSelector))
_.forEach(['add', 'update', 'delete'], event => informer.on(event, updateEndpointsFn))
informer.on('error', err => {
console.log('Watcher ERROR event: \n', err, '\nRestarting Watcher after 5 sec...')
setTimeout(informer.start, 5000)
})
informer.start()
.then(() => console.log('HostIPs-Endpoint-watcher successfully started'))
.catch(err => console.log('HostIPs-Endpoint-watcher failed to start: \n', err))
}
async function updateEndpointsFn (endpointsObj) {
console.log('updateEndpointFn: ' + util.inspect(endpointsObj, {depth: null}))
}
startEndpointWatcher(CASSANDRA_ENDPOINT_LABELS)
I created a PR here: Any chance you can patch it and see if it fixes things? Thanks!
@brendanburns, I spent some time debugging the leak (I also tried #505, to no avail) and discovered that the problem is not that the request connection fails to be closed in the watch/informer/cache. I focused on figuring out why the informer was doubling connections and narrowed it down to an issue in the cache implementation. (It's not that it fails to disconnect and then creates an additional new connection.) The informer properly closes the connection once the k8s api-server disconnects; the issue is that right afterwards, two new connections are created. My initial hunch is that two different listeners each recreate a new watch once the original closes. I'm working on a proof of concept first. TODO: I'll submit a PR once I confirm the fix with more tests and, if I can find the time, figure out an accompanying test to add to test_cache.ts, which at the moment doesn't have one.
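To make that hunch concrete, here is a deliberately simplified sketch (a toy `FakeWatch` invented for illustration, not the actual client-node code) of how two listeners that both restart a watch on the same 'done' event would double the connection count on every cycle:

```typescript
// Toy illustration only: two listeners reacting to the same "connection
// closed" event each start a replacement watch, doubling the count per cycle.
import { EventEmitter } from 'events';

class FakeWatch extends EventEmitter {
  private static active = 0;

  start(): void {
    FakeWatch.active++;
    console.log(`active watch connections: ${FakeWatch.active}`);
    // Simulate the API server closing the watch after a while.
    setTimeout(() => {
      FakeWatch.active--;
      this.emit('done');
    }, 1000);
  }
}

function startWatch(): void {
  const w = new FakeWatch();
  // Listener 1: e.g. the cache restarting the watch when it ends.
  w.on('done', () => startWatch());
  // Listener 2: e.g. a second code path that also restarts on 'done'.
  w.on('done', () => startWatch());
  w.start();
}

startWatch(); // 1 concurrent connection, then 2, then 4, then 8, ...
```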
Hi. We are probably experiencing the same issue. Version We are not receiving any This is the memory usage: And perhaps useful/interesting is the number of Node's async resources: The drops are when the process is killed by the OOM killer and restarted. There are no changes in the traffic in Kubernetes (in fact this chart covers a weekend when nothing is happening). Hope this helps you pinpoint the issue.
Thanks for the extra data. You can clearly see the expected exponential growth due to two new async request objects being created for every one that expires. To make matters worse, when one event is sent by the Kubernetes API, the callback of every one of the ever-growing copies of watch instances gets executed. You can see in the first chart that even though the total number of events processed grows exponentially, the ratio between the event types (update/add/delete) is maintained. FYI: the Watch doesn't have this issue. If the complexity of your specific use of the informer is low, you can replace it with a Watch and restart it from the
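A minimal sketch of what that replacement could look like, restarting the watch when it ends (the namespace, path, and handlers below are placeholders, and the exact return value of watch() differs between client versions):

```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const watch = new k8s.Watch(kc);

function startPodWatch(namespace: string): void {
  watch.watch(
    `/api/v1/namespaces/${namespace}/pods`,
    {}, // query parameters, e.g. labelSelector
    (type, obj: k8s.V1Pod) => {
      // type is ADDED, MODIFIED or DELETED
      console.log(`${type}: ${obj.metadata?.name}`);
    },
    (err) => {
      // Called when the watch ends or errors; restart after a short delay.
      if (err) console.error('watch ended with error:', err);
      setTimeout(() => startPodWatch(namespace), 5000);
    }
  ).catch((err) => {
    console.error('failed to start watch:', err);
    setTimeout(() => startPodWatch(namespace), 5000);
  });
}

startPodWatch('default');
```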
@edqd a couple of things:
Thanks
I just pushed Please test that and see if it fixes things for you.
Thanks for releasing the new version However, it looks like it's a little slower to rise, but I can't tell for sure: Not sure if we are doing something wrong, though. This is the code we use to register the informer:

const kube_config = new k8s.KubeConfig();
kube_config.loadFromDefault();
const k8sApi = kube_config.makeApiClient(k8s.CoreV1Api);
const listFn = () => k8sApi.listNamespacedPod("scooter-launch");
const informer = k8s.makeInformer(
kube_config,
"/api/v1/namespaces/scooter-launch/pods",
listFn
);
informer.on("add", async (pod: k8s.V1Pod) => {
statsd.increment("informer.events_by_type", { event_type: "add" });
await this.handle_pod_change_event(pod);
});
// ... same for "update" and "delete"
informer.on("error", (err: k8s.V1Pod) => {
setTimeout(() => {
this.informer.start();
}, 5000);
});
@edqd It's a little hard to tell, but it doesn't look to me like you are getting any errors? If you are getting errors, it would be interesting to see if you are getting duplicate errors for some reason. In either case, I think the informer should be doing the right thing. Do you have monitoring for the number of active TCP connections from your code? That's the easiest way to see if the problem is fixed or not. If the number of connections is also growing exponentially, then the problem isn't resolved. If the number of connections is constant then it is a memory leak somewhere else.
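For reference, something like this is enough to sample that count from inside the pod (a rough sketch; it assumes the ss utility is available in the container image, and the 60-second interval is arbitrary):

```typescript
import { exec } from 'child_process';

function logEstablishedConnections(): void {
  // Count established TCP connections; swap in `netstat -tn` if ss is absent.
  exec('ss -tn state established', (err, stdout) => {
    if (err) {
      console.error('failed to run ss:', err);
      return;
    }
    // The first line is the header; every remaining line is one connection.
    const count = stdout.trim().split('\n').length - 1;
    console.log(`established TCP connections: ${count}`);
  });
}

// A healthy informer should keep this roughly constant over time.
setInterval(logEstablishedConnections, 60_000);
```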
So I dug into this a little more and I found a bug in the caching such that the cache was never emptied when a delete occurred after the connection re-connected. It definitely will cause a memory leak, especially if you are creating/deleting a lot of pods. I'll send a PR with the fix.
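For readers following along, the intended cache behaviour is roughly the following (a toy sketch, not the library's ListWatch implementation; the key format is an assumption): a DELETED event must remove the object by key even when it arrives after a reconnect, and a fresh list must drop anything deleted while the watch was disconnected.

```typescript
interface KubeObject {
  metadata?: { namespace?: string; name?: string };
}

const cache = new Map<string, KubeObject>();

const keyOf = (obj: KubeObject) =>
  `${obj.metadata?.namespace}/${obj.metadata?.name}`;

function onEvent(type: 'ADDED' | 'MODIFIED' | 'DELETED', obj: KubeObject): void {
  if (type === 'DELETED') {
    // If this removal is skipped after a reconnect, the cache grows forever.
    cache.delete(keyOf(obj));
  } else {
    cache.set(keyOf(obj), obj);
  }
}

function onRelist(objects: KubeObject[]): void {
  // A full re-list replaces the cache contents entirely.
  cache.clear();
  for (const obj of objects) {
    cache.set(keyOf(obj), obj);
  }
}
```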
See #572
@brendandburns Hi, thanks for the quick reply. I will also try to see if we can get network connection metrics. Thanks for the tip.
@DocX thanks for the additional data. Sadness that this doesn't seem to be fixed! I'll keep digging and seeing if I can add better tests...
@DocX one other thing is that this code is continuously re-listing from the start instead of resuming from the current resource version. I need to optimize that (it's been a long-standing 'todo').
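The general Kubernetes pattern being referred to looks roughly like this (a sketch, not the library's current behaviour; it omits handling of expired resource versions / HTTP 410, after which a full re-list is still required):

```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const watch = new k8s.Watch(kc);

let lastResourceVersion: string | undefined;

function startWatch(path: string): void {
  watch.watch(
    path,
    // Resume from the last seen resourceVersion instead of re-listing.
    lastResourceVersion ? { resourceVersion: lastResourceVersion } : {},
    (type, obj) => {
      lastResourceVersion = obj?.metadata?.resourceVersion ?? lastResourceVersion;
      // handle the event...
    },
    () => startWatch(path) // watch ended: pick up where we left off
  ).catch(() => setTimeout(() => startWatch(path), 5000));
}

startWatch('/api/v1/namespaces/default/pods');
```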
@DocX I just pushed
Hi @jc-roman, I am having the same issue here; could you provide a working example of using Watch as a replacement for the Informer?
Ok, taking another stab at this (since I can't seem to repro it locally): I sent #575, which explicitly aborts and deletes the connection before initiating a new watch. If someone can patch and try that, it would be appreciated.
Hi @brendanburns, I tried this approach (manually aborting or destroying the last active request) before you made this change, but unfortunately it didn't help.
The latest master with #576 seems to have fixed the leak. See the network and memory usage on our app: the spikes are from before, and after that the current master is deployed:
In the latest release of the library, maintainers addressed the issue that might be related to the elevated errors we were seeing. More info here kubernetes-client/javascript#494
@brendandburns Hello. What is the timeline for when we can expect a new release with the fix? Thank you :)
Awesome, thank you. No pressure :)
I just pushed
kubernetes-client/javascript version: 0.12.0
node version: v14.9.0
The number of connections to the Kubernetes API seems to double every 10 minutes. We've set up an endpoint informer to monitor when IPs change so we can update a service with external IPs. We use this to discover Cassandra cluster seeds in other regions.
I've tried explicitly closing the responses that I can, with no luck, so I'm assuming that the leak must be in the listNamespacedEndpoints call or somewhere in the informer. Here's logging from the pod:
Here's netstat output:
Any ideas as to what may be going on here?