Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Close etcd client, connections, and data streams after each use #3513

Merged

Conversation

vincepri
Copy link
Member

Signed-off-by: Vince Prignano vincepri@vmware.com

What this PR does / why we need it:

This PR fixes a bug where we were not closing the etcd clients after each use.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3482

/milestone v0.3.9
/kind bug

@vincepri
Copy link
Member Author

/assign @ncdc

@vincepri
Copy link
Member Author

cc @jdef @fabriziopandini

@jdef
Copy link
Contributor

jdef commented Aug 20, 2020

this addresses the first part of the problem. closing the etcd client isn't enough though because the dialer leaks a SPDY connection, as I also documented in the bug.

spdy multiplexes connections, and closing the etcd client will close an "inner" stream, but leaves the outer connection (the one that holds all the inner streams) remaining open. if this is not fixed, then the memory leak persists.

@jdef
Copy link
Contributor

jdef commented Aug 20, 2020

this addresses the first part of the problem. closing the etcd client isn't enough though because the dialer leaks a SPDY connection, as I also documented in the bug.

spdy multiplexes connections, and closing the etcd client will close an "inner" stream, but leaves the outer connection (the one that holds all the inner streams) remaining open. if this is not fixed, then the memory leak persists.

dataStream is created here:
https://github.com/kubernetes-sigs/cluster-api/blob/master/controlplane/kubeadm/internal/proxy/dial.go#L115

and then it's returned as the stream that the etcd client uses. but p (which produces dataStream) is left dangling and is never properly closed, even if dataStream is closed. my solution was to embed dataStream in a wrapper struct that satisfied the needed interface, and overloaded Close and Reset such that (effectively) p.Close was also called in such cases.

@ncdc
Copy link
Contributor

ncdc commented Aug 20, 2020

Sounds like we should fix dial.go so p isn't dangling/inaccessible

@ncdc
Copy link
Contributor

ncdc commented Aug 20, 2020

Maybe NewConn should take in p as well so it can close it

@jdef
Copy link
Contributor

jdef commented Aug 20, 2020

Maybe NewConn should take in p as well so it can close it

either way. i embedded the value of p in the struct that wraps the dataStream value (so the struct embeds the value of dataStream as an unnamed field, and p as a named field). and the just overloaded:

func (..) Close() error {... }
func (..) Reset() error { ... }

to also invoke p.Close() after closing the data stream. this keeps NewConn simple and focused on what it needs to do, and keeps the complexity of the wrapper in the dialer, which is already making decisions about how to deal w/ multiplexing.

also, in the dialer func, where it it returns early due to errors, it should close p before actually returning, otherwise it'll leak there too

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 20, 2020
@vincepri
Copy link
Member Author

@ncdc @jdef Take another look, I ended up cleaning up the functions as well given the confusing they created (very similar, but different purposes)

@vincepri
Copy link
Member Author

While having a connection and a stream tied together defeats the purpose of multiplexing, that's what we're doing today + leaking the connection

@vincepri
Copy link
Member Author

/milestone v0.3.9
/kind bug

@k8s-ci-robot k8s-ci-robot added this to the v0.3.9 milestone Aug 20, 2020
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 20, 2020
@jdef
Copy link
Contributor

jdef commented Aug 20, 2020

this looks functionally equiv to the fix that I made in my prototype

/approve

@ncdc
Copy link
Contributor

ncdc commented Aug 20, 2020

Reviewing
/hold
for comments

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 20, 2020
controlplane/kubeadm/internal/proxy/dial.go Outdated Show resolved Hide resolved
controlplane/kubeadm/internal/proxy/dial.go Outdated Show resolved Hide resolved
controlplane/kubeadm/internal/proxy/dial.go Outdated Show resolved Hide resolved
controlplane/kubeadm/internal/proxy/dial.go Show resolved Hide resolved
controlplane/kubeadm/internal/proxy/dial.go Outdated Show resolved Hide resolved
controlplane/kubeadm/internal/proxy/dial.go Outdated Show resolved Hide resolved
controlplane/kubeadm/internal/proxy/dial.go Show resolved Hide resolved
Signed-off-by: Vince Prignano <vincepri@vmware.com>
@ncdc
Copy link
Contributor

ncdc commented Aug 20, 2020

/lgtm

I'd like @randomvariable to review as well.

/assign @randomvariable

Please feel free to remove the hold if you're happy with it.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 20, 2020
@vincepri
Copy link
Member Author

/test pull-cluster-api-e2e

Copy link
Member

@fabriziopandini fabriziopandini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm for me.
only a question but not blocking

}

// Close the error stream right away, we're not writing to it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding: why are we creating an error string and then closing it immediately after?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my understanding is that you need to create the error stream in all cases for the websocket to function, even though for port-forwarding, it's not used. I could be wrong. It was based off of https://github.com/kubernetes/client-go/blob/master/tools/portforward/portforward.go#L324:L324

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The protocol I designed for port-forward, attach, and exec multiplexes different streams in a single connection. All of these require an error stream, which is used for communicating errors about the port-forward/attach/exec back to the client. With SPDY, either side (client or server) can close its write end of the stream. With the error stream here, the client never writes to it, so we need to create it and then "close" it, which really only closes the client's write half.

@randomvariable
Copy link
Member

looks good, thanks for deborking.

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 21, 2020
@vincepri
Copy link
Member Author

/retitle Close etcd client, connections, and data streams after each use

@k8s-ci-robot k8s-ci-robot changed the title 🐛 Close etcd clients after use Close etcd client, connections, and data streams after each use Aug 21, 2020
@vincepri vincepri changed the title Close etcd client, connections, and data streams after each use 🐛 Close etcd client, connections, and data streams after each use Aug 21, 2020
@fabriziopandini
Copy link
Member

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini, jdef

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 21, 2020
@k8s-ci-robot k8s-ci-robot merged commit 269a482 into kubernetes-sigs:master Aug 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

etcd client memory leak in controlplane/kubeadm/internal
6 participants