Fix goroutine leak when reconciling #539
This patch fixes the propagation of context cancellation through the call stack. It prevents leaks of channel and goroutine from the [terraform provider][provider_code].

Fixes: crossplane-contrib#538

[provider_code]: https://github.com/hashicorp/terraform-provider-google/blob/1d1a50adf64af60815b7a08ffc5e9d3e856d2e9c/google/transport/batcher.go#L117-L123

Signed-off-by: Maxime Vidori <maxime.vidori@gmail.com>
Thanks for the fix @IxDay 🙏 @erhancagirici reminded me that we had used |
Yes, I was editing my message to ask whether there was a reason for this choice. It was not obvious from the code, history, or comments.
We are starting to roll this out across our entire infrastructure, and we haven't noticed any issues so far. I am bumping this thread to move it forward.
@IxDay, I've been working on this issue from time to time. The fact that you haven't had any issues is great news. I've scheduled a meeting next week with a team member who has more experience with this problem. Having a memory leak greatly disturbs me. Now that I've addressed some of my urgent tasks, this issue will be at the top of my priority list.
Thank you @ulucinar for providing background information off-channel. We hypothesized that the implementation was ported from the Azure provider, which had context cancellation issues when the context was propagated. It is likely that the GCP provider would never have had any issues in the first place, even if the context had been propagated. We will run some simple tests on our side. If all goes well, we will merge.
Let me know if you need anything. We are really looking forward to this PR being merged.
/test-examples="examples/container/v1beta2/cluster.yaml"
/test-examples="examples/cloudplatform/v1beta1/serviceaccount.yaml"
/test-examples="examples/storage/v1beta2/bucket.yaml"
Many thanks for your effort in this PR @IxDay, LGTM.
Thanks @IxDay for your work. Your well-prepared reports made our jobs a lot easier. Most importantly thank you for your patience 🙏 We wish we were able to merge this important PR sooner.
I will take a look, but I will need a few days first, since I have other priorities at the moment.
This patch fixes the propagation of context cancellation through the call stack. It prevents channel and goroutine leaks originating in the Terraform provider.
Description of your changes
To fix this bug, we tracked the leak down to the underlying Terraform provider. Using pprof, we managed to isolate the responsible function in the provider code.
After adding pprof to our deployment, we noticed that 2 channels and 2 goroutines were created for each resource every time reconciliation kicked in. All the goroutines that never exited had the same stack trace:
As we can see, the goroutine is waiting for the `Done` channel of the parent context to close. However, in the controller bootstrapping we override the parent context with a `WithoutCancellation` context. Its implementation shows that the `Done` channel is now `nil`. A `nil` channel never closes, so the goroutine blocks forever, as showcased in this playground demo.
Fixes #538
I have:
Run make reviewable to ensure this PR is ready for review.
Added backport release-x.y labels to auto-backport this PR if necessary.
How has this code been tested