feat(controller): add stack dump for better debugging #2360

Closed
wants to merge 1 commit into from

loafoe
Contributor

@loafoe loafoe commented Jul 29, 2024

I encountered a nil error in the controller in the current version. Running tilt up does not immediately show where the issue occurs, so I added a stack dump using Go's runtime/debug package to make this visible.

@loafoe loafoe requested a review from a team as a code owner July 29, 2024 10:20

netlify bot commented Jul 29, 2024

Deploy Preview for docs-kargo-akuity-io ready!

Name Link
🔨 Latest commit 39cebf7
🔍 Latest deploy log https://app.netlify.com/sites/docs-kargo-akuity-io/deploys/66a945bc32c111000882ea1a
😎 Deploy Preview https://deploy-preview-2360.kargo.akuity.io

@loafoe loafoe force-pushed the chore/stack-on-panic branch from 5e91fa3 to d7359bd Compare July 29, 2024 10:24
@loafoe loafoe mentioned this pull request Jul 29, 2024
4 tasks
@krancour krancour modified the milestones: v0.9.0, v0.8.2 Jul 29, 2024
@krancour krancour changed the title chore: add stack dump for better debugging feat(controller): add stack dump for better debugging Jul 29, 2024
Contributor

@hiddeco hiddeco left a comment

I wonder if it would not be better to configure this by globally enabling recovery from panics (see: https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.18.4/pkg/controller#Options -> RecoverPanic), but I have to admit that I lack historical insight here (cc: @krancour).


codecov bot commented Jul 29, 2024

Codecov Report

Attention: Patch coverage is 5.00000% with 19 lines in your changes missing coverage. Please review.

Project coverage is 47.58%. Comparing base (aff97d8) to head (39cebf7).
Report is 55 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| cmd/controlplane/controller.go | 0.00% | 6 Missing ⚠️ |
| cmd/controlplane/api.go | 0.00% | 3 Missing ⚠️ |
| cmd/controlplane/garbage_collector.go | 0.00% | 3 Missing ⚠️ |
| cmd/controlplane/management_controller.go | 0.00% | 3 Missing ⚠️ |
| cmd/controlplane/webhooks.go | 0.00% | 3 Missing ⚠️ |
| internal/controller/promotions/promotions.go | 50.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2360      +/-   ##
==========================================
- Coverage   47.64%   47.58%   -0.07%     
==========================================
  Files         244      244              
  Lines       17392    17414      +22     
==========================================
  Hits         8287     8287              
- Misses       8694     8716      +22     
  Partials      411      411              


@krancour
Member

@hiddeco I would prefer that as well. I'm not sure why the current panic recovery is limited to this one spot.

@hiddeco
Contributor

hiddeco commented Jul 29, 2024

@loafoe do you feel up for making this suggested change?

It should be introduced at these points: https://github.com/search?q=repo%3Aakuity%2Fkargo%20NewManager&type=code. There, the ctrl.Options can take a Controller struct, which should have something along the lines of:

config.Controller{
	RecoverPanic: ptr.To(true),
}

as a value.

@loafoe
Contributor Author

loafoe commented Jul 30, 2024

Sure, I'll have a look

@loafoe
Contributor Author

loafoe commented Jul 30, 2024

@hiddeco @krancour I added the controller Config across the board. However, the panic test function now fails as the panic is not recovered. I lack insight into exactly what is happening. Also, I have no idea where to add the stack dump code, which was the original intent of this change. Any further pointers appreciated, thx.

@hiddeco
Contributor

hiddeco commented Jul 30, 2024

@loafoe the test fails because the wrapping of the panic now happens at the point where Reconcile is called, and is thus no longer captured by the test.

However, for this "special" use case, we shell out to a series of other binaries as part of the promotion mechanics. Given this, it would be OK to keep the panic logic in place to surface the error (and be able to persist it to the Kubernetes API).

With the new addition, we can now guarantee the controller will not end up in a crash loop if we make a mistake anywhere else.

@loafoe
Contributor Author

loafoe commented Jul 30, 2024

> However, for this "special" use case, we shell out to a series of other binaries as part of the promotion mechanics. Given this, it would be OK to keep the panic logic in place to surface the error (and be able to persist it to the Kubernetes API).

Okay, I'll bring the wrapper code back and do some linting, thanks

@loafoe loafoe force-pushed the chore/stack-on-panic branch from 3997f7e to 1f01140 Compare July 30, 2024 19:50
Signed-off-by: Andy Lo-A-Foe <andy.loafoe@gmail.com>
@loafoe loafoe force-pushed the chore/stack-on-panic branch from 1f01140 to 39cebf7 Compare July 30, 2024 19:57
@loafoe loafoe requested a review from hiddeco July 30, 2024 20:06
@loafoe
Contributor Author

loafoe commented Jul 30, 2024

@hiddeco fixed linter issues and all unit tests pass again

Contributor

@hiddeco hiddeco left a comment

Thanks for following up with the requested changes! 🙇

@hiddeco hiddeco requested a review from krancour July 30, 2024 23:50
@hiddeco hiddeco removed this from the v0.8.2 milestone Aug 2, 2024
@hiddeco hiddeco modified the milestones: v0.8.3, v0.8.4 Aug 2, 2024
@krancour krancour modified the milestones: v0.8.4, v0.8.5, v0.8.6, v0.9.0 Aug 16, 2024
@hiddeco
Contributor

hiddeco commented Aug 26, 2024

With the upgrade to controller-runtime v0.19.0 (via #2431), this PR has become obsolete, as panic recovery is now enabled by default upstream.

I want to thank you nonetheless for the work you put in @loafoe. 🙇

@hiddeco hiddeco closed this Aug 26, 2024
3 participants