Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: take multiple pprof snapshots from 'operator debug' #11514

Closed
tgross opened this issue Nov 16, 2021 · 6 comments · Fixed by #11938
Closed

cli: take multiple pprof snapshots from 'operator debug' #11514

tgross opened this issue Nov 16, 2021 · 6 comments · Fixed by #11938
Labels
good first issue help-wanted We encourage community PRs for these issues! theme/cli type/enhancement

Comments

@tgross
Copy link
Member

tgross commented Nov 16, 2021

A long-running nomad operator debug command gets logs and raft snapshots over a window of time, but only takes pprof profiles (for heap, goroutines, cpu profiling, etc.) at the start of the run. Most of the time this is enough, but being able to take several snapshots of profiles could be useful for debugging the internals over a period of time.

Example use cases:

  • See whether goroutines on chan receive or select states change their states or whether they continue to block. This can help Nomad engineers figure out whether a goroutine is blocked from a deadlock (bug) or just experiencing backpressure.
  • See whether heap profiles change over time. This can help debug memory leaks.
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Nov 16, 2021
@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Nov 16, 2021
@danishprakash
Copy link
Contributor

@tgross is this something a new contributor should pick up?

@tgross
Copy link
Member Author

tgross commented Nov 18, 2021

It's probably fairly straightforward to pick up. I think there's a bit of UX interaction to figure out here as we have an -interval flag currently being used for the interval of raft snapshots and that maybe we could reuse for this interval as well, but there's also the -pprof-duration flag that sets the window for the CPU profile portion of that. So maybe we want a separate -pprof-interval that defaults to whatever -interval is set to? Our support guru @davemay99 is one of the primary "customers" of this feature so let's @-mention him to get his thoughts on that.

Also, just a heads up we have a few other operator debug issues we're planning on working in the next couple weeks so there may be a little bit of churn in terms of needing to rebase on changes.

@mikenomitch mikenomitch removed this from Needs Roadmapping in Nomad - Community Issues Triage Nov 18, 2021
@mikenomitch mikenomitch added this to Needs Triage in Nomad - Community Issues Triage via automation Nov 18, 2021
@mikenomitch mikenomitch removed this from Needs Triage in Nomad - Community Issues Triage Nov 18, 2021
@danishprakash
Copy link
Contributor

Haven't yet finalized on the flag, will wait for @davemay99's suggestion for that but I went ahead and did a simple implementation using a new pprofInterval flag in line with interval and pprofDuration. I've reused code from collectPeriodic and there are two things I wanted to point out:

  1. Since there's code duplication between the two periodic functions collectPeriodic and collectPeriodicPprofs now, It seems like we can do away with a common "util" method which accepts an interval and a function to execute in set intervals as arguments to avoid duplication. It looks like a good candidate to abstract out. What do you think?
  2. While testing my changes make test, I encountered a lot of unrelated errors, I'm certain that these are due to my environment being not setup properly. Is this expected? Should I narrow down the testing scope for this issue?

I haven't raised a PR yet but If you think this change here looks like its in the right direction, I'll raise a PR.

danishprakash@c375ed0

@danishprakash
Copy link
Contributor

subtle reminder, feel free to ignore otherwise. @tgross

@tgross
Copy link
Member Author

tgross commented Jan 25, 2022

Hi @danishprakash that seems more-or-less the right idea. Can you open it as a PR? That'd help me loop in @davemay99 as well.

@Amier3 Amier3 added the help-wanted We encourage community PRs for these issues! label Apr 1, 2022
@github-actions
Copy link

github-actions bot commented Oct 9, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 9, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
good first issue help-wanted We encourage community PRs for these issues! theme/cli type/enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants