-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
elastic-agent diagnostics pprof #28798
elastic-agent diagnostics pprof #28798
Conversation
This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
NOTE: |
💚 Build Succeeded
Expand to view the summary
Build stats
Test stats 🧪
💚 Flaky test reportTests succeeded. 🤖 GitHub commentsTo re-run your PR in the CI, just comment with:
|
Force the elastic-agent and all beats that it starts to run the http/pprof listener on a local socket.
@ruflin here's a draft of adding pprof support. I think we should consider adding a config option in order to serve the pprof endpoints on the standard port (I'm not sure why it's purely done via a command line flag at the moment, is there historical reasons?) Also let me know if I should split this PR up as it's gotten pretty large |
++ on also providing a config option. I always appreciate smaller PR's but leave it up to you if the overhead is worth it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, would be good if @michalpristas could take a closer look at the code.
One thing that concerns me but is not only related to this PR is that there are no tests added and no tests changed, meaning we don't have any. I know testing the pprof part itself might be tricky but I would hope at least some parts around creating the correct paths, building endpoints etc. could be tested.
cmd := &cobra.Command{ | ||
Use: "pprof", | ||
Short: "Collect pprof information from a running process.", | ||
Long: "Collect pprof information from the elastic-agent or one of its processes and write to stdout or a file.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is my understanding correct that by default this command will take 30s to execute? Maybe we should mention this in the docs here that it is tied to pprof-duration?
innerCtx, cancel := context.WithTimeout(ctx, 30*time.Second) | ||
d := time.Second * 30 | ||
if pprof { | ||
d = 2 * pprofDur |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it 2* ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i needed a context timeout that would be larger then the trace duration to give the agent time to process/zip the responses. I can change this to a static addition (i.e. 30s + pprofDur)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Can you add this comment to the code so someone else touching this code will know why this is there? What approach you take (30s or 2x) I leave to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll use the 30s+duration value, but I've added an option to specify the top level command timeout so it will be an option to increase it beyond this limit if that's needed. Hopefully with this we can avoid edge cases where it takes the agent over 30s to create an archive.
fmt.Fprintf(streams.Out, "pprof data written to %s\n", outFile) | ||
return nil | ||
} | ||
streams.Out.Write(res.Result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should a potential error here to be handled?
} | ||
endpoint = "npipe" | ||
} | ||
client := newSocketRequester(time.Second*5, n, rk, endpoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michalpristas Seems some quite fundamental changes here. Can you take a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see further down, most of the code is equal but split up in multiple methods which is a nice cleanup.
// retrieve elastic-agent pprof data if requested or application is unspecified. | ||
if req.AppName == "" || req.AppName == "elastic-agent" { | ||
endpoint := beats.AgentDebugEndpoint(runtime.GOOS) | ||
c := newSocketRequester(dur*2, "elastic-agent", "", endpoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more place where we have *2
. Curious to understand this better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the duration passed to newSocketRequester
is an artifact of my development (I was not doing requests with contexts at one point); i'll clean this up
Change pprof timeouts from 2*pprofDur to 30s+pprofDur. Remove timeouts from the socket requester client as cancellations for long running requests will be handled by the passed ctx.
@ruflin, I think the diagnostics commands are a good candidate for adding an e2e test. I'm not sure if we should validate archive structure, i think just ensuring the command exited with a 0 and that an archive was created may be goo enough for now. |
👍 on adding an e2e for the diagnostic command. Can you follow up on this in the e2e repo after this got in? Should this be marked as ready for review? There are some CI complaints .. |
Fix linting issues with new command. Add a timeout flag for when pprof info is gathered. Flag will let users specify the command timeout value. This value whould be greater then the pprof-duration as it needs to gather and process pprof data.
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@ferullo Does endpoint have something similar to pprof which could also be read out in the future?
@@ -150,3 +150,4 @@ | |||
- Add diagnostics collect command to gather beat metadata, config, policy, and logs and bundle it into an archive. {pull}28461[28461] | |||
- Add `KIBANA_FLEET_SERVICE_TOKEN` to Elastic Agent container. {pull}28096[28096] | |||
- Enable pprof endpoints for beats processes. Allow pprof endpoints for elastic-agent if enabled. {pull}28983[28983] | |||
- Add `--pprof` flag to `elastic-agent diagnostics` and an `elastic-agent pprof` command to allow operators to gather pprof data from the agent and beats running under it. {pull}28798[28798] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets make sure we follow up with proper docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
already in a pr
} | ||
programs, ok := routes.Get(rk) | ||
if !ok { | ||
s.logger.With("route_key", rk).Warn("Unable to retrieve route.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the past we tried to use Warn as little as possible as it is not clear if the user needs to take action or not. What are the reasons we can't retrieve the route key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we would be unable to retrieve routes in this case, but I thought logging any issue could help with debugging/edge cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
c := newSocketRequester(n, rk, endpoint) | ||
|
||
// Launch a concurrent goroutine to gather all pprof endpoints from a socket. | ||
for _, opt := range req.PprofType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have here 3 for loops nested inside each other. Maybe we could extract some into methods which would make reviewing and potentially testing easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll work on it
I did try to run this locally by running
Is it expected to also work if I have not installed elastic-agent but only run it? After I run |
I haven't tried testing this when run is used instead of installing, i can get to it. |
This pull request is now in conflicts. Could you fix it? 🙏
|
Right now there's a bug with how the agent handles the
references the attribute in the file. However, the non-nill config passed to reload
false setting. I'm trying to find out why this is happening.
|
I think the issue with the pprof attribute is that on startup in fleet mode, the agent gets a policy update from fleet in Line 75 in 191a075
The policy sent by fleet does not have a reference to the pprof attribute |
Chagne the setting in elastic agent from agent.monioring.pprof to agent.monitoring.pprof.enabled so that policy updates (such as the one that occurs when the agent is starting in fleet mode) do not use the default false value if the user has injected the ssetting into fleet.yml
@@ -39,6 +46,5 @@ func DefaultConfig() *MonitoringConfig { | |||
Port: defaultPort, | |||
}, | |||
Namespace: defaultNamespace, | |||
Pprof: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ruflin, I changed this from a default false behaviour to default nil so that a value inserted into fleet.yml would be used when the Reload
function is called (so the setting is now applied to beats that the agent starts)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, so value only gets overwritten if not set yet?
@narph This likely should also feed into your thoughts on how these files should work in the future.
* Allow -httpprof to bind to sockets/pipes * Enable pprof debug endpoint on socket for agent and beats Force the elastic-agent and all beats that it starts to run the http/pprof listener on a local socket. * Add new Pprof command to control.proto * Add pprof option to diagnostics collect * Fix linting issues * Add diagonstics pprof command allow pprof to collect from agent * Revert debug socket changes * Cleanup timeout handling Change pprof timeouts from 2*pprofDur to 30s+pprofDur. Remove timeouts from the socket requester client as cancellations for long running requests will be handled by the passed ctx. * Fix linting issue add timeout flag Fix linting issues with new command. Add a timeout flag for when pprof info is gathered. Flag will let users specify the command timeout value. This value whould be greater then the pprof-duration as it needs to gather and process pprof data. * Add more command help text. * Add CHANGELOG * move spec collection for routes to fn * add monitoringCfg reference to control server * elastic-agent server only processes pprof requests when enabled * Fix error message fix commands only on elastic-agent * Add pprof fleet.yml, fix nil reference * Change pprof setting name to monitoring.pprof.enabled Chagne the setting in elastic agent from agent.monioring.pprof to agent.monitoring.pprof.enabled so that policy updates (such as the one that occurs when the agent is starting in fleet mode) do not use the default false value if the user has injected the ssetting into fleet.yml (cherry picked from commit f5e0ec4)
* Allow -httpprof to bind to sockets/pipes * Enable pprof debug endpoint on socket for agent and beats Force the elastic-agent and all beats that it starts to run the http/pprof listener on a local socket. * Add new Pprof command to control.proto * Add pprof option to diagnostics collect * Fix linting issues * Add diagonstics pprof command allow pprof to collect from agent * Revert debug socket changes * Cleanup timeout handling Change pprof timeouts from 2*pprofDur to 30s+pprofDur. Remove timeouts from the socket requester client as cancellations for long running requests will be handled by the passed ctx. * Fix linting issue add timeout flag Fix linting issues with new command. Add a timeout flag for when pprof info is gathered. Flag will let users specify the command timeout value. This value whould be greater then the pprof-duration as it needs to gather and process pprof data. * Add more command help text. * Add CHANGELOG * move spec collection for routes to fn * add monitoringCfg reference to control server * elastic-agent server only processes pprof requests when enabled * Fix error message fix commands only on elastic-agent * Add pprof fleet.yml, fix nil reference * Change pprof setting name to monitoring.pprof.enabled Chagne the setting in elastic agent from agent.monioring.pprof to agent.monitoring.pprof.enabled so that policy updates (such as the one that occurs when the agent is starting in fleet mode) do not use the default false value if the user has injected the ssetting into fleet.yml (cherry picked from commit f5e0ec4) Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
What does this PR do?
Add a new
Pprof
grpc call as well as two ways to gather pprof data.The duration of the
trace
orprofile
data can be controlled with thepprof-duration
flag, set to30s
by default. The timeout for these commands can be controlled separately with thetimeout
flag, by default this value is30s + pprof-duration
.timeout
should be larger thenpprof-duration
to allow the command to process the response data.Why is it important?
This option allows operators to gather pprof data from all applications into the archive in order to assist debugging.
Adds the
elastic-agent diagnostics pprof
command to allow operators to gather one pprof type for debugging (specifically to be composed with other tools on the command line).Checklist
I have made corresponding change to the default configuration filesCHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.How to test this PR locally
Run
elastic-agent collect --pprof
and inspect the archive for pprof data.Related issues