elastic-agent diagnostics pprof #28798

michel-laterman · 2021-11-03T21:01:13Z

What does this PR do?

Add a new Pprof grpc call as well as two ways to gather pprof data.

Collect all pprof data from the agent and the beats it runs with

elastic-agent diagnostics collect --pprof

Collect one pprof profile from one application-(route) with

elastic-agent diagnostics pprof [--pprof-application APP-NAME] [--pprof-route-key ROUTE-KEY] [--pprof-type TYPE]

The duration of the trace or profile data can be controlled with the pprof-duration flag, set to 30s by default. The timeout for these commands can be controlled separately with the timeout flag, by default this value is 30s + pprof-duration. timeout should be larger then pprof-duration to allow the command to process the response data.

Why is it important?

This option allows operators to gather pprof data from all applications into the archive in order to assist debugging.
Adds the elastic-agent diagnostics pprof command to allow operators to gather one pprof type for debugging (specifically to be composed with other tools on the command line).

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation PR here
[] ~~I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Run elastic-agent collect --pprof and inspect the archive for pprof data.

Related issues

Relates Beats diagnostics tooling #21965

mergify · 2021-11-03T21:01:47Z

This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v./d./d./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

elasticmachine · 2021-11-03T23:05:26Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Start Time: 2021-12-14T04:56:30.581+0000
Duration: 100 min 0 sec
Commit: 4e6ecbb

Test stats 🧪

Test	Results
Failed	0
Passed	7136
Skipped	16
Total	7152

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Force the elastic-agent and all beats that it starts to run the http/pprof listener on a local socket.

michel-laterman · 2021-11-09T01:36:39Z

@ruflin here's a draft of adding pprof support.

I think we should consider adding a config option in order to serve the pprof endpoints on the standard port (I'm not sure why it's purely done via a command line flag at the moment, is there historical reasons?)

Also let me know if I should split this PR up as it's gotten pretty large

ruflin · 2021-11-11T09:20:36Z

++ on also providing a config option. I always appreciate smaller PR's but leave it up to you if the overhead is worth it.

ruflin

Overall LGTM, would be good if @michalpristas could take a closer look at the code.

One thing that concerns me but is not only related to this PR is that there are no tests added and no tests changed, meaning we don't have any. I know testing the pprof part itself might be tricky but I would hope at least some parts around creating the correct paths, building endpoints etc. could be tested.

ruflin · 2021-11-17T08:28:18Z

x-pack/elastic-agent/pkg/agent/cmd/diagnostics.go

+	cmd := &cobra.Command{
+		Use:   "pprof",
+		Short: "Collect pprof information from a running process.",
+		Long:  "Collect pprof information from the elastic-agent or one of its processes and write to stdout or a file.",


Is my understanding correct that by default this command will take 30s to execute? Maybe we should mention this in the docs here that it is tied to pprof-duration?

ruflin · 2021-11-17T08:30:07Z

x-pack/elastic-agent/pkg/agent/cmd/diagnostics.go

-	innerCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
+	d := time.Second * 30
+	if pprof {
+		d = 2 * pprofDur


Why is it 2* ?

i needed a context timeout that would be larger then the trace duration to give the agent time to process/zip the responses. I can change this to a static addition (i.e. 30s + pprofDur)

Got it. Can you add this comment to the code so someone else touching this code will know why this is there? What approach you take (30s or 2x) I leave to you.

I'll use the 30s+duration value, but I've added an option to specify the top level command timeout so it will be an option to increase it beyond this limit if that's needed. Hopefully with this we can avoid edge cases where it takes the agent over 30s to create an archive.

ruflin · 2021-11-17T08:36:21Z

x-pack/elastic-agent/pkg/agent/cmd/diagnostics.go

+		fmt.Fprintf(streams.Out, "pprof data written to %s\n", outFile)
+		return nil
+	}
+	streams.Out.Write(res.Result)


Should a potential error here to be handled?

ruflin · 2021-11-17T08:39:45Z

x-pack/elastic-agent/pkg/agent/control/server/server.go

-				}
-				endpoint = "npipe"
-			}
+			client := newSocketRequester(time.Second*5, n, rk, endpoint)


@michalpristas Seems some quite fundamental changes here. Can you take a look?

I see further down, most of the code is equal but split up in multiple methods which is a nice cleanup.

ruflin · 2021-11-17T08:40:53Z

x-pack/elastic-agent/pkg/agent/control/server/server.go

+	// retrieve elastic-agent pprof data if requested or application is unspecified.
+	if req.AppName == "" || req.AppName == "elastic-agent" {
+		endpoint := beats.AgentDebugEndpoint(runtime.GOOS)
+		c := newSocketRequester(dur*2, "elastic-agent", "", endpoint)


One more place where we have *2. Curious to understand this better.

I think that the duration passed to newSocketRequester is an artifact of my development (I was not doing requests with contexts at one point); i'll clean this up

Change pprof timeouts from 2*pprofDur to 30s+pprofDur. Remove timeouts from the socket requester client as cancellations for long running requests will be handled by the passed ctx.

michel-laterman · 2021-11-19T00:45:27Z

@ruflin, I think the diagnostics commands are a good candidate for adding an e2e test. I'm not sure if we should validate archive structure, i think just ensuring the command exited with a 0 and that an archive was created may be goo enough for now.

ruflin · 2021-11-22T08:31:42Z

👍 on adding an e2e for the diagnostic command. Can you follow up on this in the e2e repo after this got in?

Should this be marked as ready for review? There are some CI complaints ..

Fix linting issues with new command. Add a timeout flag for when pprof info is gathered. Flag will let users specify the command timeout value. This value whould be greater then the pprof-duration as it needs to gather and process pprof data.

elasticmachine · 2021-11-23T04:13:14Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

ruflin

LGTM

@ferullo Does endpoint have something similar to pprof which could also be read out in the future?

ruflin · 2021-11-24T07:40:25Z

x-pack/elastic-agent/CHANGELOG.next.asciidoc

@@ -150,3 +150,4 @@
 - Add diagnostics collect command to gather beat metadata, config, policy, and logs and bundle it into an archive. {pull}28461[28461]
 - Add `KIBANA_FLEET_SERVICE_TOKEN` to Elastic Agent container. {pull}28096[28096]
 - Enable pprof endpoints for beats processes. Allow pprof endpoints for elastic-agent if enabled. {pull}28983[28983]
+- Add `--pprof` flag to `elastic-agent diagnostics` and an `elastic-agent pprof` command to allow operators to gather pprof data from the agent and beats running under it. {pull}28798[28798]


Lets make sure we follow up with proper docs.

already in a pr

ruflin · 2021-11-24T07:49:09Z

x-pack/elastic-agent/pkg/agent/control/server/server.go

+		}
+		programs, ok := routes.Get(rk)
+		if !ok {
+			s.logger.With("route_key", rk).Warn("Unable to retrieve route.")


In the past we tried to use Warn as little as possible as it is not clear if the user needs to take action or not. What are the reasons we can't retrieve the route key?

I'm not sure why we would be unable to retrieve routes in this case, but I thought logging any issue could help with debugging/edge cases

ruflin · 2021-11-24T07:50:50Z

x-pack/elastic-agent/pkg/agent/control/server/server.go

+			c := newSocketRequester(n, rk, endpoint)
+
+			// Launch a concurrent goroutine to gather all pprof endpoints from a socket.
+			for _, opt := range req.PprofType {


We have here 3 for loops nested inside each other. Maybe we could extract some into methods which would make reviewing and potentially testing easier.

i'll work on it

ruflin · 2021-11-24T08:21:49Z

I did try to run this locally by running ./elastic-agent run. Likely because of this I get the following errors for pprof:

Get "http://unix/debug/pprof/profile?seconds=30": dial unix /tmp/elastic-agent/93926c22716b6d271825bfa3d3c9f0b3434619940c4a3ccba25d96ff2322946f.sock: connect: no such file or directory

Is it expected to also work if I have not installed elastic-agent but only run it?

After I run ./elastic-agent diagnostics collect --pprof for 30s nothing happens which of course is expected. But before we start the diagnostics run, could we print out a message to std that now collection starts and will likely take x seconds?

michel-laterman · 2021-11-24T16:23:14Z

I haven't tried testing this when run is used instead of installing, i can get to it.

mergify · 2021-11-30T20:29:05Z

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b elastic-agent-profile upstream/elastic-agent-profile
git merge upstream/master
git push upstream elastic-agent-profile

michel-laterman · 2021-12-06T17:43:51Z

Right now there's a bug with how the agent handles the agent.monitoring.pprof attribute. The output of NewMonitor

beats/x-pack/elastic-agent/pkg/core/monitoring/beats/beats_monitor.go

Line 35 in 27d1a6e

    
           func NewMonitor(downloadConfig *artifact.Config, monitoringCfg *monitoringConfig.MonitoringConfig, logMetrics bool) *Monitor {

references the attribute in the file. However, the non-nill config passed to reload

beats/x-pack/elastic-agent/pkg/core/monitoring/beats/beats_monitor.go

Line 49 in 27d1a6e

func (b *Monitor) Reload(rawConfig *config.Config) error {

has the (default) false setting. I'm trying to find out why this is happening.

michel-laterman · 2021-12-06T22:56:22Z

I think the issue with the pprof attribute is that on startup in fleet mode, the agent gets a policy update from fleet in

beats/x-pack/elastic-agent/pkg/agent/application/pipeline/actions/handlers/handler_action_policy_change.go

Line 75 in 191a075

    
           func (h *PolicyChange) Handle(ctx context.Context, a fleetapi.Action, acker store.FleetAcker) error {

The policy sent by fleet does not have a reference to the pprof attribute

Chagne the setting in elastic agent from agent.monioring.pprof to agent.monitoring.pprof.enabled so that policy updates (such as the one that occurs when the agent is starting in fleet mode) do not use the default false value if the user has injected the ssetting into fleet.yml

michel-laterman · 2021-12-14T05:06:37Z

x-pack/elastic-agent/pkg/core/monitoring/config/config.go

@@ -39,6 +46,5 @@ func DefaultConfig() *MonitoringConfig {
 			Port:    defaultPort,
 		},
 		Namespace: defaultNamespace,
-		Pprof:     false,


@ruflin, I changed this from a default false behaviour to default nil so that a value inserted into fleet.yml would be used when the Reload function is called (so the setting is now applied to beats that the agent starts)

Interesting, so value only gets overwritten if not set yet?

@narph This likely should also feed into your thoughts on how these files should work in the future.

* Allow -httpprof to bind to sockets/pipes * Enable pprof debug endpoint on socket for agent and beats Force the elastic-agent and all beats that it starts to run the http/pprof listener on a local socket. * Add new Pprof command to control.proto * Add pprof option to diagnostics collect * Fix linting issues * Add diagonstics pprof command allow pprof to collect from agent * Revert debug socket changes * Cleanup timeout handling Change pprof timeouts from 2*pprofDur to 30s+pprofDur. Remove timeouts from the socket requester client as cancellations for long running requests will be handled by the passed ctx. * Fix linting issue add timeout flag Fix linting issues with new command. Add a timeout flag for when pprof info is gathered. Flag will let users specify the command timeout value. This value whould be greater then the pprof-duration as it needs to gather and process pprof data. * Add more command help text. * Add CHANGELOG * move spec collection for routes to fn * add monitoringCfg reference to control server * elastic-agent server only processes pprof requests when enabled * Fix error message fix commands only on elastic-agent * Add pprof fleet.yml, fix nil reference * Change pprof setting name to monitoring.pprof.enabled Chagne the setting in elastic agent from agent.monioring.pprof to agent.monitoring.pprof.enabled so that policy updates (such as the one that occurs when the agent is starting in fleet mode) do not use the default false value if the user has injected the ssetting into fleet.yml (cherry picked from commit f5e0ec4)

* Allow -httpprof to bind to sockets/pipes * Enable pprof debug endpoint on socket for agent and beats Force the elastic-agent and all beats that it starts to run the http/pprof listener on a local socket. * Add new Pprof command to control.proto * Add pprof option to diagnostics collect * Fix linting issues * Add diagonstics pprof command allow pprof to collect from agent * Revert debug socket changes * Cleanup timeout handling Change pprof timeouts from 2*pprofDur to 30s+pprofDur. Remove timeouts from the socket requester client as cancellations for long running requests will be handled by the passed ctx. * Fix linting issue add timeout flag Fix linting issues with new command. Add a timeout flag for when pprof info is gathered. Flag will let users specify the command timeout value. This value whould be greater then the pprof-duration as it needs to gather and process pprof data. * Add more command help text. * Add CHANGELOG * move spec collection for routes to fn * add monitoringCfg reference to control server * elastic-agent server only processes pprof requests when enabled * Fix error message fix commands only on elastic-agent * Add pprof fleet.yml, fix nil reference * Change pprof setting name to monitoring.pprof.enabled Chagne the setting in elastic agent from agent.monioring.pprof to agent.monitoring.pprof.enabled so that policy updates (such as the one that occurs when the agent is starting in fleet mode) do not use the default false value if the user has injected the ssetting into fleet.yml (cherry picked from commit f5e0ec4) Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>

Allow -httpprof to bind to sockets/pipes

dd06f1f

michel-laterman added enhancement Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Nov 3, 2021

botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Nov 3, 2021

mergify bot added the backport-skip Skip notification from the automated backport with mergify label Nov 3, 2021

michel-laterman added 5 commits November 4, 2021 12:31

Enable pprof debug endpoint on socket for agent and beats

b035fe8

Force the elastic-agent and all beats that it starts to run the http/pprof listener on a local socket.

Add new Pprof command to control.proto

78069ae

Add pprof option to diagnostics collect

131bf1f

Fix linting issues

c61c010

Add diagonstics pprof command allow pprof to collect from agent

e3fba76

michel-laterman added the backport-v7.16.0 Automated backport with mergify label Nov 9, 2021

mergify bot removed the backport-skip Skip notification from the automated backport with mergify label Nov 9, 2021

michel-laterman mentioned this pull request Nov 10, 2021

Add config option to attach pprof endpoints to http socket #28902

Merged

6 tasks

ruflin added the backport-v8.0.0 Automated backport with mergify label Nov 11, 2021

michel-laterman mentioned this pull request Nov 16, 2021

Enable pprof for elastic-agent and beats #28983

Merged

5 tasks

ruflin reviewed Nov 17, 2021

View reviewed changes

michel-laterman added 3 commits November 17, 2021 13:54

Merge branch 'master' into elastic-agent-profile

f750e5b

Revert debug socket changes

dc66dbd

Cleanup timeout handling

bbaf43a

Change pprof timeouts from 2*pprofDur to 30s+pprofDur. Remove timeouts from the socket requester client as cancellations for long running requests will be handled by the passed ctx.

michel-laterman added 3 commits November 22, 2021 13:09

Merge remote-tracking branch 'origin/master' into elastic-agent-profile

f29e207

Fix linting issue add timeout flag

c82f697

Fix linting issues with new command. Add a timeout flag for when pprof info is gathered. Flag will let users specify the command timeout value. This value whould be greater then the pprof-duration as it needs to gather and process pprof data.

Add more command help text.

a9b3693

michel-laterman changed the title ~~Elastic-agent pprof enablement~~ elastic-agent diagnostics pprof Nov 23, 2021

Add CHANGELOG

b4ffdf0

michel-laterman marked this pull request as ready for review November 23, 2021 04:13

ruflin approved these changes Nov 24, 2021

View reviewed changes

michel-laterman added 2 commits November 24, 2021 19:35

move spec collection for routes to fn

8f145a6

add monitoringCfg reference to control server

1b2a9ff

michel-laterman added 3 commits December 2, 2021 11:30

Merge remote-tracking branch 'origin/master' into elastic-agent-profile

639707c

elastic-agent server only processes pprof requests when enabled

e6707c4

Fix error message fix commands only on elastic-agent

1bdc078

Add pprof fleet.yml, fix nil reference

91140b7

nimarezainia mentioned this pull request Dec 13, 2021

[Elastic Agent] Improve Elastic Agent debuggability #26930

Open

30 tasks

michel-laterman commented Dec 14, 2021

View reviewed changes

michel-laterman removed the backport-v7.16.0 Automated backport with mergify label Dec 14, 2021

michel-laterman merged commit f5e0ec4 into elastic:master Dec 14, 2021

michel-laterman deleted the elastic-agent-profile branch December 14, 2021 20:01

mergify bot mentioned this pull request Dec 14, 2021

[8.0](backport #28798) elastic-agent diagnostics pprof #29429

Merged

elastic-agent diagnostics pprof #28798

elastic-agent diagnostics pprof #28798

Conversation

michel-laterman commented Nov 3, 2021 • edited Loading

What does this PR do?

Why is it important?

Checklist

How to test this PR locally

Related issues

mergify bot commented Nov 3, 2021

elasticmachine commented Nov 3, 2021 • edited by jenkins-beats-ci bot Loading

💚 Build Succeeded

Build stats

Test stats 🧪

💚 Flaky test report

🤖 GitHub comments

michel-laterman commented Nov 9, 2021

ruflin commented Nov 11, 2021

ruflin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michel-laterman commented Nov 19, 2021

ruflin commented Nov 22, 2021

elasticmachine commented Nov 23, 2021

ruflin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ruflin commented Nov 24, 2021

michel-laterman commented Nov 24, 2021

mergify bot commented Nov 30, 2021

michel-laterman commented Dec 6, 2021

michel-laterman commented Dec 6, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michel-laterman commented Nov 3, 2021 •

edited

Loading

elasticmachine commented Nov 3, 2021 •

edited by jenkins-beats-ci bot

Loading