Remote agent pprof endpoints #6841

Merged
merged 21 commits into from
Jan 10, 2020

drewbailey
Contributor

@drewbailey drewbailey commented Dec 11, 2019

This PR adds server and client RPC endpoints that let operators generate pprof reports for any given node or server, as long as they have the proper ACL privileges.

A new HTTP endpoint, /v1/agent/pprof/, behaves like the typical Go pprof endpoints (https://golang.org/pkg/net/http/pprof/) but forwards the request to remote nodes.
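
As a rough illustration of how a caller might use the new endpoint, here is a minimal Go sketch that builds a request URL and saves the response to a file. The `seconds` parameter and the `/v1/agent/pprof/<profile>` path come from the curl examples in this thread; the `node_id` query parameter name, the use of the `X-Nomad-Token` header for the ACL token, and the helper names `pprofURL` and `fetchProfile` are assumptions made for illustration, not something this description confirms.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"strconv"
)

// pprofURL builds a URL for the agent pprof endpoint.
// The "node_id" parameter name is an assumption for illustration.
func pprofURL(addr, profile string, seconds int, nodeID string) (string, error) {
	u, err := url.Parse(addr)
	if err != nil {
		return "", err
	}
	u.Path = "/v1/agent/pprof/" + profile
	q := url.Values{}
	if seconds > 0 {
		q.Set("seconds", strconv.Itoa(seconds))
	}
	if nodeID != "" {
		q.Set("node_id", nodeID)
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// fetchProfile downloads the profile at target into outPath and
// returns the number of bytes written.
func fetchProfile(target, token, outPath string) (int64, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return 0, err
	}
	if token != "" {
		// Assumed: the ACL token is passed in the X-Nomad-Token header.
		req.Header.Set("X-Nomad-Token", token)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status %s", resp.Status)
	}
	out, err := os.Create(outPath)
	if err != nil {
		return 0, err
	}
	defer out.Close()
	return io.Copy(out, resp.Body)
}

func main() {
	addr := os.Getenv("NOMAD_ADDR")
	if addr == "" {
		fmt.Println("NOMAD_ADDR is not set; skipping request")
		return
	}
	target, err := pprofURL(addr, "profile", 5, "")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	n, err := fetchProfile(target, os.Getenv("NOMAD_TOKEN"), "out.profile")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("wrote %d bytes to out.profile\n", n)
}
```

The saved file can then be inspected with `go tool pprof out.profile`, the same way as output from a local net/http/pprof endpoint.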

TODO

  • Docs

@drewbailey drewbailey force-pushed the f-agent-pprof-acl branch 3 times, most recently from 4d58bb3 to 5a2eec2 Compare December 13, 2019 15:50
@drewbailey drewbailey marked this pull request as ready for review December 13, 2019 20:15
@notnoop notnoop self-requested a review December 16, 2019 15:46
Member

@schmichael schmichael left a comment

Code looks great! Mostly docs/wording/style comments that are not blockers.

Contributor

@notnoop notnoop left a comment

This is quite meaty - great thinking through so many cases and conditions. I have many stylistic nitpicks though.

I'd be curious whether we've considered a streaming RPC approach, with command/agent/profile effectively invoking the pprof handlers: a wrapper RequestHandler could stream results to the HTTP server directly. The logging endpoints might be a pattern to follow here? Doing so would let us keep parity with how the pprof endpoints handle requests and avoid loading the entire profile in memory. I don't have a sense of how big the profiles would be (I guess memory-related ones can be very large in a busy cluster).

@drewbailey drewbailey force-pushed the f-agent-pprof-acl branch 3 times, most recently from df82bd4 to 92b5140 Compare December 20, 2019 18:51
Contributor

@notnoop notnoop left a comment

lgtm - thanks.

@drewbailey
Contributor Author

@notnoop I've been doing some testing on how large the profiles can get on a busy server.

On a t2.2xl server with 27/31 GB utilized (all pending Nomad jobs), I got the following results. Trace is by far the largest and grows with the duration of the request. I'm wondering whether that's small enough to ease your concerns around streaming, or if it's something we should still consider doing in the near term.

cc @schmichael

trace of 55 seconds -> 72 MB profile

→ curl -v -o out.profile $NOMAD_ADDR/v1/agent/pprof/trace\?seconds=55
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.0.15.160:4646...
* TCP_NODELAY set
* Connected to nomad-server-lb-232516302.us-east-1.elb.amazonaws.com (52.0.15.160) port 4646 (#0)
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0> GET /v1/agent/pprof/trace?seconds=55 HTTP/1.1
> Host: nomad-server-lb-232516302.us-east-1.elb.amazonaws.com:4646
> User-Agent: curl/7.65.3
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:55 --:--:--     0* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Disposition: attachment; filename="trace"
< Content-Type: application/octet-stream
< Date: Thu, 09 Jan 2020 19:36:52 GMT
< Vary: Accept-Encoding
< X-Content-Type-Options: nosniff
< transfer-encoding: chunked
< Connection: keep-alive
<
{ [14225 bytes data]
100 71.7M    0 71.7M    0     0  1263k      0 --:--:--  0:00:58 --:--:-- 18.6M
* Connection #0 to host nomad-server-lb-232516302.us-east-1.elb.amazonaws.com left intact

goroutine -> 8.4k

→ curl -v -o out.profile $NOMAD_ADDR/v1/agent/pprof/goroutine\?seconds=40
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.201.140.15:4646...
* TCP_NODELAY set
* Connected to nomad-server-lb-232516302.us-east-1.elb.amazonaws.com (52.201.140.15) port 4646 (#0)
> GET /v1/agent/pprof/goroutine?seconds=40 HTTP/1.1
> Host: nomad-server-lb-232516302.us-east-1.elb.amazonaws.com:4646
> User-Agent: curl/7.65.3
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:18 --:--:--     0* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Disposition: attachment; filename="goroutine"
< Content-Type: application/octet-stream
< Date: Thu, 09 Jan 2020 19:54:16 GMT
< Vary: Accept-Encoding
< X-Content-Type-Options: nosniff
< Content-Length: 8406
< Connection: keep-alive
<
{ [6987 bytes data]
100  8406  100  8406    0     0    437      0  0:00:19  0:00:19 --:--:--  2114
* Connection #0 to host nomad-server-lb-232516302.us-east-1.elb.amazonaws.com left intact

heap -> 97k

→ curl -v -o out.profile $NOMAD_ADDR/v1/agent/pprof/heap\?seconds=40
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.0.15.160:4646...
* TCP_NODELAY set
* Connected to nomad-server-lb-232516302.us-east-1.elb.amazonaws.com (52.0.15.160) port 4646 (#0)
> GET /v1/agent/pprof/heap?seconds=40 HTTP/1.1
> Host: nomad-server-lb-232516302.us-east-1.elb.amazonaws.com:4646
> User-Agent: curl/7.65.3
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Disposition: attachment; filename="heap"
< Content-Type: application/octet-stream
< Date: Thu, 09 Jan 2020 19:55:02 GMT
< Vary: Accept-Encoding
< X-Content-Type-Options: nosniff
< transfer-encoding: chunked
< Connection: keep-alive
<
{ [2642 bytes data]
100 98361    0 98361    0     0   533k      0 --:--:-- --:--:-- --:--:--  533k
* Connection #0 to host nomad-server-lb-232516302.us-east-1.elb.amazonaws.com left intact

profile -> 9k

→ curl -v -o out.profile $NOMAD_ADDR/v1/agent/pprof/profile\?seconds=40
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.0.15.160:4646...
* TCP_NODELAY set
* Connected to nomad-server-lb-232516302.us-east-1.elb.amazonaws.com (52.0.15.160) port 4646 (#0)
> GET /v1/agent/pprof/profile?seconds=40 HTTP/1.1
> Host: nomad-server-lb-232516302.us-east-1.elb.amazonaws.com:4646
> User-Agent: curl/7.65.3
> Accept: */*
>
  0     0    0     0    0     0      0      0 --:--:--  0:00:39 --:--:--     0* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Disposition: attachment; filename="profile"
< Content-Type: application/octet-stream
< Date: Thu, 09 Jan 2020 19:56:44 GMT
< Vary: Accept-Encoding
< X-Content-Type-Options: nosniff
< Content-Length: 9028
< Connection: keep-alive
<
{ [9028 bytes data]
100  9028  100  9028    0     0    225      0  0:00:40  0:00:40 --:--:--  1874
* Connection #0 to host nomad-server-lb-232516302.us-east-1.elb.amazonaws.com left intact
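
For a rough sense of scale, the trace measurement above (71.7M as reported by curl, for a 55 second request) works out to about 1.3M per second. The Go sketch below does that arithmetic and extrapolates to other durations. The linear growth model is an assumption: this thread only says that trace size grows with the request duration, and the actual rate will depend on how busy the server is.

```go
package main

import "fmt"

// ratePerSecond returns the average output rate for an observed profile.
func ratePerSecond(totalMiB, seconds float64) float64 {
	return totalMiB / seconds
}

// estimateMiB extrapolates linearly from an observed rate.
// Linear growth is an assumption, not a measured property.
func estimateMiB(rate, seconds float64) float64 {
	return rate * seconds
}

func main() {
	// Observed in this thread: 71.7M for a 55 second trace.
	rate := ratePerSecond(71.7, 55)
	fmt.Printf("observed rate: %.2f MiB/s\n", rate)
	for _, s := range []float64{10, 30, 120} {
		fmt.Printf("estimated size for %3.0fs trace: %.1f MiB\n", s, estimateMiB(rate, s))
	}
}
```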

wip, agent endpoint and client endpoint for pprof profiles

agent endpoint test
Return rpc errors for profile requests, set up remote forwarding to
target leader or server id for profile requests.

server forwarding, endpoint tests
rename implementation method
m -> a receiver name

return codederrors, fix query
tidy up, add comments

clean up seconds param assignment
helper func to return serverPart based off of serverID
prevent region forwarding loop, backfill tests

fix failing test
Passes in agent enable_debug config to nomad server and client configs.
This allows for rpc endpoints to have more granular control if they
should be enabled or not in combination with ACLs.

enable debug on client test
fix test expectation

test wrapNonJSON
Address pr feedback, rename profile package to pprof to more accurately
describe its purpose. Adds gc param for heap lookup profiles.
comment why we ignore errors parsing params
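
One way the enable_debug setting could combine with ACLs, sketched in Go. This is a hypothetical policy written for illustration, not the code from these commits: it assumes that when ACLs are enabled the token must carry an agent write capability, and that when ACLs are disabled the endpoint is only available if enable_debug is set. The function and parameter names are invented.

```go
package main

import (
	"errors"
	"fmt"
)

var errPermissionDenied = errors.New("permission denied")

// checkProfileAccess is a hypothetical gate for a profile RPC endpoint.
// Assumed policy: with ACLs on, the token decides; with ACLs off,
// the agent's enable_debug setting decides.
func checkProfileAccess(aclEnabled, tokenAllowsAgentWrite, enableDebug bool) error {
	if aclEnabled {
		if !tokenAllowsAgentWrite {
			return errPermissionDenied
		}
		return nil
	}
	if !enableDebug {
		return errPermissionDenied
	}
	return nil
}

func main() {
	cases := []struct {
		acl, write, debug bool
	}{
		{true, true, false},
		{true, false, true},
		{false, false, true},
		{false, false, false},
	}
	for _, c := range cases {
		err := checkProfileAccess(c.acl, c.write, c.debug)
		fmt.Printf("acl=%v agentWrite=%v enableDebug=%v -> allowed=%v\n",
			c.acl, c.write, c.debug, err == nil)
	}
}
```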
@drewbailey drewbailey merged commit ac0fef1 into master Jan 10, 2020
@drewbailey drewbailey deleted the f-agent-pprof-acl branch January 10, 2020 19:52
@schmichael
Member

schmichael commented Jan 10, 2020

@drewbailey Can you create an issue for streaming traces (and make sure it has a link to the discussion here)? As discussed it's not something I think we need to prioritize, but it might make a good starter issue for someone wanting to learn the RPC internals. Or in the future if there's a tool that expects tracing to be a stream, it'd be important to update our implementation.

@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 21, 2023