Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agentlog: add pprof debug http handler #3237

Merged
merged 4 commits into from
Jun 14, 2023

Conversation

christoph-zededa
Copy link
Contributor

pprof show various insights into the process and can be accessed via http://localhost:6543/

to start it, send USR2 to zedbox or run eve http-debug; to stop it, do curl localhost:6543/stop -X POST
or eve http-debug stop

@christoph-zededa
Copy link
Contributor Author

christoph-zededa commented May 23, 2023

TODO @christoph-zededa : yetus warnings

done ...

@christoph-zededa
Copy link
Contributor Author

Previous comment from @milan-zededa : Maybe we should take this opportunity to document all the features of agentlog that one may use for debugging purposes? (e.g. under this folder) - #3229 (comment)

If it is not too much, I can add your suggestions. Also I added https://github.com/lf-edge/eve/pull/3237/files#diff-e76641ce9bfa02962b0687e68a963a44dfa9d1a4f427cfafd9fbfa36d55986f4R225

@christoph-zededa christoph-zededa force-pushed the enable_pprof branch 2 times, most recently from 4d7812c to 4a89c72 Compare May 24, 2023 12:20
conf/grub.cfg Outdated
@@ -3,3 +3,4 @@
# set_global eve_flavor kvm
# to force booting in Xen mode, uncomment:
# set_global eve_flavor xen
set_getty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added by mistake. And yes, we need to change how we prepare the debug grub config. Otherwise it is really annoying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed it
(I let this comment open in case I accidentally re-add it ...)

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One heads up. Some stack trace code is invoke from the logrus log.Panic callback, and that can in turn not call back into logrus. Should be comments around that code, but I don't recall how it relates to dumping all of the goroutine stacks. Can you double-check that that code path is not affected by this restructuring?

pkill -USR2 /opt/zededa/bin/zedbox
fi

wget -O - "$URL" --post-data ''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think we have wget included in EVE-OS.
We have curl e.g., in the debug container, but not in pillar etc.Will this be invoked from inside the debug container?
Note that a console (keyboard and screen) lands in a different place than a ssh to EVE-OS.,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to wget as we don't have curl in the root container (terminal / serial console).

Now I also checked if we have wget in the debug container and when using ssh - we do.
So it should all work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think we had curl outside of the debug container either. I wonder where that and wget gets pulled in (we want to minimize the set of packages we use to minimize the CVEs we need to look for).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I changed it to nc: 83f3bec

@christoph-zededa
Copy link
Contributor Author

One heads up. Some stack trace code is invoke from the logrus log.Panic callback, and that can in turn not call back into logrus. Should be comments around that code, but I don't recall how it relates to dumping all of the goroutine stacks. Can you double-check that that code path is not affected by this restructuring?

so, initImpl registers an "exit handler" on logrus which calls https://github.com/lf-edge/eve/blob/master/pkg/pillar/agentlog/agentlog.go#L242 which uses getStacks (https://github.com/lf-edge/eve/blob/master/pkg/pillar/agentlog/agentlog.go#L467) to retrieve the stacktrace which is the same method used when getting the stacktrace via USR1 (this one does not log).

So, as far as I understand, the stacktrace is only printed via log as an ExitHandler which is called by logrus.Exit or when a "Fatal log entry is made" (https://github.com/sirupsen/logrus/blob/master/alt_exit.go#L56) and log.Fatal is only used before registering the ExitHandler (in spoofStdFDs); so there is no re-entrancy.
Our FatalHandler is calling RebootReason which has the comment "NOTE: can not use log here since we are called from a log hook!" I guess you meant. Nothing has been changed here.

@christoph-zededa christoph-zededa marked this pull request as ready for review May 31, 2023 22:01
@christoph-zededa christoph-zededa requested a review from rvs as a code owner May 31, 2023 22:01
pprof show various insights into the process and can be
accessed via http://localhost:6543/

to start it, send USR2 to zedbox or run `eve http-debug`;
to stop it, do curl localhost:6543/stop -X POST
or `eve http-debug stop`

Signed-off-by: Christoph Ostarek <christoph@zededa.com>
this first starts the pprof interface on zedbox and then forwards
the port to the edgeview client
once the user is done the pprof interface is closed again

Signed-off-by: Christoph Ostarek <christoph@zededa.com>
@christoph-zededa christoph-zededa force-pushed the enable_pprof branch 2 times, most recently from 7dee404 to 8c67fea Compare June 7, 2023 16:02
in order to keep the image small and reduce the attack surface
wget may be removed in the future, so instead use nc

Signed-off-by: Christoph Ostarek <christoph@zededa.com>
Copy link
Contributor

@naiming-zededa naiming-zededa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Makefile Outdated
@@ -46,6 +46,8 @@ INSTALLER_IMG_FORMAT=raw
VERIFICATION_IMG_FORMAT=raw
# SSH port to use for running images live
SSH_PORT=2222
# Debug port to access debugging information (e.g. pprof)
DEBUG_PORT=6543
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the normal pprof port '6060' is used? https://pkg.go.dev/net/http/pprof

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code does not exist anymore - is outdated.

iptables -t mangle -F
echo "Listening on :6543 -- use 'eve http-debug stop' to stop"
elif [ "$2" = "stop" ]; then
wget -O - 127.1:6543/stop --post-data '' > /dev/null 2>&1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to remove the iptable setup after stop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code does not exist anymore - is outdated.

pkg/pillar/agentlog/agentlog.go Outdated Show resolved Hide resolved
(both scripts can be found <a href="https://github.com/brendangregg/FlameGraph">here</a>)
`

mux.Handle("/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zedcloud side has plan to move away from mux router due to: https://github.com/gorilla#gorilla-toolkit, not sure about the EVE status on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about the status in EVE; I guess we need to discuss this soon.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@naiming-zededa @christoph-zededa this is not gorillamux; just the simple http.NewServeMux() which we also use in zedrouter/server.go

@@ -982,6 +983,22 @@ func dispAFile(f os.FileInfo) {
fmt.Printf("%s, %v, %d, %s\n", f.Mode().String(), f.ModTime(), f.Size(), f.Name())
}

func runPprof() {
cmd := exec.Command("/usr/bin/pkill", "-USR2", "/opt/zededa/bin/zedbox")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you expect the client side to query the pprof stats quite soon? otherwise the tcp will timeout if no activity. Or in rare case the cmd run has error, we need to exit here. Would be more flexible to offer more options. For example, with 'pprof/on' to trigger it to start and send back status of success or not and print out the port being used, 'pprof/off' to shut it down. either use general tcp/localhost:port, or do a 'pprof/run' to do all 3.
also, add the 'help' line in the basics.go for the new command.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"do a 'pprof/run' to do all 3" <- that was my initial plan, we can just drop
5ea17eb
and it works that way.

@@ -1215,7 +1233,7 @@ func runTechSupport(cmds cmdOpt, isLocal bool) {
closePipe(true)

printTitle("\n - system info -\n\n", colorRED, false)
runSystem(cmds, "hw,model,pci,usb,lastreboot,newlog,volume,app,datastore,cipher,dmesg,configitem")
runSystem(cmds, "hw,model,pci,usb,lastreboot,newlog,volume,app,datastore,cipher,dmesg,configitem,pprof")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be in techsupport, since without option input it will just return, and it will not generate any output into the techsupport file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed it

@naiming-zededa
Copy link
Contributor

also, if you can update this document: https://wiki.lfedge.org/display/EVE/EdgeView+Commands to add the new command there. thanks.

let the user of edgeview do it manually

Signed-off-by: Christoph Ostarek <christoph@zededa.com>
"syscall"
"time"

"github.com/lf-edge/eve/pkg/pillar/base"
"github.com/lf-edge/eve/pkg/pillar/types"
fileutils "github.com/lf-edge/eve/pkg/pillar/utils/file"
"github.com/satori/go.uuid"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the need for this change; the package is referenced by uuid in any case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just done by gopls/goimports.
I've read a bit on golang/go#28428 and it seems it speeds up checking for imports.

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
@naiming-zededa have your issues been resolved?

@eriknordmark eriknordmark merged commit 2f6675b into lf-edge:master Jun 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants