Fix Pkill confusion with proper pkill usage and zboot retry #4005

christoph-zededa · 2024-06-26T08:49:25Z

Under certain conditions (#4005 (comment)), busybox pkill/pgrep reports a newly forked subprocess of zedbox (before it has had the chance to exec). As a result, the subprocess is signalled with USR2, which terminates it if it has no USR2 handler, and some of the subprocesses of zedbox have none.
In order to fix this:

only pkill zedbox by using the pidfile of zedbox
for zboot: restart the process if it got killed with USR{1,2}

christoph-zededa · 2024-06-26T08:50:18Z

@OhmSpectator can you have a look if it is okay the way I send USR2 to the memory handler?

eriknordmark · 2024-06-26T08:55:25Z

pkg/dom0-ztools/rootfs/bin/eve

@@ -100,7 +100,7 @@ http_debug_request() {
    fi

    if [ "$running" = "0" ]; then
-        pkill -USR2 zedbox
+        pkill -o -USR2 zedbox


Why would there be more than one zedbox process?

When zedbox tries to execute f.e. qemu-img, then zedbox first does a fork() and then an exec*. In between there are two, aren't there?

I don't know how golang does the exec - it has some constraints, so it might (need to) block signals across some of this in the golang runtime. FWIW I now have three observed crashes, which indicates that something changed in master over the last two days causing the crashes.

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	cmd := exec.Cmd{
		Path: "/bin/ls",
	}
	out, err := cmd.Output()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", out)
	cmd.Run()
	cmd.Wait()
}

exec.strace.txt

child process is 216343

in line 344 I see:

216337 clone(child_stack=NULL, flags=CLONE_VM|CLONE_VFORK|SIGCHLD <unfinished ...>

and soon after (line 370):

216343 rt_sigaction(SIGUSR2, {sa_handler=SIG_DFL, sa_mask=~[], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x469ce0}, NULL, 8) = 0

the signal handler for USR2 gets removed.

in line 472 the exec syscall is finally invoked.

OhmSpectator · 2024-06-26T10:06:16Z

@OhmSpectator can you have a look if it is okay the way I send USR2 to the memory handler?

Yeah, I think it should be fine. I'll test it now as well.

OhmSpectator · 2024-06-26T10:32:57Z

Unfortunately, the busybox version of pkill does not support the -o option, so this will not work until we add the procps package to the base image.

rouming · 2024-06-26T10:43:18Z

Can this be solved by the zedbox pidfile? Pkill supports this by -F option

christoph-zededa · 2024-06-26T10:52:27Z

Can this be solved by the zedbox pidfile? Pkill supports this by -F option

unfortunately not all of the pkill implementations on the system support it:

linuxkit-525400123456:~# pkill -h
pkill: unrecognized option: h
BusyBox v1.35.0 (2022-08-01 15:14:44 UTC) multi-call binary.

Usage: pkill [-l|-SIGNAL] [-xfvno] [-s SID|-P PPID|PATTERN]

Send signal to processes selected by regex PATTERN

	-l	List all signals
	-x	Match whole name (not substring)
	-f	Match against entire command line
	-s SID	Match session ID (0 for current)
	-P PPID	Match parent process ID
	-v	Negate the match
	-n	Signal the newest process only
	-o	Signal the oldest process only
	-u EUID Match against effective UID
	-U UID  Match against UID

christoph-zededa · 2024-06-26T10:53:27Z

Unfortunately, the busybox version of pkill does not support the -o option, so this will not work until we add the procps package to the base image.

should be fixed now - it was a confusion as the busybox pkill does not support this order of parameters ...

@OhmSpectator it should work now

eriknordmark · 2024-06-26T11:08:41Z

Would it be more robust to do the golang equivalent of kill -SIGX $(cat /run/zedbox.pid) as Roman was suggesting?

OhmSpectator · 2024-06-26T11:37:11Z

Ok, at least the latest version of the code works fine.
Could we check this PR with the ztest run that triggered #4002 ?

christoph-zededa · 2024-06-26T11:59:34Z

Would it be more robust to do the golang equivalent of kill -SIGX $(cat /run/zedbox.pid) as Roman was suggesting?

I don't think we would need the golang equivalent, bash would be fine, wouldn't it?

Is /run/zedbox.pid available everywhere we want to run eve http-debug?
I checked the initial, debug and memory-monitor containers and it is available there.

christoph-zededa · 2024-06-26T12:58:10Z

Would it be more robust to do the golang equivalent of kill -SIGX $(cat /run/zedbox.pid) as Roman was suggesting?

I don't think we would need the golang equivalent, bash would be fine, wouldn't it?

Is /run/zedbox.pid available everywhere we want to run eve http-debug? I checked the initial, debug and memory-monitor containers and it is available there.

I changed it (memory-handler does not have a PID file).

christoph-zededa · 2024-06-27T09:33:45Z

Build failed because of:

2024-06-26T16:05:22.4560789Z docker: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit.

milan-zededa · 2024-06-27T09:36:26Z

Build failed because of:

2024-06-26T16:05:22.4560789Z docker: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit.

We have new pulls available today. Take them before they run out :)

codecov · 2024-06-27T09:56:07Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 17.51%. Comparing base (3a02e3f) to head (d13f7fa).
Report is 13 commits behind head on master.

❗ Current head d13f7fa differs from pull request most recent head 8c30bf9

Please upload reports for the commit 8c30bf9 to get more accurate results.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #4005   +/-   ##
=======================================
  Coverage   17.51%   17.51%           
=======================================
  Files           3        3           
  Lines         805      805           
=======================================
  Hits          141      141           
  Misses        629      629           
  Partials       35       35


We saw that a child process of zedbox got killed with USR2 (without a handler installed, the default action of SIGUSR2 is to terminate the process). My theory is that the following happened:

1. zedbox called fork()
2. now there are two zedbox processes
3. pkill finds two zedbox processes
4. child process of zedbox calls exec and removes the USR2 handler
5. pkill sends USR2 to both PIDs
6. zedbox starts http debug as usual
7. child process (now supposed to do something different) gets killed

perhaps fixes lf-edge#4002

Signed-off-by: Christoph Ostarek <christoph@zededa.com>

OhmSpectator

Looks good and should be useful.
However, I have not tested the latest version.

eriknordmark · 2024-06-27T15:26:38Z

FWIW I wrote a golang program which runs exec.Run while the sigusr2 handler is being invoked, and I don't ever see a failure (it has gone through tens of thousands of iterations today). So I wonder if there is something unique with invoking the zboot shell script?

christoph-zededa · 2024-06-28T12:23:54Z

FWIW I wrote a golang program which runs exec.Run while the sigusr2 handler is being invoked, and I don't ever see a failure (it has gone through tens of thousands of iterations today). So I wonder if there is something unique with invoking the zboot shell script?

I think I can somehow reproduce it:

package main

import (
	"os/exec"
	"runtime"
)

func main() {
	for i := 0; i < runtime.NumCPU(); i++ {
		go useCPU()
	}

	runLs()
}

func useCPU() {
	var i int
	for {
		i++
	}
}

func runLs() {
	for {
		cmd := exec.Cmd{
			Path: "/bin/ls",
		}

		_, err := cmd.Output()
		if err != nil {
			panic(err)
		}

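		//fmt.Printf("%s\n", out)
		// Output() has already started and waited for the command, so the
		// Run() and Wait() calls below only return errors, which are ignored.
		cmd.Run()

		cmd.Wait()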
	}

}

Run this as exec-tester.
In a different terminal, run: while :; do busybox pgrep -l exec-tester; done | tee pids.

Then I get:

$ cat pids | sort | uniq
303320 ./exec-tester
316617 ./exec-tester
316626 ./exec-tester
316635 ./exec-tester
316638 ./exec-tester
316647 ./exec-tester
316824 ./exec-tester
316831 ./exec-tester
316852 ./exec-tester
316862 ./exec-tester
316894 ./exec-tester
316919 ./exec-tester
316928 ./exec-tester

It did not reproduce with the regular (non-busybox) pgrep.

eriknordmark

DCO is missing on the ubuntu commit, and that needs to be fixed if we are to build and run eden tests etc on the PR

christoph-zededa · 2024-07-01T09:17:08Z

DCO is missing on the ubuntu commit, and that needs to be fixed if we are to build and run eden tests etc on the PR

Oh, I forgot to remove this one.
That's just my environment when I try to run the go tests locally.

Signed-off-by: Christoph Ostarek <christoph@zededa.com>

eriknordmark

LGTM but please update the description and title to mention the retry of zboot commands.

FWIW I've run this with the kernel setting to trigger on lower pressure without any issues since yesterday. I should check the logs if I got the retry messages from those runs.

christoph-zededa · 2024-07-01T11:00:48Z

LGTM but please update the description and title to mention the retry of zboot commands.

Done

FWIW I've run this with the kernel setting to trigger on lower pressure without any issues since yesterday. I should check the logs if I got the retry messages from those runs.

I think you will not see those as this PR also changes how we invoke pkill; unless this failure does not come from /bin/eve.

OhmSpectator · 2024-07-01T11:23:29Z

pkg/pillar/zboot/zboot.go

@@ -64,8 +64,24 @@ func Poweroff(log *base.LogObject) {

 // If log is nil there is no logging
 func execWithRetry(log *base.LogObject, command string, args ...string) ([]byte, error) {
+	retrySignals := map[syscall.Signal]struct{}{
+		syscall.SIGUSR1: {},
+		syscall.SIGUSR2: {},


How did we decide these two signals are enough?

By guessing ;-)

Do you have other signals in mind?

Afaik, these are the only signals that zedbox receives for some kind of IPC outside of wanting zedbox to terminate (and then child processes should be terminated as well).

Those are the only signals we generate AFAIK. I'm assuming SIGPROF, if used by the go runtime for profiling, is handled internally by the runtime without any errors returned by exec. Might be good to check that.

But in general, does it make sense to use the whitelisting approach here? Would handling all the signals that way be a disadvantage? In the future, we can add new signal handling to the system and forget about this place...

Would handling all the signals that way be a disadvantage?

F.e. I think we should not handle SIGSEGV this way, because that can just end up in an endless loop.

Maybe we can use blacklisting in this case?

Blindly restarting a syscall which was interrupted by a signal (async or sync) seems risky; makes sense to do it for the signals we know are used as part of normal operation and not others.
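
The allowlist idea can be sketched roughly as follows. This is not the merged execWithRetry: only the `retrySignals` map with SIGUSR1 and SIGUSR2 appears in the diff above, while `shouldRetry`, `runWithRetry` and the `maxTries` bound are assumptions added for illustration. The demo in `main` runs a shell script that kills itself with USR2 on its first run only.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"syscall"
)

// retrySignals mirrors the allowlist from the diff: only signals that are
// part of normal operation trigger a retry.
var retrySignals = map[syscall.Signal]struct{}{
	syscall.SIGUSR1: {},
	syscall.SIGUSR2: {},
}

// shouldRetry reports whether err says the child was terminated by an
// allowlisted signal. (Hypothetical helper.)
func shouldRetry(err error) bool {
	var exitErr *exec.ExitError
	if !errors.As(err, &exitErr) {
		return false
	}
	ws, ok := exitErr.Sys().(syscall.WaitStatus)
	if !ok || !ws.Signaled() {
		return false
	}
	_, retry := retrySignals[ws.Signal()]
	return retry
}

// runWithRetry runs the command up to maxTries times and returns the last
// output, the number of attempts and the last error. The bound is an
// assumption that keeps the loop finite.
func runWithRetry(maxTries int, name string, args ...string) ([]byte, int, error) {
	var out []byte
	var err error
	tries := 0
	for tries < maxTries {
		tries++
		out, err = exec.Command(name, args...).CombinedOutput()
		if !shouldRetry(err) {
			break
		}
	}
	return out, tries, err
}

func main() {
	dir, err := os.MkdirTemp("", "retry-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// The marker file makes the script fail with USR2 exactly once.
	marker := filepath.Join(dir, "ran-once")
	script := `if [ -e "$1" ]; then echo done; else : > "$1"; kill -USR2 $$; fi`
	out, tries, err := runWithRetry(3, "sh", "-c", script, "sh", marker)
	fmt.Printf("out=%q tries=%d err=%v\n", out, tries, err)
}
```

A child terminated by any other signal, such as SIGSEGV or SIGTERM, falls through `shouldRetry` and is reported to the caller after the first attempt, which avoids the endless loop mentioned above.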

Okay, that makes sense. Could we add some kind of reminder to the place in the code where we register the handlers? That way, the next person adding a handler will see that there is one more extra place to be fixed.

@OhmSpectator currently they're in agentlog.go, but I doubt that f.e. a handler for SIGTERM would end up there, too. Do you mean something like christoph-zededa@26df498 ?

@eriknordmark

if used by the go runtime for profiling, is handled internally by the runtime without any errors returned by exec
I checked with strace; yes it is using SIGPROF and it also installs a signal handler:

rt_sigaction(SIGPROF, {sa_handler=0x996920, sa_mask=~[RTMIN RT_1 RT_2], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fe613a3eaca}, NULL, 8)

(and clones are done with CLONE_SIGHAND)

but isn't it always sending the signal to its own pid anyway (by design of the syscall)?

timer_create(CLOCK_THREAD_CPUTIME_ID, {sigev_signo=SIGPROF, sigev_notify=SIGEV_THREAD_ID, sigev_notify_thread_id=30794}, [0]) = 0

eriknordmark · 2024-07-01T11:52:18Z

I think you will not see those as this PR also changes how we invoke pkill; unless this failure does not come from /bin/eve.

Yes, if the issue was the window between fork + change of signal handler in the child process + exec of zboot, then I shouldn't see them. FWIW I don't see any "because of signal" in kibana. Will look again in a week and see if any test triggered it.

eriknordmark

Run eden

OhmSpectator

In general, it looks good, and the issue should be fixed.

christoph-zededa requested a review from rouming as a code owner June 26, 2024 08:49

github-actions bot requested a review from eriknordmark June 26, 2024 08:49

eriknordmark reviewed Jun 26, 2024

OhmSpectator self-requested a review June 26, 2024 10:34

christoph-zededa force-pushed the pkill_confusion branch from 67408df to d2e0437 Compare June 26, 2024 10:50

github-actions bot requested a review from eriknordmark June 26, 2024 10:50

christoph-zededa force-pushed the pkill_confusion branch from d2e0437 to 6c7dd72 Compare June 26, 2024 12:55

christoph-zededa force-pushed the pkill_confusion branch from 6c7dd72 to 4f98cca Compare June 26, 2024 13:03

christoph-zededa force-pushed the pkill_confusion branch from 4f98cca to d13f7fa Compare June 27, 2024 09:39

github-actions bot requested a review from naiming-zededa June 27, 2024 09:40

christoph-zededa force-pushed the pkill_confusion branch from d13f7fa to 42c9056 Compare June 27, 2024 10:52

OhmSpectator approved these changes Jun 27, 2024

rouming approved these changes Jun 27, 2024

github-actions bot requested a review from rouming June 28, 2024 17:41

eriknordmark requested changes Jun 29, 2024

OhmSpectator mentioned this pull request Jul 1, 2024

bug: fatal: agent zedbox[1553]: zboot partstate IMGA: err signal: user defined signal 2 #4002

Closed

zboot/exec: retry on USR{1,2}

8c30bf9

Signed-off-by: Christoph Ostarek <christoph@zededa.com>

christoph-zededa force-pushed the pkill_confusion branch from 737f9a4 to 8c30bf9 Compare July 1, 2024 09:42

github-actions bot requested a review from eriknordmark July 1, 2024 09:42

eriknordmark approved these changes Jul 1, 2024

christoph-zededa changed the title from "Pkill confusion" to "Fix Pkill confusion with proper pkill usage and zboot retry" Jul 1, 2024

OhmSpectator reviewed Jul 1, 2024

eriknordmark approved these changes Jul 1, 2024

OhmSpectator approved these changes Jul 1, 2024

eriknordmark merged commit 0b0e4aa into lf-edge:master Jul 1, 2024
72 of 87 checks passed
