-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Pkill confusion with proper pkill usage and zboot retry #4005
Conversation
@OhmSpectator can you have a look if it is okay the way I send USR2 to the memory handler? |
pkg/dom0-ztools/rootfs/bin/eve
Outdated
@@ -100,7 +100,7 @@ http_debug_request() { | |||
fi | |||
|
|||
if [ "$running" = "0" ]; then | |||
pkill -USR2 zedbox | |||
pkill -o -USR2 zedbox |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would there be more than one zedbox process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When zedbox tries to execute f.e. qemu-img
, then zedbox first does a fork()
and then and exec*
. In between there are two, aren't there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how the golang does the Exec - it has some constraints thus it might (need to) block signals across some of this in the golang runtime. FWIW I know have three observed crashes which indicates that something changed in master over the last two days causing the crashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
package main
import (
"fmt"
"os/exec"
)
func main() {
cmd := exec.Cmd{
Path: "/bin/ls",
}
out, err := cmd.Output()
if err != nil {
panic(err)
}
fmt.Printf("%s\n", out)
cmd.Run()
cmd.Wait()
}
child process is 216343
in line 344 I see:
216337 clone(child_stack=NULL, flags=CLONE_VM|CLONE_VFORK|SIGCHLD <unfinished ...>
and soon after (line 370):
216343 rt_sigaction(SIGUSR2, {sa_handler=SIG_DFL, sa_mask=~[], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x469ce0}, NULL, 8) = 0
the signal handler for USR2 gets removed.
in line 472 the exec syscall is finally invoked.
Yeah, I think it should be fine. I'll test in now as well. |
Unfortunately, the busybox version of |
Can this be solved by the zedbox pidfile? Pkill supports this by -F option |
67408df
to
d2e0437
Compare
unfortunately not all of the pkill implementations on the system support it:
|
should be fixed now - it was a confusion as the busybox @OhmSpectator it should work now |
Would it be more robust to do the golang equivalent of kill -SIGX $(cat /run/zedbox.pid) as Roman was suggesting? |
Ok, at least the latest version of the code works fine. |
I don't think we would need the golang equivalent, bash would be fine, wouldn't it?
|
d2e0437
to
6c7dd72
Compare
I changed it ( memory-handler does not have a PID file). |
6c7dd72
to
4f98cca
Compare
Build failed because of:
|
We have new pulls available today. Take them before they run out :) |
4f98cca
to
d13f7fa
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #4005 +/- ##
=======================================
Coverage 17.51% 17.51%
=======================================
Files 3 3
Lines 805 805
=======================================
Hits 141 141
Misses 629 629
Partials 35 35 ☔ View full report in Codecov by Sentry. |
we saw that a child process of zedbox got killed with USR2 (no signal handler defaults to SIGTERM). My theory is that the following happened: 1. zedbox called fork() 2. now there are two zedbox processes 3. pkill finds two zedbox processes 4. child process of zedbox calls exec and remove USR2 handler 5. pkill sends USR2 to both PIDs 6. zedbox starts http debug as usual 7. child process (now supposed to do something different) gets killed perhaps fixes lf-edge#4002 Signed-off-by: Christoph Ostarek <christoph@zededa.com>
d13f7fa
to
42c9056
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good and is should be useful.
However, I have not tested the latest version.
FWIW I wrote a golang program which runs exec.Run while the sigusr2 handler is being invoked, and I don't ever see a failure (it has gone through tens of thousands of iterations today). So I wonder if there is something unique with iinvoking the zboot shell script?? |
I think I can somehow reproduce it:
run this as Then I get:
It did not work with normal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DCO is missing on the ubuntu commit, and that needs to be fixed if we are to build and run eden tests etc on the PR
Oh, I forgot to remove this one. |
Signed-off-by: Christoph Ostarek <christoph@zededa.com>
737f9a4
to
8c30bf9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but please update the description and title to mention the retry of zboot commands.
FWIW I've run this with the kernel setting to trigger on lower pressure without any issues since yesterday. I should check the logs if I got the retry messages from those runs.
Done
I think you will not see those as this PR also changes how we invoke |
@@ -64,8 +64,24 @@ func Poweroff(log *base.LogObject) { | |||
|
|||
// If log is nil there is no logging | |||
func execWithRetry(log *base.LogObject, command string, args ...string) ([]byte, error) { | |||
retrySignals := map[syscall.Signal]struct{}{ | |||
syscall.SIGUSR1: {}, | |||
syscall.SIGUSR2: {}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did we decide these two signals are enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By guessing ;-)
Do you have other signals in mind?
Afaik, these are the only signals that zedbox receives for some kind of IPC outside of wanting zedbox to terminate (and then child processes should be terminated as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are the only signals we generate AFAIK. I'm assuming SIGPROF, if used by the go runtime for profiling, is handled internally by the runtime without any errors returned by exec. Might be good to check that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But in general, does it make sense to use the whitelisting approach here? Would handling all the signals that way be a disadvantage? In the future, we can add new signal handling to the system and forget about this place...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would handling all the signals that way be a disadvantage?
F.e. I think we should not handle SIGSEGV
this way, because that can just end up in an endless loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can use blacklisting in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blindly restarting a syscall which was interrupted by a signal (aynch or sync) seems risky; makes sense to do it for the signals we know are used as part of normal operation and not others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that makes sense. Could we add some kind of reminder to the place in the code where we register the handlers? That way, the next person adding a handler will see that there is one more extra place to be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@OhmSpectator currently they're in agentlog.go
, but I doubt that f.e. a handler for SIGTERM
would end up there, too. Do you mean something like christoph-zededa@26df498 ?
if used by the go runtime for profiling, is handled internally by the runtime without any errors returned by exec
I checked withstrace
; yes it is usingSIGPROF
and it also installs a signal handler:
rt_sigaction(SIGPROF, {sa_handler=0x996920, sa_mask=~[RTMIN RT_1 RT_2], sa_flags=SA_RESTORER|SA_ONSTACK|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fe613a3eaca}, NULL, 8)
(and clone
s are done with CLONE_SIGHAND
)
but isn't it always sending the signal to it's own pid anyways (by design of the syscall)?
timer_create(CLOCK_THREAD_CPUTIME_ID, {sigev_signo=SIGPROF, sigev_notify=SIGEV_THREAD_ID, sigev_notify_thread_id=30794}, [0]) = 0
Yes, if the issue was the window between fork+change signal hander in child process+exec zboot, then I shouldn't see them. FWIW I don't see any "because of signal" in kibana. Will look again in a week and see if any test triggered it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Run eden
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, it looks good, and the issue should be fixed.
Under certain cases (#4005 (comment) )
busybox pkill/pgrep
reports a newly forked subprocess (before it had the change to exec) of zedbox. This leads that the subprocess gets killed with USR2 which leads to termination of the process if there is no USR2 handler - and there is none for some of the subprocesses of zedbox.In order to fix this: