Skip to content

Commit

Permalink
Do not restart a crashed VM within the handleModify function.
Browse files Browse the repository at this point in the history
The `handleModify` function is called when `domainmanager` gets a notification
from another agent that DomainConfig for a VM has been modified (see the
`handleDomainModify` handler).

Among other things, the function handles the situation when the VM was already
activated before but is inactive at the moment. The code for this is executed
regardless of the content of the real difference seen by the other agent (for
example, it could be a disconnected VIF). So, any other agent may trigger this
function. In this case, the function reboots the VM.

In fact, the VM should not be started again from this function in several
cases: if there is an internal error in EVE OS; and if the VM was crashed from
inside of the VM itself. For the first case, there is already a condition in
the handling code (`status.State != types.BROKEN`). And for the second case,
there is the handling code in the other function: `maybeRetryBoot`.

The `maybeRetryBoot` function is called with 10-30 seconds intervals for all
the VMs. It checks the state of once-started VMs and restarts the crashed ones.
But in contrast to the code in the `handleModify` function, it takes into
account the value in the `config.timer.reboot` property. This behavior is
considered the right one, as a user may want to wait for some time before a
crashed VM restarts.

So, it is necessary to eliminate the restart of the crashed VM from the
`handleModify` function, which does not take the timer into account.

Inside the function, we can detect that the VM has been crashed by checking the
values in `status.State` and `status.Error`.

It's known that a crashed VM has `status.State` set to `HALTED`. The `Info`
implementation for containerd sets this state (see the "stopped" -> `HALTED`
mapping there). And it's guaranteed that the container has the state set to
"stopped" because of the handler for the `SHUTDOWN` event in
`qmp.go:qmpEventHandler()` that terminates the container.

And it's also known that a crashed VM has `status.Error` set. The
`verifyStatus` function sets it when it schedules a crashed VM for a reboot.

Worth noting that `status.Error` can be set only in the case of 2 values in
`status.State`: `type.BROKEN` and `type.HALTED`.

It is enough, bearing in mind all the facts described above, to add a check for
`status.HasError()` in the condition in the `handleModify` function to
eliminate the restart of crashed VM earlier than in `config.timer.reboot`
seconds.

Signed-off-by: Nikolay Martyanov <ohmspectator@gmail.com>
  • Loading branch information
OhmSpectator authored and eriknordmark committed May 13, 2022
1 parent c42176f commit 32fa17f
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion pkg/pillar/cmd/domainmgr/domainmgr.go
Original file line number Diff line number Diff line change
Expand Up @@ -1874,7 +1874,8 @@ func handleModify(ctx *domainContext, key string,
publishDomainStatus(ctx, status)

changed := false
if config.Activate && !status.Activated && status.State != types.BROKEN {
// if a VM has an error status, it should be restarted in the maybeRetryBoot function, not here
if config.Activate && !status.Activated && status.State != types.BROKEN && !status.HasError() {
log.Functionf("handleModify(%v) activating for %s",
config.UUIDandVersion, config.DisplayName)

Expand Down

0 comments on commit 32fa17f

Please sign in to comment.