Task is started despite skipped pre-start hooks #18659

Closed
xkmwangi opened this issue Oct 4, 2023 · 2 comments · Fixed by #18662
Labels
stage/accepted theme/client type/bug

xkmwangi commented Oct 4, 2023

Nomad version

Operating system and Environment details

Linux

Issue

There is a race between allocation termination and task startup that can cause the task to start without the pre-start hooks having been executed. This may cause the driver to fail during startup because the task directory is not initialized. We suspect other failure modes are possible here (e.g., the task directory may exist while re-discovering allocations during agent restart, but other hooks don't run), but the driver failure is the one that has arisen several times.

Details

The issue surfaced as a driver failure, with the following sequence of events:

  1. Client has an allocation scheduled.
  2. Leader task starts successfully (note the Building Task Directory line).
  3. Client soon after receives a kill for the leader.
  4. Task runner skips pre-start hooks for sidecar because the allocation is terminal.
  5. However, the task runner still starts the sidecar.
  6. The sidecar fails to start because the task directory has not been created.

While this involved a leader and sidecar task, the error can occur with a single task.

Verify that skipping the pre-start hooks causes the observed driver failure

Patch Nomad to unconditionally skip the pre-start hooks:

diff --git a/client/allocrunner/taskrunner/task_runner_hooks.go b/client/allocrunner/taskrunner/task_runner_hooks.go
index 6d71ba7..890134c 100644
--- a/client/allocrunner/taskrunner/task_runner_hooks.go
+++ b/client/allocrunner/taskrunner/task_runner_hooks.go
@@ -190,7 +190,7 @@ func (tr *TaskRunner) emitHookError(err error, hookName string) {
 func (tr *TaskRunner) prestart() error {
        // Determine if the allocation is terminal and we should avoid running
        // prestart hooks.
-       if tr.shouldShutdown() {
+       if true {
                tr.logger.Trace("skipping prestart hooks since allocation is terminal")
                return nil
        }

When attempting to start a simple job, the allocation fails with:

rpc error: code = Unknown desc = failed to create executor: error creating rpc client for executor plugin: Unrecognized remote plugin message: This usually means the plugin was not compiled for this architecture, the plugin is missing dynamic-link libraries necessary to run, the plugin is not executable by this process due to file permissions, or the plugin failed to negotiate the initial go-plugin protocol handshake

Finding the root cause of the raciness

The prestart hooks are only skipped when tr.shouldShutdown() is true:

// Determine if the allocation is terminal and we should avoid running
// prestart hooks.
if tr.shouldShutdown() {
	tr.logger.Trace("skipping prestart hooks since allocation is terminal")
	return nil
}

The relevant portion of the TaskRunner.Run MAIN loop is:

MAIN:
	for !tr.shouldShutdown() {
		if dead {
			break
		}

		select {
		case <-tr.killCtx.Done():
			break MAIN
		case <-tr.shutdownCtx.Done():
			// TaskRunner was told to exit immediately
			return
		case <-tr.startConditionMetCh:
			tr.logger.Debug("lifecycle start condition has been met, proceeding")
			// yay proceed
		}

		// Run the prestart hooks
		if err := tr.prestart(); err != nil {
			tr.logger.Error("prestart failed", "error", err)
			tr.restartTracker.SetStartError(err)
			goto RESTART
		}

		select {
		case <-tr.killCtx.Done():
			break MAIN
		case <-tr.shutdownCtx.Done():
			// TaskRunner was told to exit immediately
			return
		default:
		}

		// Run the task
		if err := tr.runDriver(); err != nil {
			tr.logger.Error("running driver failed", "error", err)
			tr.restartTracker.SetStartError(err)
			goto RESTART
		}

		// Run the poststart hooks
		if err := tr.poststart(); err != nil {
			tr.logger.Error("poststart failed", "error", err)
		}

Hence, for the prestart hooks to be skipped and yet the task still started, it must be the case that:

  • tr.shouldShutdown() is false at the start, so the MAIN loop is entered, but becomes true by the time tr.prestart() is reached.
  • We fall through the second select (the one with the default case), meaning killCtx and shutdownCtx must both not be Done.

For tr.shouldShutdown() to become true, tr.Alloc must have transitioned from a non-terminal status to a terminal status after the loop condition was evaluated but before tr.prestart() was reached:

func (tr *TaskRunner) shouldShutdown() bool {
	alloc := tr.Alloc()
	if alloc.ClientTerminalStatus() {
		return true
	}

	if !tr.IsPoststopTask() && alloc.ServerTerminalStatus() {
		return true
	}

	return false
}

The state of tr.Alloc is updated in

// Update tr.alloc
tr.setAlloc(update, task)

called, via tr.Update, from:

func (ar *allocRunner) handleAllocUpdates() {
	for {
		select {
		case update := <-ar.allocUpdatedCh:
			ar.handleAllocUpdate(update)
		case <-ar.waitCh:
			return
		}
	}
}

// This method sends the updated alloc to Run for serially processing updates.
// If there is already a pending update it will be discarded and replaced by
// the latest update.
func (ar *allocRunner) handleAllocUpdate(update *structs.Allocation) {
	// Detect Stop updates
	stopping := !ar.Alloc().TerminalStatus() && update.TerminalStatus()

	// Update ar.alloc
	ar.setAlloc(update)

	// Run update hooks if not stopping or dead
	if !update.TerminalStatus() {
		if err := ar.update(update); err != nil {
			ar.logger.Error("error running update hooks", "error", err)
		}
	}

	// Update task runners
	for _, tr := range ar.tasks {
		tr.Update(update)
	}

	// If alloc is being terminated, kill all tasks, leader first
	if stopping {
		ar.killTasks()
	}
}

which is itself started as a goroutine from allocRunner.Run:

func (ar *allocRunner) Run() {
	// Close the wait channel on return
	defer close(ar.waitCh)

	// Start the task state update handler
	go ar.handleTaskStateUpdates()

	// Start the alloc update handler
	go ar.handleAllocUpdates()

In handleAllocUpdate, the state of the task runner is first updated and only then are the tasks killed (closing killCtx). However, because handleAllocUpdate runs on a different goroutine from TaskRunner.Run, the following ordering of events is possible and would result in the observed behavior (a standalone sketch of this interleaving follows the list):

  1. The allocation is still non-terminal. A task runner is started and enters into the MAIN loop since tr.shouldShutdown returns false.
  2. An update is received from the server to terminate the allocation. In a separate goroutine, handleAllocUpdate handles this update by calling tr.Update.
  3. In the task runner, we skip the prestart hooks because the allocation is now terminal and tr.shouldShutdown returns true.
  4. tr.Run continues to the second select and, since neither killCtx nor shutdownCtx is yet done, falls through and proceeds to run the driver.
  5. ar.killTasks is only called afterwards, by which point the driver has already been asked to start the task without its prestart hooks having run.
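
To make this interleaving concrete, here is a minimal standalone sketch of the same check-then-act pattern in plain Go. It is illustrative only, not Nomad code; the names and timings are made up, and the sleeps stand in for the ones added in the reproduction patch below.

package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var terminal atomic.Bool
	killCtx, kill := context.WithCancel(context.Background())

	// Loop-entry check: the alloc is still non-terminal, so the "task runner"
	// proceeds (mirrors `for !tr.shouldShutdown()`).
	if terminal.Load() {
		return
	}

	// Plays the role of handleAllocUpdate on another goroutine: mark the
	// alloc terminal first, then kill the tasks only after a delay.
	go func() {
		terminal.Store(true)              // tr.Alloc() now reports terminal
		time.Sleep(50 * time.Millisecond) // the window widened by the repro sleeps
		kill()                            // killCtx only closes here, too late
	}()

	time.Sleep(10 * time.Millisecond) // let the update land before "prestart"

	// prestart(): sees a terminal alloc and skips the hooks.
	if terminal.Load() {
		fmt.Println("skipping prestart hooks since allocation is terminal")
	}

	// The non-blocking select between prestart() and runDriver(): killCtx is
	// not yet done, so we fall through and start the driver anyway.
	select {
	case <-killCtx.Done():
		fmt.Println("killCtx done, not starting task")
		return
	default:
	}

	fmt.Println("driver started without prestart hooks having run") // the bug
}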

Reproduction steps

Without any other patches applied, add sleep statements to widen the window for the race to trigger.

diff --git a/client/allocrunner/alloc_runner.go b/client/allocrunner/alloc_runner.go
index 80224b0..dd9cf8b 100644
--- a/client/allocrunner/alloc_runner.go
+++ b/client/allocrunner/alloc_runner.go
@@ -972,6 +972,8 @@ func (ar *allocRunner) handleAllocUpdate(update *structs.Allocation) {
                tr.Update(update)
        }
  
+       time.Sleep(1 * time.Minute)
+
        // If alloc is being terminated, kill all tasks, leader first
        if stopping {
                ar.killTasks()
diff --git a/client/allocrunner/taskrunner/task_runner.go b/client/allocrunner/taskrunner/task_runner.go
index aae6a31..4b365a5 100644
--- a/client/allocrunner/taskrunner/task_runner.go
+++ b/client/allocrunner/taskrunner/task_runner.go
@@ -590,6 +590,10 @@ MAIN:
                        // yay proceed
                }
  
+               tr.logger.Info("waiting for allocation to become terminal")
+               time.Sleep(30 * time.Second)
+               tr.logger.Info("waking up", tr.shouldShutdown())
+
                // Run the prestart hooks
                if err := tr.prestart(); err != nil {
                        tr.logger.Error("prestart failed", "error", err)


Then,

  1. Submit a job
  2. Wait for the client to receive the job and for the task runner to begin sleeping while waiting for the allocation to become terminal
  3. Stop the job
  4. Wait for the task runner to wake up
  5. Check the job allocation status and find:
Recent Events: 
11:46:44-04:00 
Not Restarting 
Error was unrecoverable 

11:46:44-04:00 
Driver Failure
failed to create executor: error creating rpc client for executor plugin: Unrecognized remote plugin message: This usually means the plugin was not compiled for this architecture, the plugin is missing dynamic-link libraries necessary to run, the plugin is not executable by this process due to file permissions, or the plugin failed to negotiate the initial go-plugin protocol handshake

Expected Result

When the allocation is known to be terminal and pre-start hooks are skipped, we don’t proceed to start the task.
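
One possible shape for such a guard, sketched against the Run loop excerpt above, is to re-check the terminal state after tr.prestart() so that a skipped prestart never falls through to tr.runDriver(). This is illustrative only; the actual change is in #18662 and may take a different approach.

	// Run the prestart hooks
	if err := tr.prestart(); err != nil {
		tr.logger.Error("prestart failed", "error", err)
		tr.restartTracker.SetStartError(err)
		goto RESTART
	}

	// Sketch: if the allocation became terminal while (or before) prestart
	// ran, the hooks were skipped and the task directory may not exist, so
	// do not proceed to run the driver.
	if tr.shouldShutdown() {
		break MAIN
	}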

Actual Result

The task is started without a properly prepared runtime environment, causing the driver to fail.

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

[0]

18:46:15.760+0800 [DEBUG] client.driver_mgr.custom-driver: plugin started: driver=custom pid=22454 task_name=workload @module=custom.executor alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c path=/usr/local/home/nomad-user/nomad/data/plugins/custom-driver timestamp="18:46:15.760+0800" 
18:46:15.760+0800 [DEBUG] client.driver_mgr.custom-driver: waiting for RPC address: driver=custom @module=custom.executor alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c path=/usr/local/home/nomad-user/nomad/data/plugins/custom-driver task_name=workload timestamp="18:46:15.760+0800" 
18:46:15.769+0800 [DEBUG] client.driver_mgr.custom-driver: using plugin: driver=custom @module=custom.executor alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c task_name=workload version=2 timestamp="18:46:15.769+0800"
18:46:17.418+0800 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c task=workload type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false 
18:46:19.068+0800 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c task=workload type=Started msg="Task started by client" failed=false 
18:46:19.074+0800 [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c task=sidecar
18:46:19.077+0800 [INFO] client.driver_mgr.custom-driver: starting task: driver=custom @module=custom driver_cfg="{Command:/usr/sbin/server Args:[start -fg -mode prod -rundir /tmp] RunAs:nomad-user}" timestamp="18:46:19.075+0800" 
18:46:19.077+0800 [DEBUG] client.driver_mgr.custom-driver: starting plugin: driver=custom task_name=sidecar @module=custom.executor alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c args=[/usr/local/home/nomad-user/nomad/data/plugins/custom-driver, executor, {"LogFile":"/usr/local/home/nomad-user/nomad/data/alloc/64036223-190b-2bd8-bbe7-3a1349770b1c/sidecar/executor.out","LogLevel":"debug","FSIsolation":true}] path=/usr/local/home/nomad-user/nomad/data/plugins/custom-driver timestamp="18:46:19.075+0800" 
18:46:19.077+0800 [DEBUG] client.driver_mgr.custom-driver: plugin started: driver=custom alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c path=/usr/local/home/nomad-user/nomad/data/plugins/custom-driver pid=22495 task_name=sidecar @module=custom.executor timestamp="18:46:19.075+0800" 
18:46:19.077+0800 [DEBUG] client.driver_mgr.custom-driver: waiting for RPC address: driver=custom @module=custom.executor alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c path=/usr/local/home/nomad-user/nomad/data/plugins/custom-driver task_name=sidecar timestamp="18:46:19.075+0800" 
18:46:19.083+0800 [DEBUG] client.driver_mgr.custom-driver: 18:46:19.083+0800 [ERROR] open /usr/local/home/nomad-user/nomad/data/alloc/64036223-190b-2bd8-bbe7-3a1349770b1c/sidecar/executor.out: no such file or directory: driver=custom alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c task_name=sidecar @module=custom.executor.custom-driver timestamp="18:46:19.083+0800" 
18:46:19.085+0800 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c task=sidecar type="Driver Failure"  

lgfa29 commented Oct 4, 2023

Thanks for the detailed report @xkmwangi.

I've opened #18662 to address this problem 🙂

stswidwinski commented

Thank you for such a fast turn-around :)
