Task is started despite skipped pre-start hooks #18659

Closed
xkmwangi opened this issue Oct 4, 2023 · 2 comments · Fixed by #18662
Labels
stage/accepted theme/client type/bug

xkmwangi commented Oct 4, 2023

Nomad version

Operating system and Environment details

Linux

Issue

There is a race between allocation termination and task startup that can cause the task to start without the pre-start hooks having been executed. This may cause the driver to fail during startup because the task directory is not initialized. We suspect other failure modes are possible here (e.g., the task directory may exist while re-discovering allocations during agent restart, but other hooks don't run), but the driver failure is the one that has arisen several times.

Details

The issue surfaced as a driver failure, with the following sequence of events:

  1. Client has an allocation scheduled.
  2. Leader task starts successfully (note the Building Task Directory line).
  3. Client soon after receives a kill for the leader.
  4. Task runner skips pre-start hooks for sidecar because the allocation is terminal.
  5. However, the task runner still starts the sidecar.
  6. The sidecar fails to start because the task directory has not been created.

While this involved a leader and sidecar task, the error can occur with a single task.

Verify that skipping the pre-start hooks causes the observed driver failure

Patch Nomad to unconditionally skip the pre-start hooks:

diff --git a/client/allocrunner/taskrunner/task_runner_hooks.go b/client/allocrunner/taskrunner/task_runner_hooks.go
index 6d71ba7..890134c 100644
--- a/client/allocrunner/taskrunner/task_runner_hooks.go
+++ b/client/allocrunner/taskrunner/task_runner_hooks.go
@@ -190,7 +190,7 @@ func (tr *TaskRunner) emitHookError(err error, hookName string) {
 func (tr *TaskRunner) prestart() error {
        // Determine if the allocation is terminal and we should avoid running
        // prestart hooks.
-       if tr.shouldShutdown() {
+       if true {
                tr.logger.Trace("skipping prestart hooks since allocation is terminal")
                return nil
        }

When attempting to start a simple job, the allocation fails with:

rpc error: code = Unknown desc = failed to create executor: error creating rpc client for executor plugin: Unrecognized remote plugin message: This usually means the plugin was not compiled for this architecture, the plugin is missing dynamic-link libraries necessary to run, the plugin is not executable by this process due to file permissions, or the plugin failed to negotiate the initial go-plugin protocol handshake

Finding the root cause of the raciness

The prestart hooks are only skipped when tr.shouldShutdown() is true:

// Determine if the allocation is terminal and we should avoid running
// prestart hooks.
if tr.shouldShutdown() {
	tr.logger.Trace("skipping prestart hooks since allocation is terminal")
	return nil
}

The relevant portion of the TaskRunner.Run MAIN loop is:

MAIN:
	for !tr.shouldShutdown() {
		if dead {
			break
		}

		select {
		case <-tr.killCtx.Done():
			break MAIN
		case <-tr.shutdownCtx.Done():
			// TaskRunner was told to exit immediately
			return
		case <-tr.startConditionMetCh:
			tr.logger.Debug("lifecycle start condition has been met, proceeding")
			// yay proceed
		}

		// Run the prestart hooks
		if err := tr.prestart(); err != nil {
			tr.logger.Error("prestart failed", "error", err)
			tr.restartTracker.SetStartError(err)
			goto RESTART
		}

		select {
		case <-tr.killCtx.Done():
			break MAIN
		case <-tr.shutdownCtx.Done():
			// TaskRunner was told to exit immediately
			return
		default:
		}

		// Run the task
		if err := tr.runDriver(); err != nil {
			tr.logger.Error("running driver failed", "error", err)
			tr.restartTracker.SetStartError(err)
			goto RESTART
		}

		// Run the poststart hooks
		if err := tr.poststart(); err != nil {
			tr.logger.Error("poststart failed", "error", err)
		}

Hence, for the prestart hooks to be skipped and yet the task still started, it must be the case that:

  • tr.shouldShutdown() is false at the start, so the MAIN loop is entered, but becomes true by the time tr.prestart() is reached.
  • We fall through the second select (the one with the default case), meaning killCtx and shutdownCtx must both not be Done.

For tr.shouldShutdown() to become true, tr.Alloc must have transitioned from a non-terminal status to a terminal status after the loop condition was evaluated but before tr.prestart() was reached:

func (tr *TaskRunner) shouldShutdown() bool {
	alloc := tr.Alloc()
	if alloc.ClientTerminalStatus() {
		return true
	}

	if !tr.IsPoststopTask() && alloc.ServerTerminalStatus() {
		return true
	}

	return false
}

The state of tr.Alloc is updated in

// Update tr.alloc
tr.setAlloc(update, task)

called, via tr.Update, from:

func (ar *allocRunner) handleAllocUpdates() {
	for {
		select {
		case update := <-ar.allocUpdatedCh:
			ar.handleAllocUpdate(update)
		case <-ar.waitCh:
			return
		}
	}
}

// This method sends the updated alloc to Run for serially processing updates.
// If there is already a pending update it will be discarded and replaced by
// the latest update.
func (ar *allocRunner) handleAllocUpdate(update *structs.Allocation) {
	// Detect Stop updates
	stopping := !ar.Alloc().TerminalStatus() && update.TerminalStatus()

	// Update ar.alloc
	ar.setAlloc(update)

	// Run update hooks if not stopping or dead
	if !update.TerminalStatus() {
		if err := ar.update(update); err != nil {
			ar.logger.Error("error running update hooks", "error", err)
		}
	}

	// Update task runners
	for _, tr := range ar.tasks {
		tr.Update(update)
	}

	// If alloc is being terminated, kill all tasks, leader first
	if stopping {
		ar.killTasks()
	}
}

which is itself started as a goroutine from allocRunner.Run:

func (ar *allocRunner) Run() {
	// Close the wait channel on return
	defer close(ar.waitCh)

	// Start the task state update handler
	go ar.handleTaskStateUpdates()

	// Start the alloc update handler
	go ar.handleAllocUpdates()

In handleAllocUpdate, the state of the task runner is first updated and only then are the tasks killed (closing killCtx). However, because handleAllocUpdate runs on a different goroutine from TaskRunner.Run, the following ordering of events is possible and would result in the observed behavior (a standalone sketch of this interleaving follows the list):

  1. The allocation is still non-terminal. A task runner is started and enters into the MAIN loop since tr.shouldShutdown returns false.
  2. An update is received from the server to terminate the allocation. In a separate goroutine, handleAllocUpdate handles this update by calling tr.Update.
  3. In the task runner, we skip the prestart hooks because the allocation is now terminal and tr.shouldShutdown returns true.
  4. tr.Run continues to the second select and, since neither killCtx nor shutdownCtx is yet done, falls through and proceeds to run the driver.
  5. ar.killTasks is only called afterwards, by which point the driver has already been asked to start the task without its prestart hooks having run.
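
To make this interleaving concrete, here is a minimal standalone sketch of the same check-then-act pattern in plain Go. It is illustrative only, not Nomad code; the names and timings are made up, and the sleeps stand in for the ones added in the reproduction patch below.

package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var terminal atomic.Bool
	killCtx, kill := context.WithCancel(context.Background())

	// Loop-entry check: the alloc is still non-terminal, so the "task runner"
	// proceeds (mirrors `for !tr.shouldShutdown()`).
	if terminal.Load() {
		return
	}

	// Plays the role of handleAllocUpdate on another goroutine: mark the
	// alloc terminal first, then kill the tasks only after a delay.
	go func() {
		terminal.Store(true)              // tr.Alloc() now reports terminal
		time.Sleep(50 * time.Millisecond) // the window widened by the repro sleeps
		kill()                            // killCtx only closes here, too late
	}()

	time.Sleep(10 * time.Millisecond) // let the update land before "prestart"

	// prestart(): sees a terminal alloc and skips the hooks.
	if terminal.Load() {
		fmt.Println("skipping prestart hooks since allocation is terminal")
	}

	// The non-blocking select between prestart() and runDriver(): killCtx is
	// not yet done, so we fall through and start the driver anyway.
	select {
	case <-killCtx.Done():
		fmt.Println("killCtx done, not starting task")
		return
	default:
	}

	fmt.Println("driver started without prestart hooks having run") // the bug
}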

Reproduction steps

Without any other patches applied, add sleep statements to widen the window for the race to trigger.

diff --git a/client/allocrunner/alloc_runner.go b/client/allocrunner/alloc_runner.go
index 80224b0..dd9cf8b 100644
--- a/client/allocrunner/alloc_runner.go
+++ b/client/allocrunner/alloc_runner.go
@@ -972,6 +972,8 @@ func (ar *allocRunner) handleAllocUpdate(update *structs.Allocation) {
                tr.Update(update)
        }
  
+       time.Sleep(1 * time.Minute)
+
        // If alloc is being terminated, kill all tasks, leader first
        if stopping {
                ar.killTasks()
diff --git a/client/allocrunner/taskrunner/task_runner.go b/client/allocrunner/taskrunner/task_runner.go
index aae6a31..4b365a5 100644
--- a/client/allocrunner/taskrunner/task_runner.go
+++ b/client/allocrunner/taskrunner/task_runner.go
@@ -590,6 +590,10 @@ MAIN:
                        // yay proceed
                }
  
+               tr.logger.Info("waiting for allocation to become terminal")
+               time.Sleep(30 * time.Second)
+               tr.logger.Info("waking up", tr.shouldShutdown())
+
                // Run the prestart hooks
                if err := tr.prestart(); err != nil {
                        tr.logger.Error("prestart failed", "error", err)


Then,

  1. Submit a job
  2. Wait for the client to receive the job and for the task runner to begin sleeping while waiting for the allocation to become terminal
  3. Stop the job
  4. Wait for the task runner to wake up
  5. Check the job allocation status and find:
Recent Events: 
11:46:44-04:00 
Not Restarting 
Error was unrecoverable 

11:46:44-04:00 
Driver Failure
failed to create executor: error creating rpc client for executor plugin: Unrecognized remote plugin message: This usually means the plugin was not compiled for this architecture, the plugin is missing dynamic-link libraries necessary to run, the plugin is not executable by this process due to file permissions, or the plugin failed to negotiate the initial go-plugin protocol handshake

Expected Result

When the allocation is known to be terminal and pre-start hooks are skipped, we don’t proceed to start the task.
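
One possible shape for such a guard, sketched against the Run loop excerpt above, is to re-check the terminal state after tr.prestart() so that a skipped prestart never falls through to tr.runDriver(). This is illustrative only; the actual change is in #18662 and may take a different approach.

	// Run the prestart hooks
	if err := tr.prestart(); err != nil {
		tr.logger.Error("prestart failed", "error", err)
		tr.restartTracker.SetStartError(err)
		goto RESTART
	}

	// Sketch: if the allocation became terminal while (or before) prestart
	// ran, the hooks were skipped and the task directory may not exist, so
	// do not proceed to run the driver.
	if tr.shouldShutdown() {
		break MAIN
	}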

Actual Result

The task is started without a properly prepared runtime environment, causing the driver to fail.

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

[0]

18:46:15.760+0800 [DEBUG] client.driver_mgr.custom-driver: plugin started: driver=custom pid=22454 task_name=workload @module=custom.executor alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c path=/usr/local/home/nomad-user/nomad/data/plugins/custom-driver timestamp="18:46:15.760+0800" 
18:46:15.760+0800 [DEBUG] client.driver_mgr.custom-driver: waiting for RPC address: driver=custom @module=custom.executor alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c path=/usr/local/home/nomad-user/nomad/data/plugins/custom-driver task_name=workload timestamp="18:46:15.760+0800" 
18:46:15.769+0800 [DEBUG] client.driver_mgr.custom-driver: using plugin: driver=custom @module=custom.executor alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c task_name=workload version=2 timestamp="18:46:15.769+0800"
18:46:17.418+0800 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c task=workload type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false 
18:46:19.068+0800 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c task=workload type=Started msg="Task started by client" failed=false 
18:46:19.074+0800 [DEBUG] client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c task=sidecar
18:46:19.077+0800 [INFO] client.driver_mgr.custom-driver: starting task: driver=custom @module=custom driver_cfg="{Command:/usr/sbin/server Args:[start -fg -mode prod -rundir /tmp] RunAs:nomad-user}" timestamp="18:46:19.075+0800" 
18:46:19.077+0800 [DEBUG] client.driver_mgr.custom-driver: starting plugin: driver=custom task_name=sidecar @module=custom.executor alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c args=[/usr/local/home/nomad-user/nomad/data/plugins/custom-driver, executor, {"LogFile":"/usr/local/home/nomad-user/nomad/data/alloc/64036223-190b-2bd8-bbe7-3a1349770b1c/sidecar/executor.out","LogLevel":"debug","FSIsolation":true}] path=/usr/local/home/nomad-user/nomad/data/plugins/custom-driver timestamp="18:46:19.075+0800" 
18:46:19.077+0800 [DEBUG] client.driver_mgr.custom-driver: plugin started: driver=custom alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c path=/usr/local/home/nomad-user/nomad/data/plugins/custom-driver pid=22495 task_name=sidecar @module=custom.executor timestamp="18:46:19.075+0800" 
18:46:19.077+0800 [DEBUG] client.driver_mgr.custom-driver: waiting for RPC address: driver=custom @module=custom.executor alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c path=/usr/local/home/nomad-user/nomad/data/plugins/custom-driver task_name=sidecar timestamp="18:46:19.075+0800" 
18:46:19.083+0800 [DEBUG] client.driver_mgr.custom-driver: 18:46:19.083+0800 [ERROR] open /usr/local/home/nomad-user/nomad/data/alloc/64036223-190b-2bd8-bbe7-3a1349770b1c/sidecar/executor.out: no such file or directory: driver=custom alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c task_name=sidecar @module=custom.executor.custom-driver timestamp="18:46:19.083+0800" 
18:46:19.085+0800 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=64036223-190b-2bd8-bbe7-3a1349770b1c task=sidecar type="Driver Failure"  

lgfa29 commented Oct 4, 2023

Thanks for the detailed report @xkmwangi.

I've opened #18662 to address this problem 🙂

stswidwinski commented

Thank you for such a fast turn-around :)
