provide -no-shutdown-delay flag on job/alloc stop to bypass kill timeout #11448

tgross · 2021-11-04T19:09:34Z

Some operators use long kill timeouts so that tasks have a long window of time to shutdown cleanly. During incident response, these long kill timeouts can make it hard to mitigate the incident. Provide an -immediate flag on nomad job stop and nomad alloc stop that bypasses the kill timeout. We'll still run allocrunner/taskrunner prestop hooks (because the client state needs that). Maybe an open question would be what should we do with tasks with the poststop lifecycle configuration?

The text was updated successfully, but these errors were encountered:

tgross · 2021-11-29T17:17:08Z

Some interesting bits to figure out here:

Neither nomad job stop or nomad alloc stop communicates with the clients. Like everything else they update the state store and then create evals for the scheduler to act upon:
- nomad job stop applies a JobDeregisterRequestType to the FSM, which either deletes the job entirely from the state store (on -purge) or marks the job's Stop field true. In either code path, we create a new eval which the scheduler uses to sync the job and the clients.
- nomad alloc stop applies a UpdateAllocsDesiredTransitions to the FSM, which updates the alloc with a DesiredTransition to "migrate", "reschedule", or "force reschedule". We create a new eval which the scheduler uses to sync the job and the clients.
There is both a group-level shutdown_delay and a task-level shutdown_delay. We'll need to override both behaviors, but it looks like get merged for the task runner anyways.

My initial thought here for implementation was to update the job so that the shutdown_delay is set to zero. But we end up persisting that change in the jobspec. In the case where the operator runs nomad job run again, they'll have the jobspec file with the original shutdown_delay value locally and that's fine. But in the web UI we have a Start button that sets the definition from .Stop = true to .Stop = false before submitting the job. This would restart the job with the modified shutdown_delay value, which would be bad. (Also, we'd hit the same problem with some future hypothetical nomad job restart command.)

I think the approach here will be to add a new DesiredTransition field (Immediate?) that's set on all allocations touched by the RPC that's been given the -immediate flag. That'll get passed along as usual as part of the Allocation struct to the client, which can read that value at any place it'd be looking at Task.ShutdownDelay. I'll give that a try and see how it looks.

(Also, we might want to name this flag -no-shutdown-delay rather than -immediate so that it's not conflated with the eval priority.)

tgross · 2021-11-29T21:29:47Z

Getting close to the end of my day here so leaving notes for myself: I've got an almost-working branch of nomad alloc stop -immediate up at f-immediate-stop but I've run into a problem with the group shutdown_delay. It turns out the group group service hook does not get merged with the task shutdown_delay but serves a different purpose entirely.

The group service hook waiting on the shutdown_delay was introduced in Nomad 0.10.4 #6746 to allow time for tasks to drain their connections after being deregistered from Consul but before being shut down.

Currently alloc runner hooks don't get their Update hook fired for terminal allocs (ref), so I can't simply add an isImmediate flag to the group service hook and then update its value. But looking at the existing alloc hooks, there are potentially others that could benefit from the group shutdown_delay hook, so I'll probably push that TerminalStatus() check down into the hooks so they can make that decision themselves or react differently in the case where they're terminal.

Edit: a much easier approach will be to expose the allocation update to the alloc runner hook PreKill method which is only implemented by the group service hook so there's very little disruption. I've pushed up a hacky version of that to my branch and should be able to wrap it up Weds.

github-actions · 2022-10-13T02:44:25Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

tgross added type/enhancement theme/cli theme/api HTTP API and SDK issues theme/client labels Nov 4, 2021

tgross self-assigned this Nov 29, 2021

tgross mentioned this issue Dec 1, 2021

provide -no-shutdown-delay flag for job/alloc stop #11596

Merged

tgross changed the title ~~provide -immediate flag on job/alloc stop to bypass kill timeout~~ provide -no-shutdown-delay flag on job/alloc stop to bypass kill timeout Dec 1, 2021

tgross mentioned this issue Dec 7, 2021

Evaluation broker control API #11638

Closed

tgross closed this as completed in #11596 Dec 13, 2021

tgross added this to the 1.2.4 milestone Dec 13, 2021

github-actions bot locked as resolved and limited conversation to collaborators Oct 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

provide -no-shutdown-delay flag on job/alloc stop to bypass kill timeout #11448

provide -no-shutdown-delay flag on job/alloc stop to bypass kill timeout #11448

tgross commented Nov 4, 2021

tgross commented Nov 29, 2021 •

edited

Loading

tgross commented Nov 29, 2021 •

edited

Loading

github-actions bot commented Oct 13, 2022

provide -no-shutdown-delay flag on job/alloc stop to bypass kill timeout #11448

provide -no-shutdown-delay flag on job/alloc stop to bypass kill timeout #11448

Comments

tgross commented Nov 4, 2021

tgross commented Nov 29, 2021 • edited Loading

tgross commented Nov 29, 2021 • edited Loading

github-actions bot commented Oct 13, 2022

tgross commented Nov 29, 2021 •

edited

Loading

tgross commented Nov 29, 2021 •

edited

Loading