Task Lifecycle PostStop Hook #8193

Closed
11 of 16 tasks
jazzyfresh opened this issue Jun 17, 2020 · 8 comments · Fixed by #8194
Comments

@jazzyfresh
Contributor

jazzyfresh commented Jun 17, 2020

Design

Poststop is intended to be a cleanup hook, i.e. it runs after the main tasks have become terminal.

Poststop tasks run every time, whether:

  • a batch job is exiting normally,
  • an operator is stopping or updating the job,
  • or the main tasks are exiting due to failure.

If a poststop task fails, the allocation is restarted (along with all non-poststop tasks) and the usual allocation failure logic applies. The idea is that poststop tasks give users the ability to catch and handle errors internally within an allocation. For some use cases (e.g. Artifact Upload below), the poststop hook is a critical part of the overall allocation.
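To make the design concrete, here is a minimal sketch of how a poststop task could be declared in a job file, mirroring the existing lifecycle stanza used for prestart tasks; the task name, driver, and image below are placeholders rather than part of the proposal.

```hcl
task "post-processing" {
  # Runs only after the main tasks in the group have become terminal.
  lifecycle {
    hook = "poststop"
  }

  driver = "docker"
  config {
    image   = "busybox:1"
    command = "sh"
    args    = ["-c", "echo 'main tasks are done, running poststop work'"]
  }
}
```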

Potential Future Features:

  • Running poststop tasks only under certain conditions
    • Environment variables containing the exit statuses of the main tasks

Use Case: Cleanup

A service job writes files to a host volume that should be cleaned up on shutdown. The poststop task removes these files.

Since cleanup is best-effort, the cleanup script itself silently ignores errors so that a failed cleanup does not fail the allocation.
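A rough sketch of this use case, assuming the lifecycle syntax above; the host volume name, mount path, and images are hypothetical:

```hcl
group "app" {
  volume "scratch" {
    type   = "host"
    source = "app-scratch"   # hypothetical host volume
  }

  task "server" {
    driver = "docker"
    config {
      image = "example/app:1.0"   # placeholder main task
    }
    volume_mount {
      volume      = "scratch"
      destination = "/data"
    }
  }

  task "cleanup" {
    lifecycle {
      hook = "poststop"
    }
    driver = "docker"
    config {
      image   = "busybox:1"
      command = "sh"
      # Best effort: '|| true' swallows errors so a failed cleanup
      # does not fail the allocation.
      args    = ["-c", "rm -rf /data/* || true"]
    }
    volume_mount {
      volume      = "scratch"
      destination = "/data"
    }
  }
}
```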

Use Case: Artifact upload

A batch job that runs Packer to build a local VM image needs to upload that image to remote storage after building.

While in this use case the poststop uploader does not need to run if Packer fails, it's fine if it attempts to anyway, as it will simply fail to find the local VM image to upload.
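Sketched the same way; the Packer template, the shared alloc-dir path for the built image, and the S3 destination are all assumptions for illustration:

```hcl
group "build" {
  task "packer-build" {
    driver = "exec"
    config {
      command = "packer"
      args    = ["build", "local/image.pkr.hcl"]   # hypothetical template
    }
  }

  task "upload-image" {
    lifecycle {
      hook = "poststop"
    }
    driver = "exec"
    config {
      command = "/bin/sh"
      # If the build failed there is no image, so the copy simply fails,
      # matching the behavior described above.
      args    = ["-c", "aws s3 cp ${NOMAD_ALLOC_DIR}/image.qcow2 s3://my-images/"]
    }
  }
}
```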

Use Case: Event Signalling / Task Group Dependencies

Use a PostStop task to signal some external service via an API request, etc.
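For example (the webhook endpoint is hypothetical):

```hcl
task "notify" {
  lifecycle {
    hook = "poststop"
  }
  driver = "exec"
  config {
    command = "curl"
    # Hypothetical webhook that downstream task groups or services watch.
    args    = ["-X", "POST", "https://hooks.example.com/nomad/group-stopped"]
  }
}
```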

Implementation

  • add poststop case to structs & api packages
  • add poststop case to client/allocrunner/task_hook_coordinator.go
  • add poststop case to alloc resources in structs.go for scheduler allocation placement
  • poststop works for batch jobs
  • poststop works for service jobs
    • nomad job stop --> main tasks stop --> poststop tasks run --> tasks finish --> alloc dies
  • poststop tasks do not restart if they complete successfully
  • account for poststop hooks in scheduler for allocation resource addition (structs.go in the AllocatedResources methods)
  • you can kill poststop tasks (after the job has already received the kill signal)
  • you can fail poststop tasks if they stall
  • unit tests
  • e2e tests
    • e2e test that sleeps forever & is interruptible by a signal
  • docs
@jazzyfresh jazzyfresh self-assigned this Jun 17, 2020
@jazzyfresh jazzyfresh added the theme/dependencies Pull requests that update a dependency file label Jun 17, 2020
@jazzyfresh jazzyfresh reopened this Jun 17, 2020
@jazzyfresh jazzyfresh added this to the 0.12.0-beta2 milestone Jun 19, 2020
@schmichael schmichael modified the milestones: 0.12.0-beta2, 0.12.0 Jun 26, 2020
@jazzyfresh
Contributor Author

jazzyfresh commented Jun 30, 2020

answering questions from #8194 (review)

Should they run if the main tasks fail? Clean up tasks should probably run all the time

PostStop tasks run after all main tasks are dead, regardless of what caused the main tasks to die (completion, kill, or failure). If we want different behavior, we could introduce a PostFail case.

Should they run if nomad job stop is invoked? I think the current implementation would probably not run them

PostStop should run if nomad job stop is invoked in both service and batch job cases.

Should sidecars run concurrently with post-stop tasks? Having sidecars run until the very end makes sense.

Yes, that makes the most sense to me.

  • logging use case: this makes sense
  • proxy use case: this does not make sense
  • can we add a flag to sidecars? (possibly too confusing; if there isn't a clear use case or community request, then don't)

More questions

  • what happens in a restart?
  • what happens in an upgrade?
  • what happens in a restore?
  • what happens in a migration?
  • what happens if the PostStop tasks fail?
    • do we restart the whole allocation or just the PostStop tasks in that alloc?
  • what if the host dies?
    • poststop won't run, but then users should be made aware of this
    • this implies the use case should be something very local to the host & allocation

TODO:

  • community followup for poststop use cases
    • Pre/Post hooks for tasks  #1061
    • cassandra flush prestart example - how would this even work? ask user for more clarity
    • defer adding this feature until we have more clarity around usability
  • file ticket for task lifecycle poststart hook
    • more immediate use case for poststart, prioritize this over poststop

cc @notnoop @schmichael

@jazzyfresh jazzyfresh removed this from the 0.12.0 milestone Jul 1, 2020
@jazzyfresh
Contributor Author

what happens in a restart?

sample use cases

  • consul service registration (poststart) & deregistration (poststop)
    • in this case, we would want poststop to run every time the main task dies, even if it's restarting
    • this doesn't make sense, because there can be multiple main tasks
    • it might make sense to implement a new feature that works with lifecycle but is tied to a specific task; a restart hook could also be added
  • cleanup use case: allocation is about to die
    • this is the target use case for poststop

poststop tasks cannot be sidecars

  • should sidecars run through poststop?
    • I think this should be configurable with a flag
    • why wouldn't you?
      • proxy use case: you expose the main service after it starts, but don't want to expose it after it dies
      • better solution here: a restart hook

@jazzyfresh
Contributor Author

MVP

The Cleanup use case

Right before allocation is going to die (i.e. all the main tasks have stopped & are not being restarted anymore) => Do some stuff

  • poststop restart
    • investigate task-specific restart stanza (this should just work!); see the sketch after this list
    • use default restart policy for job type, but allow it to be configurable
  • poststop failure
    • do we reschedule the whole allocation?
    • the default behavior for when a task fails should work here (difference between use cases for service & batch jobs)
  • allocation migration/node drain - DO run poststop if allocation is migrating
    • if people don't like this, we can expose IS_MIGRATING=1 as an env var, users can tweak behavior of tasks
      • reverse artifact stanza: uploading data to places
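
A sketch of how the restart and migration points above could look at the task level; the restart values are illustrative, IS_MIGRATING is only the proposal from this list (not an existing variable), and the scripts are hypothetical:

```hcl
task "cleanup" {
  lifecycle {
    hook = "poststop"
  }

  # Task-level restart stanza overriding the group default: retry the
  # poststop task a couple of times, then mark it failed instead of
  # looping forever.
  restart {
    attempts = 2
    delay    = "10s"
    mode     = "fail"
  }

  driver = "exec"
  config {
    command = "/bin/sh"
    # Proposed (not yet implemented) env var: upload state elsewhere when
    # the allocation is only migrating, otherwise do normal cleanup.
    args    = ["-c", "if [ \"$IS_MIGRATING\" = \"1\" ]; then ./upload-state.sh; else ./cleanup.sh; fi"]
  }
}
```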

@jazzyfresh
Contributor Author

jazzyfresh commented Aug 25, 2020

Rebasing Shenanigans

I broke nomad client shutdown

I have been rebasing this branch off of the poststart branch and (while poststop does not seem to work yet) there is also an interesting shutdown bug!

It appears to get caught just after garbage collection

Screenshot from 2020-08-25 11-57-42

while waiting

Screenshot from 2020-08-25 12-10-19

for allocrunners to be destroyed (destroy themselves?).

Screenshot from 2020-08-25 12-15-52

(Note: it is ar.Destroy() rather than ar.Shutdown(), since I am running in dev mode.)

I fixed nomad client shutdown

Using the goroutine crash logs (ctrl + \ SIGQUIT dumps the goroutines' stack traces),

Screenshot from 2020-08-25 12-28-25

I was able to track down the tricky bit of code

Screenshot from 2020-08-25 12-29-31

where I basically said

you know all the tasks that have poststop?
yeah so, don't shut them down (ever)

Screenshot from 2020-08-25 12-33-00

So those allocs never get shut down, & as a result the client hangs around forever waiting for them to get shut down.

I removed the poststop part of the conditional from killTasks() and the client shuts down without a problem

Screenshot from 2020-08-25 12-38-09

@jazzyfresh
Contributor Author

Debugging

While I was debugging the prior problem with the client shutdown, I gained some insight on what could be going wrong with poststop functionality

[Resolved] Batch Jobs: poststop tasks are stuck in pending

  • taskStateUpdated() is getting called several times as the main task transitions from running to dead

  • there must be something wrong with the conditional logic for removing tasks from mainTasksRunning...

    • any task in a dead state will not be removed from the set, which is just the opposite of what we want
      Screenshot from 2020-08-26 13-23-29
  • insight needed: TaskStateDead is a terminal state that indicates a task will not be restarting further

[In progress] Service Jobs: poststop tasks need to run after a nomad stop command

something something this code
Screenshot from 2020-08-25 13-53-22
^ ar.taskHookCoordinator.taskStateUpdated(states) is called outside of this whole for-loop

  • poststop tasks never have a chance to run because they receive the kill signal along with the main tasks

@jazzyfresh jazzyfresh added this to the 0.12.4 milestone Aug 31, 2020
@schmichael schmichael linked a pull request Aug 31, 2020 that will close this issue
@jazzyfresh jazzyfresh modified the milestones: 0.12.4, 0.13 Sep 1, 2020
@jazzyfresh jazzyfresh changed the title task lifecycle: add poststop hook Task Lifecycle PostStop Hook Sep 1, 2020
@jazzyfresh
Contributor Author

NOMAD_E2E=1 go test -v . -run 'TestE2E/Lifecycle/\*lifecycle\.LifecycleE2ETest/TestBatchJob'

@jazzyfresh
Contributor Author

jazzyfresh commented Oct 29, 2020

TODO

technical work

  • e2e tests
    • successful case
    • failure case
    • nomad job stop/nomad alloc signal case
  • unit tests
    • resource allocation in scheduler (for the changes to structs/structs.go)
  • signal if the main tasks failed/succeeded in an environment variable
  • refactor would be NICE (not a blocker)
    • if it's doing the right behavior & we have thorough tests, we can merge
    • ideally the code is centralized in the TaskHookCoordinator, not sprinkled throughout the AllocRunner/TaskRunner

design work

  • write up behavior/design specifics --> call them out in the PR description
  • need to get into beta early for community feedback

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 28, 2022