Task Lifecycle PostStop Hook #8193

Closed
11 of 16 tasks
jazzyfresh opened this issue Jun 17, 2020 · 8 comments · Fixed by #8194
Comments

@jazzyfresh
Contributor

jazzyfresh commented Jun 17, 2020

Design

Poststop is intended to be a cleanup hook, i.e. it runs after the main tasks have become terminal.

Poststop tasks run every time, whether:

  • a batch job is exiting normally,
  • an operator is stopping or updating the job,
  • or the main tasks are exiting due to failure.

If a poststop task fails, the allocation is restarted (along with all non-poststop tasks) and the usual allocation failure logic applies. The idea is that poststop tasks give users the ability to catch and handle errors internally within an allocation. For some use cases (e.g. Artifact Upload below), the poststop hook is a critical part of the overall allocation.
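To make the design concrete, here is a minimal sketch of how a poststop task could be declared in a job file, mirroring the existing lifecycle stanza used for prestart tasks; the task name, driver, and image below are placeholders rather than part of the proposal.

```hcl
task "post-processing" {
  # Runs only after the main tasks in the group have become terminal.
  lifecycle {
    hook = "poststop"
  }

  driver = "docker"
  config {
    image   = "busybox:1"
    command = "sh"
    args    = ["-c", "echo 'main tasks are done, running poststop work'"]
  }
}
```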

Potential Future Features:

  • Running poststop tasks only under certain conditions
    • Environment variables containing the exit statuses of the main tasks

Use Case: Cleanup

A service job writes files to a host volume that should be cleaned up on shutdown. The poststop task removes these files.

Since cleanup is best-effort, the cleanup script itself silently ignores errors so that a failed cleanup does not fail the allocation.
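A rough sketch of this use case, assuming the lifecycle syntax above; the host volume name, mount path, and images are hypothetical:

```hcl
group "app" {
  volume "scratch" {
    type   = "host"
    source = "app-scratch"   # hypothetical host volume
  }

  task "server" {
    driver = "docker"
    config {
      image = "example/app:1.0"   # placeholder main task
    }
    volume_mount {
      volume      = "scratch"
      destination = "/data"
    }
  }

  task "cleanup" {
    lifecycle {
      hook = "poststop"
    }
    driver = "docker"
    config {
      image   = "busybox:1"
      command = "sh"
      # Best effort: '|| true' swallows errors so a failed cleanup
      # does not fail the allocation.
      args    = ["-c", "rm -rf /data/* || true"]
    }
    volume_mount {
      volume      = "scratch"
      destination = "/data"
    }
  }
}
```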

Use Case: Artifact upload

A batch job that runs Packer to build a local VM image needs to upload that image to remote storage after building.

While in this use case the poststop uploader does not need to run if Packer fails, it's fine if it attempts to anyway, as it will simply fail to find the local VM image to upload.
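Sketched the same way; the Packer template, the shared alloc-dir path for the built image, and the S3 destination are all assumptions for illustration:

```hcl
group "build" {
  task "packer-build" {
    driver = "exec"
    config {
      command = "packer"
      args    = ["build", "local/image.pkr.hcl"]   # hypothetical template
    }
  }

  task "upload-image" {
    lifecycle {
      hook = "poststop"
    }
    driver = "exec"
    config {
      command = "/bin/sh"
      # If the build failed there is no image, so the copy simply fails,
      # matching the behavior described above.
      args    = ["-c", "aws s3 cp ${NOMAD_ALLOC_DIR}/image.qcow2 s3://my-images/"]
    }
  }
}
```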

Use Case: Event Signalling / Task Group Dependencies

Use a PostStop task to signal some external service via an API request, etc.
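For example (the webhook endpoint is hypothetical):

```hcl
task "notify" {
  lifecycle {
    hook = "poststop"
  }
  driver = "exec"
  config {
    command = "curl"
    # Hypothetical webhook that downstream task groups or services watch.
    args    = ["-X", "POST", "https://hooks.example.com/nomad/group-stopped"]
  }
}
```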

Implementation

  • add poststop case to structs & api packages
  • add poststop case to client/allocrunner/task_hook_coordinator.go
  • add poststop case to alloc resources in structs.go for scheduler allocation placement
  • poststop works for batch jobs
  • poststop works for service jobs
    • nomad job stop --> main tasks stop --> poststop tasks run --> tasks finish --> alloc dies
  • poststop tasks do not restart if they complete successfully
  • account for poststop hooks in scheduler for allocation resource addition (structs.go in the AllocatedResources methods)
  • you can kill poststop tasks (after the job has already received the kill signal)
  • you can fail poststop tasks if they stall
  • unit tests
  • e2e tests
    • e2e test that sleeps forever & is interruptible by a signal
  • docs
@jazzyfresh jazzyfresh self-assigned this Jun 17, 2020
@jazzyfresh jazzyfresh added the theme/dependencies Pull requests that update a dependency file label Jun 17, 2020
@jazzyfresh jazzyfresh reopened this Jun 17, 2020
@jazzyfresh jazzyfresh added this to the 0.12.0-beta2 milestone Jun 19, 2020
@schmichael schmichael modified the milestones: 0.12.0-beta2, 0.12.0 Jun 26, 2020
@jazzyfresh
Contributor Author

jazzyfresh commented Jun 30, 2020

answering questions from #8194 (review)

Should they run if the main tasks fail? Clean up tasks should probably run all the time

PostStop tasks run after all main tasks are dead, regardless of what caused the main tasks to die (completion, kill, or failure). If we want different behavior, we could introduce a PostFail case.

Should they run if nomad job stop is invoked? I think the current implementation would probably not run them

PostStop should run if nomad job stop is invoked in both service and batch job cases.

Should sidecars run concurrently with post-stop tasks? Having sidecars run until the very end makes sense.

Yes, that makes the most sense to me.

  • logging use case: this makes sense
  • proxy use case: this does not make sense
  • can we add a flag to sidecars? (possibly too confusing; if there isn't a clear use case or community request, then don't)

More questions

  • what happens in a restart?
  • what happens in an upgrade?
  • what happens in a restore?
  • what happens in a migration?
  • what happens if the PostStop tasks fail?
    • do we restart the whole allocation or just the PostStop tasks in that alloc?
  • what if the host dies?
    • poststop won't run, but then users should be made aware of this
    • this implies the use case should be something very local to the host & allocation

TODO:

  • community followup for poststop use cases
    • Pre/Post hooks for tasks  #1061
    • cassandra flush prestart example - how would this even work? ask user for more clarity
    • defer adding this feature until we have more clarity around usability
  • file ticket for task lifecycle poststart hook
    • more immediate use case for poststart, prioritize this over poststop

cc @notnoop @schmichael

@jazzyfresh jazzyfresh removed this from the 0.12.0 milestone Jul 1, 2020
@jazzyfresh
Contributor Author

what happens in a restart?

sample use cases

  • consul service registration (poststart) & deregistration (poststop)
    • in this case, we would want poststop to run every time the main task dies, even if it's restarting
    • this doesn't make sense, because there can be multiple main tasks
    • it might make sense to implement a new feature that works with lifecycle but is tied to a specific task; a restart hook could also be added
  • cleanup use case: allocation is about to die
    • this is the target use case for poststop

poststop tasks cannot be sidecars

  • should sidecars run through poststop?
    • I think this should be configurable with a flag
    • why wouldn't you?
      • proxy use case: you expose the main service after it starts, but don't want to expose it after it dies
      • better solution here: a restart hook

@jazzyfresh
Contributor Author

MVP

The Cleanup use case

Right before allocation is going to die (i.e. all the main tasks have stopped & are not being restarted anymore) => Do some stuff

  • poststop restart
    • investigate task-specific restart stanza (this should just work!); see the sketch after this list
    • use default restart policy for job type, but allow it to be configurable
  • poststop failure
    • do we reschedule the whole allocation?
    • the default behavior for when a task fails should work here (difference between use cases for service & batch jobs)
  • allocation migration/node drain - DO run poststop if allocation is migrating
    • if people don't like this, we can expose IS_MIGRATING=1 as an env var, users can tweak behavior of tasks
      • reverse artifact stanza: uploading data to places
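
A sketch of how the restart and migration points above could look at the task level; the restart values are illustrative, IS_MIGRATING is only the proposal from this list (not an existing variable), and the scripts are hypothetical:

```hcl
task "cleanup" {
  lifecycle {
    hook = "poststop"
  }

  # Task-level restart stanza overriding the group default: retry the
  # poststop task a couple of times, then mark it failed instead of
  # looping forever.
  restart {
    attempts = 2
    delay    = "10s"
    mode     = "fail"
  }

  driver = "exec"
  config {
    command = "/bin/sh"
    # Proposed (not yet implemented) env var: upload state elsewhere when
    # the allocation is only migrating, otherwise do normal cleanup.
    args    = ["-c", "if [ \"$IS_MIGRATING\" = \"1\" ]; then ./upload-state.sh; else ./cleanup.sh; fi"]
  }
}
```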

@jazzyfresh
Contributor Author

jazzyfresh commented Aug 25, 2020

Rebasing Shenanigans

I broke nomad client shutdown

I have been rebasing this branch off of the poststart branch and (while poststop does not seem to work yet) there is also an interesting shutdown bug!

It appears to get caught just after garbage collection

Screenshot from 2020-08-25 11-57-42

while waiting

Screenshot from 2020-08-25 12-10-19

for allocrunners to be destroyed (destroy themselves?).

Screenshot from 2020-08-25 12-15-52

(Note: it is ar.Destroy() rather than ar.Shutdown(), since I am running in dev mode.)

I fixed nomad client shutdown

Using the goroutine crash logs (ctrl + \ SIGQUIT dumps the goroutines' stack traces),

Screenshot from 2020-08-25 12-28-25

I was able to track down the tricky bit of code

Screenshot from 2020-08-25 12-29-31

where I basically said

you know all the tasks that have poststop?
yeah so, don't shut them down (ever)

Screenshot from 2020-08-25 12-33-00

So those allocs never get shut down, & as a result the client hangs around forever waiting for them to get shut down.

I removed the poststop part of the conditional from killTasks() and the client shuts down without a problem

Screenshot from 2020-08-25 12-38-09

@jazzyfresh
Contributor Author

Debugging

While I was debugging the prior problem with the client shutdown, I gained some insight on what could be going wrong with poststop functionality

[Resolved] Batch Jobs: poststop tasks are stuck in pending

  • taskStateUpdated() is getting called several times as the main task transitions from running to dead

  • there must be something wrong with the conditional logic for removing tasks from mainTasksRunning...

    • any task in a dead state will not be removed from the set, which is just the opposite of what we want
      Screenshot from 2020-08-26 13-23-29
  • insight needed: TaskStateDead is a terminal state that indicates a task will not be restarting further

[In progress] Service Jobs: poststop tasks need to run after a nomad stop command

something something this code
Screenshot from 2020-08-25 13-53-22
^ ar.taskHookCoordinator.taskStateUpdated(states) is called outside of this whole for-loop

  • poststop tasks never have a chance to run because they receive the kill signal along with the main tasks

@jazzyfresh jazzyfresh added this to the 0.12.4 milestone Aug 31, 2020
@schmichael schmichael linked a pull request Aug 31, 2020 that will close this issue
@jazzyfresh jazzyfresh modified the milestones: 0.12.4, 0.13 Sep 1, 2020
@jazzyfresh jazzyfresh changed the title task lifecycle: add poststop hook Task Lifecycle PostStop Hook Sep 1, 2020
@jazzyfresh
Contributor Author

NOMAD_E2E=1 go test -v . -run 'TestE2E/Lifecycle/\*lifecycle\.LifecycleE2ETest/TestBatchJob'

@jazzyfresh
Contributor Author

jazzyfresh commented Oct 29, 2020

TODO

technical work

  • e2e tests
    • successful case
    • failure case
    • nomad job stop/nomad alloc signal case
  • unit tests
    • resource allocation in scheduler (for the changes to structs/structs.go)
  • signal if the main tasks failed/succeeded in an environment variable
  • refactor would be NICE (not a blocker)
    • if it's doing the right behavior & we have thorough tests, we can merge
    • ideally the code is centralized in the TaskHookCoordinator, not sprinkled throughout the AllocRunner/TaskRunner

design work

  • write up behavior/design specifics --> call them out in the PR description
  • need to get into beta early for community feedback

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 28, 2022