(pre-)Stop/Kill action/command #9872

Open
sofixa opened this issue Jan 22, 2021 · 8 comments

Comments

@sofixa
Contributor

sofixa commented Jan 22, 2021

Currently Nomad supports defining a kill signal, and it'd be pretty useful to be able to define pre-stop/kill and stop/kill actions/commands (we can already do post-stop via tasks with lifecycle > hook > poststop).
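For reference, the two mechanisms that exist today can be sketched as follows. This is a minimal, illustrative job fragment: the job, group, and task names and the `example/db:1.0` image are hypothetical, while `kill_signal` and the `poststop` lifecycle hook are the existing features mentioned above.

```hcl
job "example" {
  datacenters = ["dc1"]

  group "db" {
    task "main" {
      driver = "docker"

      # Signal sent to the task when Nomad stops it.
      kill_signal = "SIGINT"

      config {
        image = "example/db:1.0" # hypothetical image
      }
    }

    # Runs only after the main task has stopped, so it cannot
    # act on the main task while that task is still running.
    task "cleanup" {
      lifecycle {
        hook = "poststop"
      }

      driver = "docker"

      config {
        image   = "busybox:1.36"
        command = "sh"
        args    = ["-c", "echo 'main task stopped'"]
      }
    }
  }
}
```

Neither mechanism runs a command against the main task before it receives the kill signal, which is the gap this request describes.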

The main use case I see for this is shutting down complex software/tasks that need actions performed on them for a graceful shutdown, e.g. ScyllaDB recommends running a command (nodetool drain) and then shutting down gracefully before killing the Docker container.

It could also be useful in order to do more graceful drains, for example when doing rolling upgrades (e.g. failing the healthcheck to make the instance inaccessible from Consul/LB before actually shutting it down).

In theory it could be achieved with an additional hook (prestop), but that might cause some issues (e.g. in ScyllaDB's case, the prestop task would need to contain all the tools and configuration to be able to run commands on the ScyllaDB instance running in the main task; and it won't work for this specific case, since they recommend shutting down gracefully via supervisord after draining, and I don't think one can call supervisorctl remotely).

Adrian

@tgross changed the title from "[Feature request] (pre-)Stop/Kill action/command" to "(pre-)Stop/Kill action/command" on Jan 22, 2021

@shcherbachev

Hi!

We would love to have the ability to configure a pre-stop script. It would help us implement smoother upgrades that require load-balancer reconfiguration. Right now our load balancer, coupled with Consul services and consul-template, detects that a service is no longer running on a node and forwards traffic to a different node. But this happens after a small delay and only after the job has been terminated.

If we had a pre-stop script, we could switch the traffic on the load balancer before the job starts to die. This way we wouldn't have to wait for Consul to detect and propagate changes.
Once it's back online we will use the poststart hook to reconfigure the load-balancer to use the local service again.
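The poststart half of this plan is possible with the existing lifecycle hook. A minimal sketch, to be placed inside the group alongside the main task; the task name and the `/usr/local/bin/lb-enable.sh` script are hypothetical stand-ins for whatever reconfigures the load balancer:

```hcl
task "lb-enable" {
  lifecycle {
    # Started once the group's main tasks are running.
    hook    = "poststart"
    # Run once and exit rather than staying up as a sidecar.
    sidecar = false
  }

  driver = "exec"

  config {
    # Hypothetical script that points the load balancer
    # back at the local service instance.
    command = "/usr/local/bin/lb-enable.sh"
  }
}
```

Note that a poststart task is launched when the main task is running, which is not necessarily when it is healthy, so the script may need its own readiness check.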

Also, the prestop hook is the only one missing in the family: prestart, poststart, and poststop are there. Personally, I would add it for the sake of symmetry.

Alex

@mikeblum

mikeblum commented Dec 8, 2021

This is impacting our ability to kick off connection draining for our HAProxy containers running in Nomad - similar to @shcherbachev's use-case. I'll take a look at the code for post-stop and pre-start to get an idea of how pre-stop might work. Stay tuned!

@mikeblum

mikeblum commented Dec 12, 2021

Hi @tgross

Forked and set up a Nomad dev environment (very smooth onboarding; the contrib guide was excellent). After reviewing how pre-start and the other lifecycle hooks are implemented, I have a few questions on the scope of pre-stop:

For reference here are the docs for lifecycle hooks: https://www.nomadproject.io/docs/job-specification/lifecycle#lifecycle-stanza

blog: https://www.hashicorp.com/blog/hashicorp-nomad-task-dependencies

1. Should we support pre-stop for sidecar tasks?

This section of the structs code points to sidecar support for pre-start. If we implemented pre-stop for a sidecar would we expect this to block stopping the parent task? Or would this be considered a non-blocking optional failure such that a pre-stop task with sidecar enabled:

based off of https://www.nomadproject.io/docs/job-specification/lifecycle#init-task-pattern

  task "halt-telemetry" {
    lifecycle {
      hook = "prestop"
      sidecar = true
    }

    driver = "exec"
    config {
      command = "sh"
      args = ["-c", "while nc -z telemetry.service.local.consul 8080; do sleep 1; done"]
    }
  }

  task "main-app" {
    ...
  }


A use case I could think of would be making sure any buffered logs or other crucial data has been shipped off-box to the telemetry service of choice.

2. Are there any UI components we need to update?

Pre-start / Post-stop task hooks have this UX which is quite nice when there are several lifecycle tasks.

image

Could this PR encompass just the Go-side changes?

3. How should task kill timeouts be handled?

Example from nomad job init:

# Controls the timeout between signalling a task it will be killed
# and killing the task. If not set a default is used.
kill_timeout = "20s"

In the example.nomad the kill_timeout applies to the main task - I imagine we'll want to support this for pre-stop just like it works for post-stop today but I'm wondering if there are weird implications to having a kill_timeout on the main and/or pre-stop tasks - who wins?
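To make the question concrete, here is a sketch of how kill_timeout is set per task today, using the existing poststop hook. The group and task names and the command paths are hypothetical. How a future prestop task's timeout would interact with the main task's timeout is exactly the open question and is not shown.

```hcl
group "app" {
  task "main-app" {
    driver = "exec"

    # Grace period for the main task between the kill signal
    # and a forced kill.
    kill_timeout = "20s"

    config {
      command = "/usr/local/bin/main-app" # hypothetical binary
    }
  }

  task "flush-logs" {
    lifecycle {
      hook = "poststop"
    }

    driver = "exec"

    # Each task carries its own kill_timeout; this one applies
    # only to the poststop task.
    kill_timeout = "5s"

    config {
      command = "/usr/local/bin/flush-logs.sh" # hypothetical script
    }
  }
}
```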

Related issues:

Task Lifecycle PostStart Hook: #8366

I'll keep digging into the code but figured I'd pose these higher level Qs to get the 🤔 going.

@liemlhdbeatvn

Our use case is exactly the same as @shcherbachev's. Is there any progress on this?

@jrasell
Member

jrasell commented Aug 15, 2022

Hi @liemlhdbeatvn and others on this issue; this is unfortunately not currently on our near-term roadmap. The team will provide updates as soon as there are any.

@ljb2of3

ljb2of3 commented Aug 24, 2022

I just wanted to drop in and say a prestop feature would be very useful for my use case as well.

Due to the architecture of the system I'm working on, it takes about 10 minutes for traffic to stop flowing to a task once it's removed from our load balancer. It would be great if I could have a prestop job that removes it from the load balancer, then sleeps for 10 minutes before allowing the main task to be stopped.

@ljb2of3

ljb2of3 commented Aug 24, 2022

Of course, as I continue reading the docs... it appears that shutdown_delay will actually meet my needs. @shcherbachev and @liemlhdbeatvn would this work for you as well?

https://www.nomadproject.io/docs/job-specification/group#shutdown_delay
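A minimal sketch of that approach, assuming a Consul-registered service and a 10-minute drain window. The group, service, and task names and the `example/web:1.0` image are hypothetical; `shutdown_delay` is the group-level parameter from the linked docs.

```hcl
group "web" {
  # Wait this long between deregistering the group's services
  # and sending the tasks their shutdown signal.
  shutdown_delay = "10m"

  network {
    port "http" {
      to = 8080
    }
  }

  service {
    name = "web"
    port = "http"
  }

  task "main-app" {
    driver = "docker"

    config {
      image = "example/web:1.0" # hypothetical image
      ports = ["http"]
    }
  }
}
```

This only delays the shutdown after deregistration. It does not run a command, so use cases that need an action performed on the task itself would still need a prestop-style hook.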

With that in mind, I'd still vote that prestop be added for completeness.

@aparfeno

Hello,
First, thank you for a great product. I am adopting it for my use case, a microservice-based warehouse management system.
I wanted to add another voice for this feature.
My use case is:
I have stateful server-client interactions (dialogs with hand-held devices), which I organize with sticky sessions.
During a rolling upgrade, I want to gracefully "transition" these stateful sessions from the node that's shutting down to a new one. This involves warning the user to quickly finish their tasks, waiting for them to do so (i.e. to reach parts of the code where it is safe, from a business point of view, to kill the user's session), and then moving the session by asking the client to forget the sticky cookie, etc.

It is a complicated song-and-dance. So far I've run into two problems:

  1. Nomad cancels the Consul registration when the kill signal is dispatched, which is too soon for me.
  2. On Windows/Java I can't catch the Ctrl-Break signal, and Nomad doesn't respect kill_signal on Windows (there is a ticket for that).

So far I am considering all kinds of crutches to work around the problems above.
Instead, these could be solved cleanly if I could tell my app through a pre-stop script that it is time to shut down. It would interact with users, deal with Consul appropriately, etc.

Thank you,
Alex

Projects
Status: Needs Roadmapping
Development

No branches or pull requests

10 participants