Restarting allocations does not seem to respect lifecycle and shutdown_delay constraints #10578

Open
scyd-cb opened this issue May 13, 2021 · 2 comments
Labels: hcc/cst (Admin - internal); stage/accepted (Confirmed, and intend to work on. No timeline commitment though.); theme/task lifecycle; type/bug

Comments

scyd-cb commented May 13, 2021

Nomad version

Nomad 1.0.2

Operating system and Environment details

CentOS 8

Issue

When restarting a running allocation via the GUI or CLI:

  1. shutdown_delay is not applied to a task when it is killed.
  2. Each task appears to be stopped and started on its own, without applying lifecycle rules (prestart tasks first, etc.) or the leader flag.

Reproduction steps

  1. Have one task group with prestart tasks, a leader task, and tasks that set shutdown_delay.
  2. Restart the allocation.
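
For readers without access to the screenshot, a job along these lines might look like the sketch below. It is not the job file from this report: the task names are taken from the logs further down, while the job name, group name, `raw_exec` driver, `sleep` command, and `sidecar = true` settings are assumptions made only for illustration.

```hcl
# Hypothetical reproduction job; NOT the job file shown in the screenshot.
job "restart-order-repro" {
  datacenters = ["dc1"]

  group "app" {
    # Plain prestart task, expected to start before the main tasks.
    task "prestart" {
      driver = "raw_exec" # assumption: raw_exec is enabled on the client

      lifecycle {
        hook    = "prestart"
        sidecar = true # assumption: long-running, so a restart can reach it
      }

      config {
        command = "/bin/sleep"
        args    = ["3600"]
      }
    }

    # Prestart task that is also the leader and sets a 30s shutdown delay
    # (inferred from the task name in the logs).
    task "prestart_with_shutdown_delay_30_leader" {
      driver         = "raw_exec"
      leader         = true
      shutdown_delay = "30s"

      lifecycle {
        hook    = "prestart"
        sidecar = true
      }

      config {
        command = "/bin/sleep"
        args    = ["3600"]
      }
    }

    # Main task with no lifecycle block.
    task "main_task1" {
      driver = "raw_exec"

      config {
        command = "/bin/sleep"
        args    = ["3600"]
      }
    }
  }
}
```

Once such a job is running, step 2 corresponds to restarting the allocation, for example with `nomad alloc restart <alloc_id>` or from the GUI.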

Expected Result

Expecting Nomad to:

  1. Stop all the tasks, applying shutdown_delay where specified (like a standard allocation stop).
  2. Once all tasks are stopped/dead, start the tasks again, applying lifecycle rules and the leader flag.

Actual Result

Tasks are restarted in no particular order.

Job file (if appropriate)

[Screenshot of the job file: Screen Shot 2021-05-12 at 9 50 04 PM]

If possible please post relevant logs in the issue.

2021-05-13T04:53:29.705Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=main_task2 reason= delay=0s
	client.alloc_runner.task_runner: running exited hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=stats_hook start="2021-05-13 04:53:26.707639186 +0000 UTC m=+35.630485729"
	client.alloc_runner.task_runner: finished exited hooks: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=stats_hook end="2021-05-13 04:53:26.7076722 +0000 UTC m=+35.630518742" duration=33.013µs
	client.alloc_runner.task_runner: running exited hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=consul_services start="2021-05-13 04:53:26.707691538 +0000 UTC m=+35.630538077"
	client.alloc_runner.task_runner: finished exited hooks: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=consul_services end="2021-05-13 04:53:26.707709892 +0000 UTC m=+35.630556439" duration=18.362µs
	client.alloc_runner.task_runner: finished exited hooks: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader end="2021-05-13 04:53:26.707724459 +0000 UTC m=+35.630571004" duration=187.402µs
	client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader reason= delay=0s
	client.alloc_runner.task_runner: setting task state: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader state=pending event=Restarting
	client.alloc_runner.task_runner: running pre kill hooks: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart start="2021-05-13 04:53:26.709188246 +0000 UTC m=+35.632034802"
	client.alloc_runner.task_runner: running prekill hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart name=consul_services start="2021-05-13 04:53:26.709235553 +0000 UTC m=+35.632082116"
	client.alloc_runner.task_runner: finished prekill hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart name=consul_services end="2021-05-13 04:53:26.709265597 +0000 UTC m=+35.632112194" duration=30.078µs
	client.alloc_runner.task_runner: finished pre kill hooks: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart end="2021-05-13 04:53:26.709286298 +0000 UTC m=+35.632132852" duration=98.05µs
	client.alloc_runner: handling task state update: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 done=false
	client.alloc_runner.task_runner: running prestart hooks: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader start="2021-05-13 04:53:26.710412782 +0000 UTC m=+35.633259319"
	client.alloc_runner.task_runner: skipping done prestart hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=validate
	client.alloc_runner.task_runner: running prestart hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=task_dir start="2021-05-13 04:53:26.710592955 +0000 UTC m=+35.633439509"
	client.alloc_runner.task_runner: finished prestart hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=task_dir end="2021-05-13 04:53:26.710622261 +0000 UTC m=+35.633468809" duration=29.3µs
	client.alloc_runner.task_runner: running prestart hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=logmon start="2021-05-13 04:53:26.711265971 +0000 UTC m=+35.634112520"

2021-05-13T04:53:27.708Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart reason= delay=0s
:                                                              :
:                                                              :						
2021-05-13T04:53:28.705Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=main_task1 reason= delay=0s
:                                                              :
:                                                              :
2021-05-13T04:53:29.705Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=main_task2 reason= delay=0s
:                                                              :
:                                                              :
2021-05-13T04:53:30.705Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=main_task3 reason= delay=0s
:                                                              :
:                                                              :
2021-05-13T04:53:31.706Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30 reason= delay=0s

This prevents us from restarting our allocations, since the tasks need to be started and shut down in a specific order.

Thanks for reviewing my ticket.

@drewbailey
Contributor

Hi @SCYD,

This looks related to, or a duplicate of, #9464 and #9841. I'll leave this issue open since I'm not sure whether the others use shutdown_delay, which is something we'll take a look at when addressing those issues.

drewbailey added the stage/accepted label on May 13, 2021

scyd-cb commented May 13, 2021

@drewbailey thanks for accepting this issue. To summarize, an allocation restart should honor these 3 features:

  • lifecycle (specifies when a task is run within the lifecycle of a task group)
  • shutdown_delay (gives in-flight requests time to complete before shutting down)
  • leader flag (if set to true, when the leader task completes, all other tasks within the task group are gracefully shut down)
