
Long running jobs exiting on docker 23+ #17499

Closed
srisch opened this issue Jun 12, 2023 · 2 comments
Labels
hcc/cst (Admin - internal), stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/driver/docker, type/bug
Milestone

Comments


srisch commented Jun 12, 2023

Nomad version

Nomad v1.5.5
BuildDate 2023-05-05T12:50:14Z
Revision 3d63bc6

Operating system and Environment details

Ubuntu 20.04

Issue

A bug appears to exist with Docker 23+: long-running jobs that are shut down exit before their configured shutdown timeouts. Docker 20.10 behaves as expected.

Reproduction steps

1. Create a long-running job and set `kill_timeout = "1h"`.
2. Scale the service down to 0 to emulate a graceful shutdown. The task is instead terminated after roughly 10 minutes, apparently because the request to the Docker stop endpoint times out.
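For reference, a minimal jobspec sketch with the relevant setting (the job, task, and image names here are placeholders, not from the original report):

```hcl
job "longrunner" {
  group "app" {
    task "sometask" {
      driver = "docker"

      # Give the task up to an hour to shut down gracefully after the
      # interrupt signal, before Nomad force-kills it.
      kill_timeout = "1h"

      config {
        image = "example/long-running:latest" # placeholder image
      }
    }
  }
}
```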

Expected Result

Gracefully stops

Actual Result

Job is terminated early

Nomad Client logs (if appropriate)

```
2023-05-25T18:46:19.871Z [TRACE] client.alloc_runner.task_runner: Kill requested: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 task=sometask
2023-05-25T18:46:19.871Z [TRACE] client.alloc_runner.task_runner: Kill event: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 task=sometask event_type=Killing event_reason=""
2023-05-25T18:46:19.871Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 task=sometask type=Killing msg="Sent interrupt. Waiting 1h0m0s before force killing" failed=false
2023-05-25T18:46:19.871Z [DEBUG] client.alloc_runner.task_runner: waiting before killing task: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 task=sometask shutdown_delay=10s
2023-05-25T18:46:19.884Z [TRACE] client.alloc_runner.task_runner: setting task state: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 task=sometask state=pending
2023-05-25T18:46:19.884Z [TRACE] client.alloc_runner: handling task state update: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 done=false
2023-05-25T18:46:19.884Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 task=sometask type="Waiting for shutdown delay" msg="Waiting for shutdown_delay of 10s before killing the task." failed=false
2023-05-25T18:46:19.891Z [TRACE] client.alloc_runner: sending updated alloc: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 client_status=pending desired_status=""
2023-05-25T18:46:19.891Z [TRACE] client.alloc_runner: handling task state update: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 done=false
2023-05-25T18:46:19.891Z [TRACE] client.alloc_runner: sending updated alloc: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 client_status=pending desired_status=""
2023-05-25T18:51:29.892Z [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 task=sometask backoff=5s error="Failed to stop container 95aa9b6282da2bd9de67616eb612f9bbc6f5937078bb9c9d613fe6d85dfb8be6: Post \"http://unix.sock/containers/95aa9b6282da2bd9de67616eb612f9bbc6f5937078bb9c9d613fe6d85dfb8be6/stop
2023-05-25T18:56:34.893Z [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 task=sometask backoff=20s error="Failed to stop container 95aa9b6282da2bd9de67616eb612f9bbc6f5937078bb9c9d613fe6d85dfb8be6: Post \"http://unix.sock/containers/95aa9b6282da2bd9de67616eb612f9bbc6f5937078bb9c9d613fe6d85dfb8be6/sto
2023-05-25T18:58:46.613Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 task=sometask type=Terminated msg="Exit Code: 0" failed=false
2023-05-25T18:58:46.616Z [TRACE] client.alloc_runner: handling task state update: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 done=false
2023-05-25T18:58:46.616Z [TRACE] client.alloc_runner: sending updated alloc: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 client_status=pending desired_status=""
2023-05-25T18:58:47.618Z [DEBUG] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 task=sometask error="container stopped"
2023-05-25T18:58:48.637Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=84a4b704-5390-4916-db6c-0995657c72c3 task=sometask type=Killed msg="Task successfully killed" failed=false
```
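The roughly five-minute gaps between the two `failed to kill task` errors, followed by `backoff=5s` and then `backoff=20s`, are consistent with a stop request that times out and is retried on a multiplicative backoff schedule. A minimal illustrative sketch of such a schedule (the base, factor, and cap below are assumptions chosen to match the values visible in the logs, not Nomad's actual constants):

```python
import itertools


def kill_backoff(base=5.0, factor=4.0, cap=300.0):
    """Yield successive retry delays in seconds: 5, 20, 80, ... capped.

    Illustrative only: the constants are assumptions matching the
    backoff=5s and backoff=20s values seen in the logs, not Nomad's
    real implementation.
    """
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


# The first few delays of the assumed schedule:
print(list(itertools.islice(kill_backoff(), 4)))  # [5.0, 20.0, 80.0, 300.0]
```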

I've also opened an issue with Docker about this, since there appears to have been an API change to the stop endpoint and I'm not sure whether it's expected behavior: moby/moby#45731


tgross commented Jun 20, 2023

Hi @srisch this does sound an awful lot like #17023, so crosslinking that.

@tgross tgross added the theme/driver/docker, hcc/cst, and stage/accepted labels Jun 20, 2023
@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Jun 20, 2023

tgross commented Jun 28, 2023

Fixed in #17731, which shipped in the recent Nomad 1.6.0-beta.1 and will be backported to 1.5.x and 1.4.x once we go GA.

@tgross tgross closed this as completed Jun 28, 2023
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Jun 28, 2023
@tgross tgross added this to the 1.6.0 milestone Jun 28, 2023