Tasks using Docker drivers frequently failed to restart #16338
I'm seeing the same error on Nomad 1.4.5 and 1.5.0 on Debian 11. I've noticed that all init/pause containers vanish approximately 10 minutes after they're started, after which the error occurs on any attempted container restart. I believe these are the containers named as missing in the error message. At that point, `docker stats` (and by extension the Nomad resource utilization page) also reports zero for any container using bridge network ports. Rolling back to Nomad 1.4.4 fixes this. I'll see if I can grab something pertinent from the logs when I have more time.
Searching the logs for the missing container ID, I get:
Indeed, it might well be linked to the pause containers: they are missing on my install too, and all my jobs are using bridge networking.
Here's a simple reproducer job. Just run it, wait ~10 minutes (for the pause container to vanish), then try to restart the connect task from the GUI.
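The original job spec was not preserved in this thread; below is a minimal sketch of a Connect-enabled service job of the kind described, assuming bridge networking and a sidecar proxy (the job name, service name, port, and image are illustrative, not from the original report):

```hcl
job "connect-repro" {
  datacenters = ["dc1"]

  group "api" {
    # Bridge networking creates the pause (network namespace) container
    # that reportedly vanishes after ~10 minutes.
    network {
      mode = "bridge"
    }

    service {
      name = "connect-repro-api"
      port = "8080"

      connect {
        # Registers an Envoy sidecar proxy task alongside the main task.
        sidecar_service {}
      }
    }

    task "api" {
      driver = "docker"

      config {
        # Illustrative image; any long-running HTTP service works here.
        image = "hashicorpdev/counter-api:v3"
      }
    }
  }
}
```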
Could be some regression introduced with #15732.
Hi @dani, and thanks for raising this issue. I was able to reproduce this locally using the job spec you provided, running both Nomad and Consul in dev mode. Here is the command output shortly after starting the job:
Here is the same command after some time has passed; the job is still running at this point:
I redirected the logs from Nomad to a file; searching for the pause container ID within them yields:
The most interesting line is shown above. I'll raise this internally so we can get this looked into straight away.
I believe we understand the bug now. For an immediate workaround, you should be able to disable dangling container reconciliation (https://developer.hashicorp.com/nomad/docs/drivers/docker#enabled) to stop this from happening. Basically, when we added the change that introduced this regression, the dangling container reconciler started treating the pause containers as orphaned and removing them. We'll get a proper fix out shortly.
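For reference, a sketch of what that workaround looks like in the Nomad client's agent configuration, assuming the standard `docker` driver plugin block:

```hcl
# Nomad client agent configuration (file path is illustrative,
# e.g. /etc/nomad.d/client.hcl).
plugin "docker" {
  config {
    gc {
      dangling_containers {
        # Disables the dangling container reconciler so it stops
        # removing the Connect pause containers.
        enabled = false
      }
    }
  }
}
```

Note that this also disables cleanup of genuinely orphaned containers, so it should be re-enabled once the fix ships.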
Nomad version
```
Nomad v1.5.0
BuildDate 2023-03-01T10:11:42Z
Revision fc40c49
```
Operating system and Environment details
AlmaLinux 8.7 x86_64
Nomad installed from the pre-built standalone binary
Docker CE 23.0.1 (from the official Docker repo)
Issue
Since updating Nomad from 1.4.4 to 1.5.0, I get frequent errors when a task is restarted:
It can be triggered by a template being re-rendered with the default `change_mode` of `restart`, or by manually restarting a task from the GUI. The task is correctly stopped, but when starting it again, it fails.
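As an illustration, a minimal sketch of a template block that triggers such a restart on re-render (the Consul key and destination path are assumptions for the example, not from the original job):

```hcl
# Fragment of a task stanza.
template {
  data        = "setting = {{ key \"myapp/config/setting\" }}"
  destination = "local/app.conf"
  # "restart" is the default change_mode: the task is restarted
  # whenever the rendered template content changes.
  change_mode = "restart"
}
```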
Reproduction steps
Not sure yet what the exact conditions to reproduce are; it doesn't seem to happen 100% of the time. But I can reproduce it easily by restarting a Connect sidecar from the GUI.
Expected Result
The task is correctly restarted.
Actual Result
The task fails to start and is marked as failed.
Job file (if appropriate)
Still trying to find a simple reproducer (I can trigger it easily with internal jobs, which I can't share).
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
Sample logs when the error is triggered. In this example, I restarted the Connect proxy sidecar task (connect-proxy-pharma-ws-stub) from the GUI.