Docker driver removing container image on task failure #8552

Closed
stevenscg opened this issue Jul 28, 2020 · 10 comments

Comments

@stevenscg

Nomad version

Nomad v0.12.0 (8f7fbc8)

Operating system and Environment details

CentOS 7.8.2003

Issue

As of v0.12.0, container images are incorrectly removed from the host when a task fails, even though docker.cleanup.image is set to false.

If the task starts, runs, and then exits cleanly, the issue does not appear to occur.

The same jobs did not exhibit this behavior on any prior release; 0.11.3 was the most recent version used before 0.12.0.

As shown in the config example below, we also recently tried quoting both the keys and the values of the client options per the documentation, with no apparent change in the behavior described here. We typically quote only the values in client options.

The use case for retaining the container images is a development environment where the same image is used for several days at a time. The setup scripts for this environment build the container images, so users of this environment now see jobs randomly fail to start that would have continued to work with past versions of Nomad.

Reproduction steps

Config file:

client {
  enabled = true
  options {
    driver.whitelist = "docker"
    "docker.cleanup.image" = "false"
  }
}

plugin "docker" {
  config {
    volumes {
      enabled = true
    }
  }
}

We are not yet certain what kind of task failure triggers this behavior. A simple task that exits with code 1 or similar may be sufficient; see the sketch below.
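
For illustration, a minimal batch job along these lines should produce a failing task. This is only a sketch: the job name, datacenter, and image are placeholders and not part of the original report; any locally built image would do.

job "gc-repro" {
  datacenters = ["dc1"]
  type        = "batch"

  group "fail" {
    # Fail immediately instead of retrying, so the failure path is exercised.
    restart {
      attempts = 0
      mode     = "fail"
    }

    task "exit1" {
      driver = "docker"

      config {
        # Placeholder image; substitute a locally built image to test retention.
        image   = "busybox:1.32"
        command = "sh"
        args    = ["-c", "exit 1"]
      }
    }
  }
}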

@notnoop
Contributor

notnoop commented Jul 28, 2020

Thank you for reporting the issue. We'll investigate and follow up!

FWIW, between 0.9.0 and 0.11.2, Nomad wasn't GCing any images! We fixed image GC in #7947 for 0.11.2, so I'm a little surprised you didn't encounter this in 0.11.3. We'll need to dig in further.

@stevenscg
Author

@notnoop Noted. Thanks. I believe we were running 0.11.3 most recently because we tend to keep very current for this development environment. However, it could have been 0.11.2.

@stevenscg
Author

I believe that this issue is still present on v0.12.3.

@stevenscg
Author

@notnoop Is there any known or potential workaround for this kind of issue that I could try in the interim?

@stevenscg
Author

I spent some time with versions 0.11.4 and 0.11.2 on my test instance where this problem was occurring. These versions have the same behavior as 0.12.3, so I'm not entirely sure what's happening.

As a workaround, I tried setting a long image cleanup delay, which would be fine for my use case, but it did not seem to make any difference in the errant behavior.

client {
  enabled = true
  ....
  options {
    "docker.auth.config" = "/etc/docker/config.json"
    "driver.whitelist" = "docker"
    "docker.cleanup.image" = "true"
    "docker.cleanup.image.delay" = "30m"
  }
}
docker version
Client: Docker Engine - Community
 Version:           19.03.12
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        48a66213fe
 Built:             Mon Jun 22 15:46:54 2020
 OS/Arch:           linux/amd64
 Experimental:      false

consul version
Consul v1.8.3
Revision a9322b9c7
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

@stevenscg
Author

Still seeing this on v0.12.4.

@notnoop
Contributor

notnoop commented Sep 15, 2020

Hi @stevenscg, I'm very sorry for taking so long to investigate this. The issue seems to be mixing the old, deprecated syntax with the new one: if a plugin "docker" { config { ... } } block is present in the config, the docker driver ignores the old client options fields. As you pointed out in the workaround, once you use the options exclusively, they are interpreted as expected.

I'd suggest adopting the new plugin config syntax, with a config like the following:

client {
  enabled = true
  options {
    driver.whitelist       = "docker"
  }
}

plugin "docker" {
  config {
    volumes {
      enabled = true
    }

    gc {
      image = false
    }
  }
}
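
If you also want the retention delay you tried earlier in the client options, the plugin block has an equivalent setting. As a rough sketch (the gc image_delay option below is an assumption based on the plugin documentation, not part of the suggestion above):

plugin "docker" {
  config {
    gc {
      # Keep GC enabled but retain images for a while after the last container exits.
      image       = true
      image_delay = "30m"
    }
  }
}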

Let me know if that addresses the issue!

@stevenscg
Author

@notnoop Thanks for the info! I had moved to the plugin syntax for "volumes" but not yet "gc", so I think this will fix my particular issue. I'll drop back in a few days and close it if it all checks out.

@stevenscg
Author

This all looks good, thanks! Closing.

@github-actions

github-actions bot commented Nov 2, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 2, 2022