-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
driver/docker: Don't pull InfraImage if it exists #13265
Conversation
I just realized this is likely a dupe of #11380. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR @tbehling it's looking great!
It would be nice to change a bit of the logic around the tag checking before merging, please let me know if you have the time to take this change on, otherwise I am happy to push this.
This PR also needs a changelog entry: https://github.com/hashicorp/nomad/blob/main/contributing/CHANGELOG.md#developer-guide
I look forward to this getting merged so I can close that. This one seems a little more comprehensive. |
c2218a6
to
4cd988c
Compare
74d66b1
to
402e594
Compare
Would someone be able to review this again please :) Would solve a big issue we're having! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tbehling and thanks for working on the changes for this PR. I have pushed commit e3a60b4 in order to resolve a couple of issue I found locally.
The first is that my previous comment regarding checking the error returned from client.InspectImage
was a half truth. I checked the library and when the image is not found an error is returned. This therefore meant the infra-image could never be downloaded because the driver would always error and exit without trying to download it. I have reverted this behaviour as seen elsewhere in the Docker driver. I plan to raise a followup PR to wrap this client call in a helper so other committers do not need to be aware of this and we can handle the error type returned to give better errors and logs to operators.
The second was that although we were incrementing the infra-image reference count, we were not decrementing the count when the network was destroyed. I have therefore modified the DestroyNetwork
function call to perform a best-effort image reference decrement.
@tgross seeing as I have made a couple of significant changes to this PR, i'd appreciate another set of eyes on this PR.
drivers/docker/network.go
Outdated
if tag != "latest" { | ||
dockerImage, err := client.InspectImage(d.config.InfraImage) | ||
if err != nil { | ||
d.logger.Debug("InspectImage failed for infra_image container pull", | ||
"image", d.config.InfraImage, "error", err) | ||
} else if dockerImage != nil { | ||
// Image exists, so no pull is attempted; just increment its reference count | ||
d.coordinator.IncrementImageReference(dockerImage.ID, d.config.InfraImage, allocID) | ||
return nil | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a (narrow) time-of-check-time-of-use race here. If we call InspectImage
and then a concurrent task shutdown happens before we IncrementImageReference
, we could end up removing the image and it would no longer exist by the time we get to PullImage
below.
It looks like the dockerCoordinator
has a mutex around accessing images. We might want to look at locking around this whole block. We can release after this block because by that point we'll have incremented the ref count and we don't have to worry about the image getting deleted out from under us (except outside of Nomad, of course).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good shout!
I've pushed a new commit to handle this with some comments around why it doesn't use defer in particular which would make it a little easier to grok: 922445e
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
This PR tweaks the behavior of Nomad's Docker driver during network setup.
When Nomad sets up a new Docker network, it establishes the network namespace by running a "no-op" container, internally called
InfraImage
. By default the container's image isgcr.io/google_containers/pause-amd64:3.1
(whereamd64
is the GOARCH).Every time a network is set up, that container image is pulled from the Google Container Registry (or alternate image, if configured). In a cluster with a high rate of Nomad job creation, the Docker pull is seen to intermittently fail (about 0.4%), perhaps due to rate limiting or normal transient network problems.
Because this image is versioned (tagged
3.1
, currently), it is not necessary to run a Docker pull on every use. This PR proposes a refinement to the Docker driver'sCreateNetwork
function to first check if the image is present locally before pulling it. The expectation is to improve resiliency of Nomad job launches, and to remove a bit of excess load from GCR (or other configured registry).Fixes #11014
Fixes #11380
Fixes #10318