Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

driver/docker: Don't pull InfraImage if it exists #13265

Merged
merged 5 commits into from
Jul 7, 2022

Conversation

tbehling
Copy link
Contributor

@tbehling tbehling commented Jun 7, 2022

This PR tweaks the behavior of Nomad's Docker driver during network setup.

When Nomad sets up a new Docker network, it establishes the network namespace by running a "no-op" container, internally called InfraImage. By default the container's image is gcr.io/google_containers/pause-amd64:3.1 (where amd64 is the GOARCH).

Every time a network is set up, that container image is pulled from the Google Container Registry (or alternate image, if configured). In a cluster with a high rate of Nomad job creation, the Docker pull is seen to intermittently fail (about 0.4%), perhaps due to rate limiting or normal transient network problems.

Because this image is versioned (tagged 3.1, currently), it is not necessary to run a Docker pull on every use. This PR proposes a refinement to the Docker driver's CreateNetwork function to first check if the image is present locally before pulling it. The expectation is to improve resiliency of Nomad job launches, and to remove a bit of excess load from GCR (or other configured registry).

Fixes #11014
Fixes #11380
Fixes #10318

drivers/docker/network.go Outdated Show resolved Hide resolved
@tbehling
Copy link
Contributor Author

tbehling commented Jun 7, 2022

I just realized this is likely a dupe of #11380.

@jrasell jrasell removed the request for review from nickethier June 7, 2022 06:53
Copy link
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @tbehling it's looking great!

It would be nice to change a bit of the logic around the tag checking before merging, please let me know if you have the time to take this change on, otherwise I am happy to push this.

This PR also needs a changelog entry: https://github.com/hashicorp/nomad/blob/main/contributing/CHANGELOG.md#developer-guide

drivers/docker/network.go Outdated Show resolved Hide resolved
drivers/docker/network.go Outdated Show resolved Hide resolved
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jun 7, 2022
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jun 7, 2022
@tgross tgross assigned jrasell and unassigned tbehling Jun 7, 2022
@phreakocious
Copy link
Contributor

I just realized this is likely a dupe of #11380.

I look forward to this getting merged so I can close that. This one seems a little more comprehensive.

@jrasell jrasell added this to the 1.3.2 milestone Jun 7, 2022
@tbehling tbehling requested a review from a team June 29, 2022 14:56
@tbehling tbehling requested a review from a team as a code owner June 29, 2022 14:56
@tbehling tbehling force-pushed the tbehling/avoid-pull-infra-image branch from c2218a6 to 4cd988c Compare June 29, 2022 14:59
@tbehling tbehling changed the title drivers/docker: Don't pull InfraImage if it exists driver/docker: Don't pull InfraImage if it exists Jun 29, 2022
tbehling added a commit that referenced this pull request Jun 29, 2022
@TurtleIntercept
Copy link

Would someone be able to review this again please :) Would solve a big issue we're having!

Copy link
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tbehling and thanks for working on the changes for this PR. I have pushed commit e3a60b4 in order to resolve a couple of issue I found locally.

The first is that my previous comment regarding checking the error returned from client.InspectImage was a half truth. I checked the library and when the image is not found an error is returned. This therefore meant the infra-image could never be downloaded because the driver would always error and exit without trying to download it. I have reverted this behaviour as seen elsewhere in the Docker driver. I plan to raise a followup PR to wrap this client call in a helper so other committers do not need to be aware of this and we can handle the error type returned to give better errors and logs to operators.

The second was that although we were incrementing the infra-image reference count, we were not decrementing the count when the network was destroyed. I have therefore modified the DestroyNetwork function call to perform a best-effort image reference decrement.

@tgross seeing as I have made a couple of significant changes to this PR, i'd appreciate another set of eyes on this PR.

Comment on lines 142 to 152
if tag != "latest" {
dockerImage, err := client.InspectImage(d.config.InfraImage)
if err != nil {
d.logger.Debug("InspectImage failed for infra_image container pull",
"image", d.config.InfraImage, "error", err)
} else if dockerImage != nil {
// Image exists, so no pull is attempted; just increment its reference count
d.coordinator.IncrementImageReference(dockerImage.ID, d.config.InfraImage, allocID)
return nil
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a (narrow) time-of-check-time-of-use race here. If we call InspectImage and then a concurrent task shutdown happens before we IncrementImageReference, we could end up removing the image and it would no longer exist by the time we get to PullImage below.

It looks like the dockerCoordinator has a mutex around accessing images. We might want to look at locking around this whole block. We can release after this block because by that point we'll have incremented the ref count and we don't have to worry about the image getting deleted out from under us (except outside of Nomad, of course).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good shout!

I've pushed a new commit to handle this with some comments around why it doesn't use defer in particular which would make it a little easier to grok: 922445e

@jrasell jrasell added the backport/1.3.x backport to 1.3.x release line label Jul 7, 2022
@jrasell jrasell merged commit 295021c into main Jul 7, 2022
Nomad - Community Issues Triage automation moved this from In Progress to Done Jul 7, 2022
@jrasell jrasell deleted the tbehling/avoid-pull-infra-image branch July 7, 2022 15:44
@github-actions
Copy link

github-actions bot commented Nov 5, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Development

Successfully merging this pull request may close these issues.

network bridge mode,Can I use a local mirror for pause-amd64? Allow using a local infra_image
5 participants