
generated /etc/hosts not shared between tasks #10809

Closed
kainoaseto opened this issue Jun 24, 2021 · 7 comments · Fixed by #10823

Assignees: tgross
Labels: stage/accepted, theme/driver/docker, theme/networking, type/bug
Milestone: 1.1.3

Comments

@kainoaseto

Nomad version

Nomad v1.1.2 (60638a0)

Operating system and Environment details

AWS Linux 2
Docker version:
Client:
Version: 19.03.6-ce
API version: 1.40
Go version: go1.13.4
Git commit: 369ce74
Built: Fri May 29 04:01:26 2020
OS/Arch: linux/amd64
Experimental: false

Server:
Engine:
Version: 19.03.6-ce
API version: 1.40 (minimum version 1.12)
Go version: go1.13.4
Git commit: 369ce74
Built: Fri May 29 04:01:57 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.2
GitCommit: ff48f57fc83a8c44cf4ad5d672424a98ba37ded6
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683

Issue

In the latest Nomad release v1.1.2 the changelog mentions the improvement from GH-10766:
"docker: Tasks using network.mode = "bridge" that don't set their network_mode will receive a /etc/hosts file that includes the pause container's hostname and any extra_hosts."

I found that this is actually a breaking change from previous behavior in Nomad v1.1.1 when the network_mode is not set.

Reproduction steps

The scenario where I found this to be an issue:

  • no network_mode is defined for two tasks in a Task Group using a network namespace
  • in one task, task A, the /etc/hosts file is updated at runtime from the docker entrypoint
  • the other task, task B, uses the updated entries in the /etc/hosts file to resolve service-to-service mesh routing

In Nomad versions <= v1.1.1 the /etc/hosts file is updated in both containers, but with this new change in v1.1.2 the hosts file is overwritten in task B (or is no longer shared?), which breaks our current network setup for routing between containers.
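
A minimal sketch of that layout is below (job, task, and image names are placeholders for illustration, not our real job, which is linked further down in this thread):

job "shared-hosts-example" {
  datacenters = ["dc1"]

  group "example" {
    network {
      mode = "bridge" # tasks in this group share the pause container's network namespace
    }

    task "task-a-proxy" {
      driver = "docker"

      config {
        # no network_mode is set here, so on v1.1.2 this task receives Nomad's
        # generated /etc/hosts; runtime edits made by the entrypoint no longer
        # reach the other task the way they did on v1.1.1
        image = "hashicorp/http-echo"
        args  = ["-listen", ":8080", "-text", "proxy"]
      }
    }

    task "task-b-app" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args  = ["-listen", ":8081", "-text", "app"]
      }
    }
  }
}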

Workaround

To fix this, we are removing the manual updating of /etc/hosts from task A and instead using the extra_hosts feature of the docker driver so both tasks are updated (this is how we should have done it from the get-go, but this was a forcing function). We're currently making this change to restore networking in our environment, but it came as a surprise when we updated and all new deployments could no longer talk to other services.
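
A sketch of that workaround, assuming the docker driver's extra_hosts option (hostname and IP are placeholders):

task "task-b-app" {
  driver = "docker"

  config {
    image = "hashicorp/http-echo"
    args  = ["-listen", ":8081", "-text", "app"]

    # extra_hosts entries are "hostname:IP" strings; per the v1.1.2 changelog,
    # for bridge-mode tasks that don't set a network_mode these get written
    # into the generated /etc/hosts, so every task in the group sees the same entries
    extra_hosts = ["upstream-service:10.0.0.10"]
  }
}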

Expected Result

Either the CHANGELOG.md reflects the breaking change, or a change is implemented that adjusts this behavior to preserve shared R/W access to /etc/hosts between the pause container, task A, and task B.

Actual Result

surprised_pikachu.jpg upon updating to the latest Nomad, when /etc/hosts no longer appears to be shared between network-namespaced containers.

@tgross (Member) commented Jun 25, 2021

Hi @kainoaseto, this does look like an unfortunate breaking change which we should document better so folks don't get caught out by it. Do you have a minimal example job that exercises the behavior and one that shows the workaround? That might help us fix up the upgrade guide here.

@tgross tgross self-assigned this Jun 25, 2021
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jun 25, 2021
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jun 25, 2021
@tgross tgross changed the title from "Nomad 1.1.2 - Breaking Change with GH-10766" to "unexpected/undocumented breaking change with /etc/hosts" Jun 25, 2021
@kainoaseto (Author) commented

Hi @tgross, thanks for getting back to me on this! Sure thing: this job just runs two http-echo tasks, but if you exec into both with /bin/sh and modify the /etc/hosts file, the problem should be observable (the change won't propagate to both).

I've uploaded the job file here: https://github.com/kainoaseto/nomad-jobs/blob/main/network-namespace/job.hcl

Unfortunately, after digging in more, the workaround ended up not being a viable upgrade path for us. The goal for upgrading was to implement a forward-compatible way of setting /etc/hosts in current job files so we could avoid downtime and redeploying all jobs.

Our requirements around the upgrade:

  • non-downtime release, existing jobs continue to function once the cluster upgrade is completed
  • we can only modify the proxy task's (task A's) ability to update the /etc/hosts file, or the overall job config, before upgrading; we cannot modify the app's (task B's) runtime to modify the /etc/hosts file (like we can with the proxy)

We tried the following workarounds:

  1. Use the extra_hosts feature; unfortunately this errors out on 1.1.1, so we can't preload jobs to prepare for the upgrade without forcing downtime by redeploying all jobs
  2. Template the /etc/hosts file (sketched below), but the hook in Nomad 1.1.2 that modifies the /etc/hosts file runs after templating, so it overwrites the changes
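
A rough sketch of attempt 2 (the entries and destination are illustrative only; on 1.1.2 the hook that writes the generated /etc/hosts runs after template rendering, so anything rendered here gets overwritten):

task "task-a-proxy" {
  driver = "docker"
  # (docker config omitted)

  template {
    # illustrative static entries; the real ones come from our service discovery
    data = <<EOF
127.0.0.1 localhost
10.0.0.10 upstream-service
EOF
    # assumption: rendered into the task directory with the intent of ending up
    # at /etc/hosts in the container; the exact wiring isn't shown here
    destination = "local/hosts"
  }
}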

With that, we've hit the extent of our Nomad skills for solving this and providing a non-downtime upgrade path from 1.1.1 -> 1.1.2. If you have any thoughts we could try, it would be greatly appreciated! Otherwise it seems we might be stuck on 1.1.1.

@tgross tgross added the stage/accepted label and removed the stage/waiting-reply and theme/docs labels Jun 28, 2021
@tgross (Member) commented Jun 28, 2021

Thanks @kainoaseto, I think I have a better understanding of the problem now. What we should have done when we created the /etc/hosts file was to put the generated file in the allocation directory, not the task directory. That way it can be shared easily between tasks in the same allocation.

The patch in #10823 fixes the problem you've described here. On a build with that patch, I ran the job you provided and modified the /etc/hosts file, and the change is now propagated across tasks:

$ nomad alloc exec -task task-a-proxy 848 /bin/sh -c 'echo 192.168.1.256 wintermute >> /etc/hosts'

$ nomad alloc exec -task task-b-app 848 cat /etc/hosts
# this file was generated by Nomad
127.0.0.1 localhost
::1 localhost
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

# this entry is the IP address and hostname of the allocation
# shared with tasks in the task group's network
172.26.64.201 7d0f0c1b523f
192.168.1.256 wintermute

Something I wanted to check on before getting that PR reviewed and merged was this:

Use the extra_hosts feature, unfortunately this errors out on 1.1.1 so we can't preload jobs to prepare for the upgrade without forcing downtime by redeploying all jobs

Can you describe what you're seeing here? I want to make sure there's not a second bug.

@tgross tgross changed the title from "unexpected/undocumented breaking change with /etc/hosts" to "generated /etc/hosts not shared between tasks" Jun 28, 2021
@tgross tgross added this to the 1.1.3 milestone Jun 28, 2021
@tgross (Member) commented Jun 28, 2021

Oh, you're probably seeing the error message "Conflicting options: custom host-to-IP mapping and the network mode."? See also #6322 for that, for which we don't currently have a workaround.
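
Roughly, the conflicting shape on 1.1.1 looks like the sketch below (names and IPs are placeholders; this is a reading of #6322 rather than your exact job):

group "example" {
  network {
    mode = "bridge" # Nomad runs the task with Docker's container:<pause-id> network mode
  }

  task "task-b-app" {
    driver = "docker"

    config {
      image = "hashicorp/http-echo"
      # on 1.1.1 this mapping is passed through to Docker directly, and Docker
      # rejects a custom host-to-IP mapping combined with a container network mode
      extra_hosts = ["upstream-service:10.0.0.10"]
    }
  }
}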

@kainoaseto (Author) commented

Thank you @tgross for making that patch so quickly! That seems like it'll take care of our problems, and yes, that's the exact error message I was getting. No problem there; it just left us without a way to work around this for a 1.1.1 -> 1.1.2 upgrade.

Nomad - Community Issues Triage automation moved this from In Progress to Done Jun 30, 2021
@tgross (Member) commented Jun 30, 2021

#10823 fixes this and will ship in the upcoming Nomad 1.1.3

@github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 17, 2022