
Restart of Nomad Client causes port forwarding issues upon restart of running Connect jobs #7537

Closed
spuder opened this issue Mar 28, 2020 · 4 comments · Fixed by #7643

@spuder
Contributor

spuder commented Mar 28, 2020

Nomad version

Nomad: 0.10.4
Consul: 1.7.0
Consul ACLs: Enabled

Issue

Rotating a Consul token causes the Nomad agent to be unable to use Consul Connect for any new jobs until you reboot the agent's OS.

Reproduction steps

  1. Set up Consul Connect on a cluster with ACLs enabled.
  2. Create the following policy and apply it to the Nomad servers:
agent_prefix "" {
    policy = "write"
}

node_prefix "" {
    policy = "write"
}

service_prefix "" {
    policy = "write"
}

acl = "write"

  3. Create a token with the policy and save it to the Nomad agents in /etc/nomad/config.json (a CLI sketch for creating the policy and token follows these steps).
    (I've since learned that this is not best practice; I'm leaving it here for consistency.)
{
  "consul": {
    "token": "123456"
  }
...

  4. Restart the Nomad agents:

service nomad restart
  5. Start the countdash job as shown in the Nomad documentation:
job "countdash" {
   datacenters = ["dc1"]
   group "api" {
     network {
       mode = "bridge"
     }

     service {
       name = "count-api"
       port = "9001"

       connect {
         sidecar_service {}
       }
     }

     task "web" {
       driver = "docker"
       config {
         image = "hashicorpnomad/counter-api:v1"
       }
     }
   }

   group "dashboard" {
     network {
       mode ="bridge"
       port "http" {
         static = 9002
         to     = 9002
       }
     }

     service {
       name = "count-dashboard"
       port = "9002"

       connect {
         sidecar_service {
           proxy {
             upstreams {
               destination_name = "count-api"
               local_bind_port = 8080
             }
           }
         }
       }
     }

     task "dashboard" {
       driver = "docker"
       env {
         COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
       }
       config {
         image = "hashicorpnomad/counter-dashboard:v1"
       }
     }
   }
 }
  6. While the job is running, create a new token and save it to /etc/nomad/config.json on the Nomad agent:
{
  "consul": {
    "token": "9876543"
  }
...
  7. Restart the Nomad agent:
service nomad restart
  8. Attempt to redeploy/restart the countdash job (or start a new job that uses Consul Connect).
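
For reference, here is a rough sketch of how the policy and token from steps 2 and 3 could be created with the Consul CLI. The file name nomad-server-policy.hcl, the policy name, and the token description are placeholders for illustration, not part of the original setup:

# Save the rules from step 2 to nomad-server-policy.hcl, then create a policy from them.
consul acl policy create -name nomad-server -rules @nomad-server-policy.hcl

# Create a token attached to that policy; its SecretID is what goes into /etc/nomad/config.json.
consul acl token create -description "nomad agent token" -policy-name nomad-server

# Sanity-check that Consul accepts the token before restarting the Nomad agent.
CONSUL_HTTP_TOKEN=123456 consul acl token read -self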

The job will start successfully, but you will be unable to connect to the dashboard, and any future job that requires Consul Connect will also fail on that agent.
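
As a concrete check (the client address is a placeholder; curl is simply probing the static port from the countdash job above):

# Before the agent restart this returns the dashboard page; afterwards the connection hangs or is refused.
curl -sv --max-time 5 http://<client-ip>:9002/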

Expected Behavior

Rotating a token should not require draining or rebooting a Nomad agent.

Actual behavior

When a token changes, the Nomad agent enters an unusable state that requires a reboot to fix.

Workaround

After rotating a Consul token, do a rolling reboot of the entire Nomad agent cluster.

Recovering

The following attempts to revive the now-broken Nomad agent are unsuccessful:

service nomad restart
service consul restart
service docker restart
iptables -F CNI-FORWARD

The only way I've been able to recover the Nomad agent is to physically reboot the machine.
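
If it is useful to anyone else hitting this, a rough way to inspect the port-forwarding state on an affected node, assuming bridge networking uses the CNI portmap plugin (so the chain name below is an assumption), is:

# Check whether the DNAT rule for the dashboard's static port 9002 is still present.
iptables -t nat -L CNI-HOSTPORT-DNAT -n --line-numbers

# List the network namespaces created for bridge-mode allocations.
ls /var/run/netns/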

I don't have a quick method to reproduce locally, but I have recorded a video of me reproducing it. I have reproduced it twice now in my environment.

https://youtu.be/OrVhA-gh4nM (Recommend watching at 4k)

Additional information

I will attempt to reproduce again and capture the logs. A couple of interesting log entries appear at about the same time:

Mar 28 21:54:06 sb-sand-nomadagent1 nomad[13102]:     2020-03-28T21:54:06.667Z [INFO]  client.gc: marking allocation for GC: alloc_id=8e258d0e-b5af-8cdc-1272-8cad2e84dc36
Mar 28 21:54:06 sb-sand-nomadagent1 nomad[13102]:     2020-03-28T21:54:06.674Z [ERROR] client.alloc_runner.runner_hook: failed to cleanup network for allocation, resources may have leaked: alloc_id=8e258d0e-b5af-8cdc-1272-8cad2e84dc36 alloc=8e258d0e-b5af-8cdc-1272-8cad2e84dc36 error="cni plugin not initialized"
Mar 28 21:56:22 sb-sand-nomadagent1 nomad[13481]:     2020-03-28T21:56:22.978Z [ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=fd23333b-4d04-163c-c223-ad8c7a9b1eb4 task=connect-proxy-count-api error="exit status 1" stderr="==> Failed looking up sidecar proxy info for _nomad-task-fd23333b-4d04-163c-c223-ad8c7a9b1eb4-group-api-count-api-9001: Unexpected response code: 403 (ACL not found)
Mar 28 21:56:22 sb-sand-nomadagent1 nomad[13481]: "
Mar 28 21:56:22 sb-sand-nomadagent1 nomad[13481]:     2020-03-28T21:56:22.979Z [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=fd23333b-4d04-163c-c223-ad8c7a9b1eb4 task=connect-proxy-count-api error="prestart hook "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"
Mar 28 21:56:22 sb-sand-nomadagent1 nomad[13481]:     2020-03-28T21:56:22.979Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fd23333b-4d04-163c-c223-ad8c7a9b1eb4 task=connect-proxy-count-api reason="Restart within policy" delay=15.298581092s
Mar 28 21:56:24 sb-sand-nomadagent1 nomad[13481]:     2020-03-28T21:56:24.078Z [ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=1f1cc796-98b6-8d01-ccfa-3052ede8df49 task=connect-proxy-count-dashboard error="exit status 1" stderr="==> Failed looking up sidecar proxy info for _nomad-task-1f1cc796-98b6-8d01-ccfa-3052ede8df49-group-dashboard-count-dashboard-9002: Unexpected response code: 403 (ACL not found)
Mar 28 21:56:24 sb-sand-nomadagent1 nomad[13481]: "

**Update 1**

The biggest workaround for this issue is to avoid putting tokens in /etc/nomad/config.json; it is better to create a dedicated policy and attach the token to that policy.
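
One way to keep the raw token out of config.json, sketched here on the assumption that the agent runs under systemd and that Nomad's Consul integration honors the CONSUL_HTTP_TOKEN environment variable:

# /etc/systemd/system/nomad.service.d/consul-token.conf (drop-in path is an assumption)
# [Service]
# Environment=CONSUL_HTTP_TOKEN=9876543

systemctl daemon-reload
systemctl restart nomad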

@shoenig
Member

shoenig commented Apr 3, 2020

Thank you for taking the time to report this, @spuder .

I think I have been able to reproduce the underlying bad behavior, which is that sometimes after restarting the Nomad server, something causes it to become unable to manage the network namespaces necessary for Connect. Quite possibly related to #7536 (again, thanks!).

I'm working on a minimal reproduction to help track down the problem.

@shoenig shoenig changed the title Rotating token for consul-connect makes agent unusable Restart of Nomad Client causes port forwarding issues upon restart of running Connect jobs Apr 6, 2020
@shoenig
Member

shoenig commented Apr 6, 2020

We've narrowed this down to a problem with our use of the go-cni plugin. I've updated the title to better reflect what's happening. Reproduction is as simple as:

  1. run Nomad with a usable Connect configuration (no ACLs required), e.g. sudo nomad agent -dev-connect
  2. run a connect job that makes use of static port forwarding (e.g. nomad job init -connect -short && nomad job run example.nomad)
  3. restart nomad agent
  4. stop the job (first cni plugin error messages appear)
  5. start the job (static port mapping no longer works)
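
Condensed into a shell session for anyone following along (the sleep, the curl probe, and the assumption that the generated example exposes a static port 9002 like the countdash job above are mine, not part of the steps):

# 1. dev agent with a usable Connect configuration, no ACLs
sudo nomad agent -dev-connect &
sleep 5   # give the dev agent a moment to come up

# 2. a Connect job with static port forwarding
nomad job init -connect -short
nomad job run example.nomad
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9002/   # static port works at this point

# 3. restart the Nomad agent (kill the dev agent above and start it again)
# 4. nomad job stop <job>          -> the first cni plugin error messages appear
# 5. nomad job run example.nomad   -> the static port mapping no longer works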

@nickethier
Member

Hey @spuder, thanks again for the detailed bug report. We believe we've fixed this in the release candidate for 0.11 that was just announced: https://releases.hashicorp.com/nomad/0.11.0-rc1/

@github-actions

github-actions bot commented Nov 9, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 9, 2022