Failed to open netns when using Consul Connect: Unknown FS magic 1021994 #8371

Closed
liuzhen opened this issue Jul 7, 2020 · 11 comments

@liuzhen

liuzhen commented Jul 7, 2020

Nomad version

Nomad v0.11.3 (8918fc8)

Operating system and Environment details

Red Hat Enterprise Linux Server release 7.5 (Maipo)
LANG=en_US.UTF-8
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/consul

Issue

failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure bridge network: failed to open netns "/var/run/docker/netns/48b267c7ac4f": unknown FS magic on "/var/run/docker/netns/48b267c7ac4f": 1021994

Reproduction steps

setup cni-plugins-v0.8.6
create job

Job file (if appropriate)

job "cm-api-${env}" {
  datacenters = ["a", "b", "c"]
  type = "service"

  constraint {
    attribute = "$${node.class}"
    value = "${env}"
  }

  # only one group
  group "cm-api" {
    restart {
      attempts = 3
      interval = "30m"

      delay = "15s"
    }

    network {
      mode = "bridge"
    }

    service {
      name = "cm-api"
      port = 8100

      connect {
        sidecar_service {}
      }
    }

    task "cm-api" {
      env {
        RELOAD_INTERVAL = 3600
      }

      driver = "docker"

      config {
        image = "private-registry/cm-api:${version}"
        force_pull = true
        network_mode = "host"
        auth {
          username = ""
          password = ""
        }
        auth_soft_fail = true
      }

      resources {
        network {
          port "metrics" { }
        }
      }

      service {
        name = "cm-api"
        port = "metrics"
        tags = ["$${node.class}"]

        check {
          type = "http"
          port = "metrics"
          path = "/metrics"
          interval = "30s"
          timeout = "10s"

          check_restart {
            limit = 3
            grace = "90s"
          }
        }
      }
    }
  }
}

Nomad Client logs (if appropriate)

If possible please post relevant logs in the issue.

Logs and other artifacts may also be sent to: nomad-oss-debug@hashicorp.com

Please link to your Github issue in the email and reference it in the subject
line:

To: nomad-oss-debug@hashicorp.com

Subject: GH-1234: Errors garbage collecting allocs

Emails sent to that address are readable by all HashiCorp employees but are not publicly visible.

Nomad Server logs (if appropriate)

@tgross
Member

tgross commented Jul 7, 2020

Hi @liuzhen!

I was looking to see if I could find that error message and saw you'd opened containernetworking/plugins#507 in the CNI plugins project. That led me to this closed issue in the same project: containernetworking/plugins#69 which suggests this could be a problem with how the file system of the environment has been mounted (or possibly permissions to that filesystem). Is your Nomad client running as root on the host? Or is there anything unusual about how the file system mounts are built in your environment?
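
For reference, the "unknown FS magic" text appears to come from the netns validation performed by the CNI machinery. Below is a minimal, illustrative Go sketch of the kind of check that produces it, loosely modeled on the containernetworking/plugins ns package; it is not the actual Nomad or CNI source. Note that 0x1021994 is TMPFS_MAGIC, which suggests the Docker netns path is resolving to plain tmpfs rather than a mounted network namespace.

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// isNetNS approximates the validation that rejects the Docker netns path:
// the file must live on nsfs (a bind-mounted namespace file) or on procfs
// (/proc/<pid>/ns/net); any other filesystem is reported as "unknown FS magic".
func isNetNS(path string) error {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return err
	}
	switch int64(st.Type) { // the Type field's width varies by architecture
	case unix.NSFS_MAGIC, unix.PROC_SUPER_MAGIC:
		return nil
	case unix.TMPFS_MAGIC: // 0x1021994, the magic reported in this issue
		return fmt.Errorf("unknown FS magic on %q: %x (tmpfs, not a netns mount)", path, st.Type)
	default:
		return fmt.Errorf("unknown FS magic on %q: %x", path, st.Type)
	}
}

func main() {
	// Path taken from the error message above; adjust for your host.
	fmt.Println(isNetNS("/var/run/docker/netns/48b267c7ac4f"))
}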

@liuzhen
Author

liuzhen commented Jul 8, 2020

Hi, I saw that issue. We run Nomad as root via systemd.

# ps -ef |grep nomad
root     11634     1  0 Jul03 ?        00:39:57 /opt/nomad/nomad agent -config=/opt/nomad/nomad.d
root     11954 11634  0 Jul07 ?        00:00:47 /opt/nomad/nomad logmon
root     12093 11634  0 Jul07 ?        00:00:01 /opt/nomad/nomad docker_logger
root     16759 16685  0 08:38 pts/0    00:00:00 grep --color=auto nomad
root     24400 11634  0 Jul07 ?        00:00:32 /opt/nomad/nomad logmon
root     25515 11634  0 Jul07 ?        00:00:00 /opt/nomad/nomad docker_logger

and for the filesystems:

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       3.9G  156M  3.5G   5% /
devtmpfs         24G     0   24G   0% /dev
tmpfs            24G  8.0K   24G   1% /tmp
tmpfs            24G  2.4G   22G  10% /run
tmpfs            24G     0   24G   0% /sys/fs/cgroup
/dev/sda5       4.8G  1.7G  2.9G  37% /usr
/dev/sda6        53G  8.5G   42G  17% /opt
/dev/sda3       3.9G  569M  3.1G  16% /var
...

/opt is where Nomad is installed, and /var is for Docker.
I don't see anything unusual.

@tgross
Member

tgross commented Jul 8, 2020

We can look into this @liuzhen but I recommend re-opening containernetworking/plugins#507 as well to see if that project might have a better idea of what's going on.

@mw866

mw866 commented Sep 22, 2020

I encountered the same error by following the official Consul Connect example at https://www.nomadproject.io/docs/integrations/consul-connect

After some research, I believe the issue is related to how Nomad handles network namespaces in my specific environment. However, there are not enough logs for me to investigate further.

$ cat /proc/version
Linux version 5.4.0-1015-raspi (buildd@bos02-arm64-074) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)) #15-Ubuntu SMP Fri Jul 10 05:34:24 UTC 2020

$ nomad --version
Nomad v0.12.5 (514b0d667b57068badb43795103fb7dd3a9fbea7)

$ /opt/cni/bin/bridge
CNI bridge plugin v0.8.7

The error messages:

$ nomad monitor
2020-09-22T03:57:28.657Z [ERROR] client.alloc_runner.runner_hook: failed to cleanup network for allocation, resources may have leaked: alloc_id=62794dd4-7f42-2b7e-0053-74305e8d4321 alloc=62794dd4-7f42-2b7e-0053-74305e8d4321 error="unknown FS magic on "/var/run/docker/netns/20bca007e2cd": 1021994"

$ nomad alloc status ce2aeb64

ID                   = ce2aeb64-141c-15c4-3e7e-6dab880811ae
Eval ID              = ca71ee92
Name                 = countdash.dashboard[0]
Node ID              = d1967cec
Node Name            = ubuntu
Job ID               = countdash
Job Version          = 0
Client Status        = failed
Client Description   = Failed tasks
Desired Status       = stop
Desired Description  = alloc was rescheduled because it failed
Created              = 27m10s ago
Modified             = 10m44s ago
Deployment ID        = 9b960fd5
Deployment Health    = unhealthy
Replacement Alloc ID = 0ae4bbe4

Allocation Addresses (mode = "bridge")
Label                           Dynamic  Address
*http                           yes      192.168.0.123:9002 -> 9002
*connect-proxy-count-dashboard  yes      192.168.0.123:20092 -> 20092

Task "connect-proxy-count-dashboard" (prestart sidecar) is "dead"
Task Resources
CPU      Memory   Disk     Addresses
250 MHz  128 MiB  300 MiB  

Task Events:
Started At     = N/A
Finished At    = 2020-09-22T07:44:22Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type            Description
2020-09-22T07:44:22Z  Killing         Sent interrupt. Waiting 5s before force killing
2020-09-22T07:44:22Z  Not Restarting  Error was unrecoverable
2020-09-22T07:44:22Z  Driver Failure  Failed to start container a83d9f23882dca94175553c01a3e4e2dfaacffe6954d7ad1fba5f4eed13d3127: API error (409): cannot join network of a non running container: dbcf5bf83f899cf2f5ee9e07dbac5d9dc50a025a205ab8bdc215fb6f754daa8c
2020-09-22T07:44:21Z  Task Setup      Building Task Directory
2020-09-22T07:44:17Z  Received        Task received by client

Task "dashboard" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB  

Task Events:
Started At     = N/A
Finished At    = 2020-09-22T07:44:22Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type                 Description
2020-09-22T07:44:22Z  Killing              Sent interrupt. Waiting 5s before force killing
2020-09-22T07:44:22Z  Sibling Task Failed  Task's sibling "connect-proxy-count-dashboard" failed
2020-09-22T07:44:17Z  Received             Task received by client

@mw866

mw866 commented Sep 30, 2020

I found the issue despite the lack of logs/error messages.
The demo image hashicorpnomad/counter-api only supports the linux/amd64 architecture, whereas I use linux/arm64.

@bert2002
Contributor

bert2002 commented Dec 9, 2020

Hi,
I am encountering the same problem on the s390x architecture.

# cat /proc/version 
Linux version 4.19.0-12-s390x (debian-kernel@lists.debian.org) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.152-1 (2020-10-18)

# nomad --version
Nomad v0.12.9 (34271252dcaf5af22d6ed59fb0e166216e5a8b69)

(not used here, but for reference)
# /opt/cni/bin/bridge
CNI bridge plugin v0.8.7

# docker --version
Docker version 18.06.3-ce, build d7080c1

With Consul Connect enabled I run into the same cannot join network of a non running container problem, but even when trying to run only a plain task (the container is built on the same machine and works when run without Nomad) I run into this:

# nomad alloc status edb0aab7
ID                   = edb0aab7-94b2-0fff-3033-5c4fd312e698
Eval ID              = 303babe0
Name                 = demo.ux[0]
Node ID              = b1b420ab
Node Name            = demo
Job ID               = demo
Job Version          = 0
Client Status        = failed
Client Description   = Failed tasks
Desired Status       = stop
Desired Description  = alloc was rescheduled because it failed
Created              = 4m1s ago
Modified             = 3m46s ago
Deployment ID        = 9191967e
Deployment Health    = unhealthy
Replacement Alloc ID = 59a12e2d

Task "ux" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB  

Task Events:
Started At     = N/A
Finished At    = 2020-12-09T03:21:02Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type           Description
2020-12-09T03:21:02Z  Setup Failure  failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: failed to open netns "/var/run/docker/netns/3693600416de": unknown FS magic on "/var/run/docker/netns/3693600416de": 1021994
2020-12-09T03:20:58Z  Received       Task received by client

I have the same setup running on an amd64 machine (with a different Docker image, though) and it works fine. The filesystem is ext4 in both cases; the only difference is that the s390x machine has stripe=2048 enabled.

This only happens when the network mode is set to bridge. When it is set to host, the task works.
Example:

job "demo" {
  datacenters = [ "demo" ]
  type = "service"

  group "ux" {
    count = 1

    network {
      mode = "bridge"
    }

    task "filebeat" {
      driver = "docker"

      resources {
        memory = 100
        cpu = 50
      }

      env {
        "BEATSNAME" = "filebeat"
      }

      config {
        image = "beats:7.9.1"

        mounts = [
          {
            type = "bind"
            target = "/Beats/filebeat.yml"
            source = "/opt/demo/filebeat/filebeat.yml"
          }
        ]
      }
    }
  }
}

The problem is that Consul Connect needs bridge mode to work!

@tgross tgross added the theme/consul/connect Consul Connect integration label Dec 9, 2020
@blake
Member

blake commented Jan 3, 2021

I ran into both the "Unknown FS magic" and "cannot join network of a non running container" errors when using Nomad 0.12.9 on a Raspberry Pi 4 (Arm64).

I determined this is happening because Nomad 0.12.x defaults to using gcr.io/google_containers/pause-amd64:3.0 for the pause container, which is not compatible with the Pi's CPU architecture. My environment needs to use gcr.io/google_containers/pause-arm64:3.0 instead.

I was able to successfully start a container with a Connect sidecar after configuring infra_image with the proper image value.

HCL

plugin "docker" {
  config {
    infra_image = "gcr.io/google_containers/pause-arm64:3.0"
  }
}

JSON

{
  "plugin": {
    "docker": {
      "config": {
        "infra_image": "gcr.io/google_containers/pause-arm64:3.0"
      }
    }
  }
}

This config should not be needed in Nomad 1.0 (see PR: #8957), although I have not yet tested that version in my environment.
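
As a side note, here is a small standalone Go sketch (not part of Nomad; the image name and client options are just assumptions for illustration) that uses the Docker Go SDK to compare a locally pulled image's architecture against the host's, which is one way to confirm the pause/infra image mismatch described above:

package main

import (
	"context"
	"fmt"
	"runtime"

	"github.com/docker/docker/client"
)

func main() {
	// Connect to the local Docker daemon using the standard environment settings.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Hypothetical target: the default infra image used by Nomad 0.12.x.
	image := "gcr.io/google_containers/pause-amd64:3.0"

	// The image must already be pulled locally for inspect to succeed.
	inspect, _, err := cli.ImageInspectWithRaw(context.Background(), image)
	if err != nil {
		panic(err)
	}

	fmt.Printf("image arch: %s, host arch: %s\n", inspect.Architecture, runtime.GOARCH)
	if inspect.Architecture != runtime.GOARCH {
		fmt.Println("architecture mismatch: the infra (pause) container will not run on this host")
	}
}

On an arm64 host this should print amd64 vs arm64, matching the mismatch described in this thread.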

@nickethier
Member

Can folks test this with Nomad 1.0.1 to see if you're still seeing this error message? I think @blake's comment is spot on and is likely the cause behind most of the reports in this thread.

@blake
Member

blake commented Jan 12, 2021

@nickethier I upgraded to Nomad 1.0.1 last week, removed the infra_image config override, and am still successfully able to deploy sidecars.

This seems to have been the issue all along in my environment.

@tgross
Member

tgross commented Jan 19, 2021

Glad to hear that!

@tgross tgross closed this as completed Jan 19, 2021
Nomad - Community Issues Triage automation moved this from Triaged to Done Jan 19, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 25, 2022