Unable to exec inside a running task with exec driver #13538

Open · mr-karan opened this issue Jul 1, 2022 · 15 comments

mr-karan (Contributor) commented Jul 1, 2022

Nomad version

Nomad v1.3.1 (2b054e38e91af964d1235faa98c286ca3f527e56)

Operating system and Environment details

Distributor ID:	Pop
Description:	Pop!_OS 22.04 LTS
Release:	22.04
Codename:	jammy

Issue

I have a fairly simple job running with the exec driver. However, alloc exec fails with an unexpected EOF error:

nomad alloc exec -i -t -task app 82ff23f0 /bin/sh

failed to exec into task: unexpected EOF

Reproduction steps

  • Run a single-node Nomad agent with this config:

sudo nomad agent -config=nomad.hcl

datacenter = "dc1"
data_dir   = "/opt/nomad/data"

log_level = "INFO"

bind_addr = "0.0.0.0"

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true

  reserved {
    cores          = 2
    memory         = 1024
    disk           = 1024
    reserved_ports = "22"
  }

  meta {
    env   = "dev"
    stack = "personal"
  }

  chroot_env {
    # /bin tools
    "/bin/bash"       = "/bin/bash"
    "/bin/cat"        = "/bin/cat"
    "/bin/cp"         = "/bin/cp"
    "/bin/du"         = "/bin/du"
    "/bin/env"        = "/bin/env"
    "/bin/ip"         = "/bin/ip"
    "/bin/ls"         = "/bin/ls"
    "/bin/mkdir"      = "/bin/mkdir"
    "/bin/mv"         = "/bin/mv"
    "/bin/nano"       = "/bin/nano"
    "/bin/nc.openbsd" = "/bin/nc"
    "/bin/ps"         = "/bin/ps"
    "/bin/rm"         = "/bin/rm"
    "/bin/sh"         = "/bin/sh"

    # /etc configurations and tools
    "/etc" = "/etc"

    # /lib and /lib64 configuration
    "/lib"   = "/lib"
    "/lib64" = "/lib64"

    # DNS
    "/run/systemd/resolve/resolv.conf" = "/etc/resolv.conf"

    # /usr/bin tools
    "/usr/bin/clear"         = "/bin/clear"
    "/usr/bin/curl"          = "/bin/curl"
    "/usr/bin/dig"           = "/bin/dig"
    "/usr/bin/groups"        = "/bin/groups"
    "/usr/bin/head"          = "/bin/head"
    "/usr/bin/htop"          = "/bin/htop"
    "/usr/bin/pkill"         = "/bin/pkill"
    "/usr/bin/reset"         = "/bin/reset"
    "/usr/bin/tail"          = "/bin/tail"
    "/usr/bin/telnet.netkit" = "/bin/telnet"
    "/usr/bin/vim.basic"     = "/bin/vim"
    "/usr/bin/sleep"         = "/bin/sleep"
    "/usr/bin/mount"         = "/bin/mount"
    "/usr/bin/grep"          = "/bin/grep"
    "/usr/bin/python3.10"    = "/bin/python3"
  }
}

plugin "docker" {
  config {
    allow_privileged = true
    volumes {
      enabled = true
    }
    extra_labels = ["job_name", "job_id", "task_group_name", "task_name", "namespace", "node_name", "node_id"]
  }
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}
  • Run sleep.nomad (job file below):

nomad run sleep.nomad

Expected Result

Since the alloc status is running, you should be able to exec inside it:

nomad job status sleep                      
ID            = sleep
Name          = sleep
Submit Date   = 2022-07-01T09:47:47+05:30
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
app         0       0         1        0       1         0     0

Latest Deployment
ID          = e48eee53
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
app         1        1       1        0          2022-07-01T09:57:58+05:30

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
82ff23f0  20b51682  app         2        run      running   7m53s ago   7m41s ago
6dc95c01  20b51682  app         0        stop     complete  10m42s ago  7m49s ago

nomad alloc status --stats 82ff23f0
ID                  = 82ff23f0-ecf8-6bbc-a373-d5ecdccb1fac
Eval ID             = 3ef99f9e
Name                = sleep.app[0]
Node ID             = 20b51682
Node Name           = pop-os
Job ID              = sleep
Job Version         = 2
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 8m19s ago
Modified            = 8m7s ago
Deployment ID       = e48eee53
Deployment Health   = healthy

Allocation Addresses (mode = "bridge")
Label         Dynamic  Address
*python-http  yes      192.168.29.76:29844 -> 8888

Task "app" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/100 MHz  260 KiB/300 MiB  300 MiB  

Memory Stats
Cache  Swap     Usage
0 B    260 KiB  260 KiB

CPU Stats
Percent  System Mode  Throttled Periods  Throttled Time  User Mode
0.00%    0.00%        0                  0               0.00%

Task Events:
Started At     = 2022-07-01T04:17:48Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2022-07-01T09:47:48+05:30  Started     Task started by client
2022-07-01T09:47:47+05:30  Task Setup  Building Task Directory
2022-07-01T09:47:47+05:30  Received    Task received by client

Actual Result

However, when you actually try to exec:

nomad alloc exec -i -t -task app 82ff23f0 /bin/sh              
failed to exec into task: unexpected EOF
nomad alloc exec -i -t -task app 82ff23f0 /bin/bash
failed to exec into task: unexpected EOF
nomad alloc exec -i -t -task app 82ff23f0 ls
failed to exec into task: unexpected EOF

Job file (if appropriate)

job "sleep" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    count = 1
    network {
      mode = "bridge"
      port "python-http" {
        to = "8888"
      }
    }

    task "app" {
      driver = "exec"

      config {
        command = "bash"
        args    = ["-c", "sleep infinity"]
      }
    }
  }
}

Nomad Client logs (if appropriate)

    2022-07-01T09:58:22.515+0530 [INFO]  client: task exec session starting: exec_id=89a44a1f-dc1d-9e3c-6b26-815631f1bcd9 alloc_id=82ff23f0-ecf8-6bbc-a373-d5ecdccb1fac task=app command=["/bin/bash"] tty=true access_token_name="" access_token_id=""
    2022-07-01T09:58:22.551+0530 [INFO]  client: task exec session ended with an error: error="rpc error: code = Unknown desc = failed to start command: container_linux.go:380: starting container process caused: open /dev/ptmx: operation not permitted" code=0xc00177e048
    2022-07-01T09:58:22.551+0530 [ERROR] http: request failed: method=GET path="/v1/client/allocation/82ff23f0-ecf8-6bbc-a373-d5ecdccb1fac/exec?command=%5B%22%2Fbin%2Fbash%22%5D&region=global&task=app&tty=true" error="rpc error: code = Unknown desc = failed to start command: container_linux.go:380: starting container process caused: open /dev/ptmx: operation not permitted" code=500
    2022-07-01T09:58:39.084+0530 [INFO]  client: task exec session starting: exec_id=aa724247-a2cf-c929-795f-66969a3d9520 alloc_id=82ff23f0-ecf8-6bbc-a373-d5ecdccb1fac task=app command=["/bin/ls"] tty=true access_token_name="" access_token_id=""
    2022-07-01T09:58:39.118+0530 [INFO]  client: task exec session ended with an error: error="rpc error: code = Unknown desc = failed to start command: container_linux.go:380: starting container process caused: open /dev/ptmx: operation not permitted" code=0xc0011521e0
    2022-07-01T09:58:39.118+0530 [ERROR] http: request failed: method=GET path="/v1/client/allocation/82ff23f0-ecf8-6bbc-a373-d5ecdccb1fac/exec?command=%5B%22%2Fbin%2Fls%22%5D&region=global&task=app&tty=true" error="rpc error: code = Unknown desc = failed to start command: container_linux.go:380: starting container process caused: open /dev/ptmx: operation not permitted" code=500

Looks like it's trying to open /dev/ptmx for a reason I'm unsure of. Nomad is running under sudo here (as noted in the command above to run the server+client node).

mr-karan (Contributor Author) commented Jul 1, 2022

This seems to be an issue specific to the entries I've listed in chroot_env. With the default list from here, I don't see EOFs anymore.

  chroot_env {
    "/bin/"           = "/bin/"
    "/etc/"           = "/etc/"
    "/lib"            = "/lib"
    "/lib32"          = "/lib32"
    "/lib64"          = "/lib64"
    "/run/resolvconf" = "/run/resolvconf"
    "/sbin"           = "/sbin"
    "/usr"            = "/usr"
  }

/dev/ptmx is unavailable on my host so I am not sure what I'm missing in my list.

tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jul 5, 2022
tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jul 5, 2022
tgross self-assigned this Jul 5, 2022
tgross (Member) commented Jul 5, 2022

Hi @mr-karan! I tried to reproduce this problem and wasn't able to, but this was on a machine running Ubuntu 22.04 rather than Pop!_OS, because I couldn't find a Vagrant box I trusted with Pop. The only obvious difference was that for my chroot_env I couldn't include the /run/systemd/resolve/resolv.conf mapping. (I also checked this on Ubuntu 20.04 and got the same results.)

/dev/ptmx is unavailable on my host so I am not sure what I'm missing in my list.

That /dev/ptmx is needed to create the pseudo-tty (see ptmx(4)). On my host, that looks like this:

$ ls -lah /dev/ptmx
crw-rw-rw- 1 root tty 5, 2 Jul  5 18:44 /dev/ptmx

But according to the kernel docs for the devpts filesystem:

As an option instead of placing a /dev/ptmx device node at /dev/ptmx it is possible to place a symlink to /dev/pts/ptmx at /dev/ptmx or to bind mount /dev/pts/ptmx to /dev/ptmx. If you opt for using the devpts filesystem in this manner devpts should be mounted with the ptmxmode=0666, or chmod 0666 /dev/pts/ptmx should be called.

If I look at /dev/pts:

$ ls -lah /dev/pts
total 0
drwxr-xr-x  2 root    root      0 Jul  5 18:13 .
drwxr-xr-x 19 root    root   3.9K Jul  5 18:25 ..
crw--w----  1 vagrant tty  136, 0 Jul  5 18:42 0
crw--w----  1 root    tty  136, 1 Jul  5 18:42 1
crw--w----  1 vagrant tty  136, 2 Jul  5 18:44 2
c---------  1 root    root   5, 2 Jul  5 18:13 ptmx

And then I'll look inside my container:

$ nomad alloc exec 4e4 /bin/bash
nobody@test:/$ ls -lah /dev
...
lrwxrwxrwx  1 root root    8 Jul  5 18:43 ptmx -> pts/ptmx
drwxr-xr-x  2 root root    0 Jul  5 18:43 pts
...

The shared executor mounts /dev/pts as devpts, and now I suspect we have a bug where that won't work if you've got a symlink there. But you say you don't have a /dev/ptmx at all? You're getting the error "not permitted" as though you don't have permissions, rather than the "no such file or directory" that I'd expect (see docker/cli#2067 for an example of that).
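
To compare environments, here's a quick way to check how devpts is mounted and whether /dev/ptmx is a device node or a symlink on a given host (standard util-linux mount and GNU stat, nothing Nomad-specific):

# List devpts mounts and their options (look for ptmxmode=...):
mount -t devpts
# Print file type, octal mode, and symlink target (if any) for each path:
stat -c '%F %a %N' /dev/ptmx /dev/pts/ptmx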

It's very strange that changing the chroot_env helps. If it were simply unset, I would think that when we set it we're somehow overwriting something in the task directory build. But the fact that you get positive results just from changing its contents has me stumped for the moment.

  • Can you provide ls -lah /dev/pts and ls -lah /dev/ptmx on the host?
  • Can you provide ls -lah /dev inside the container, both with and without the changed chroot_env? You should be able to get this info by running the ls as the command and grabbing the logs; see the sketch after this list.
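
For example, something like the following should work (hypothetical task name and alloc ID; substitute your own):

# Run ls non-interactively via alloc exec and capture the output:
nomad alloc exec -i=false -t=false -task app 82ff23f0 ls -lah /dev
# Or, if exec itself fails, set ls as the task command in the job file
# and read the result back from the task logs:
nomad alloc logs 82ff23f0 app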

tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Jul 5, 2022
tgross moved this from In Progress to Triaging in Nomad - Community Issues Triage Jul 8, 2022
tgross (Member) commented Aug 15, 2022

This one has been open for a bit without the information we'd need to figure things out. I'm going to close it for now, but please feel free to reopen if you have more info!

tgross closed this as not planned (won't fix, can't repro, duplicate, stale) Aug 15, 2022
Nomad - Community Issues Triage automation moved this from Triaging to Done Aug 15, 2022
mr-karan (Contributor Author) commented Sep 6, 2022

@tgross Hi, can we please reopen this? Sorry, this slipped my attention. I am still facing the above issue, and this time I was able to reproduce it on Ubuntu 22.04 (Minimal) as well.

  1. On the host:
$ ls -lah /dev/pts              
total 0
drwxr-xr-x  2 root   root      0 Aug  8 16:57 .
drwxr-xr-x 15 root   root   3.2K Aug  9 06:58 ..
crw--w----  1 ubuntu tty  136, 0 Sep  6 09:45 0
c---------  1 root   root   5, 2 Aug  8 16:57 ptmx

$ ls -lah /dev/ptmx
crw-rw-rw- 1 root tty 5, 2 Sep  6 09:46 /dev/ptmx

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04 LTS
Release:	22.04
Codename:	jammy

2a. I am facing this issue with the default chroot_env as well, so I think it was just a fluke that changing chroot_env to the default appeared to work. The problem persists even with the default chroot_env, as far as I can see.

$ nomad alloc exec -i=false -t=false -namespace=app -task front aad9a260 ls -laht /dev/
total 4.0K
drwxr-xr-x  5 root 0  340 Sep  6 07:30 .
lrwxrwxrwx  1 root 0   11 Sep  6 07:30 core -> /proc/kcore
lrwxrwxrwx  1 root 0   13 Sep  6 07:30 fd -> /proc/self/fd
crw-rw-rw-  1 root 0 1, 7 Sep  6 07:30 full
crw-rw-rw-  1 root 0 1, 3 Sep  6 07:30 null
lrwxrwxrwx  1 root 0    8 Sep  6 07:30 ptmx -> pts/ptmx
drwxr-xr-x  2 root 0    0 Sep  6 07:30 pts
crw-rw-rw-  1 root 0 1, 8 Sep  6 07:30 random
drwxrwxrwt  2 root 0   40 Sep  6 07:30 shm
lrwxrwxrwx  1 root 0   15 Sep  6 07:30 stderr -> /proc/self/fd/2
lrwxrwxrwx  1 root 0   15 Sep  6 07:30 stdin -> /proc/self/fd/0
lrwxrwxrwx  1 root 0   15 Sep  6 07:30 stdout -> /proc/self/fd/1
crw-rw-rw-  1 root 0 5, 0 Sep  6 07:30 tty
crw-rw-rw-  1 root 0 1, 9 Sep  6 07:30 urandom
crw-rw-rw-  1 root 0 1, 5 Sep  6 07:30 zero
drwxrwxrwt  2 root 0   40 Sep  6 07:30 mqueue
drwxrwxrwx 16 root 0 4.0K Sep  6 07:30 ..

$ nomad alloc exec -i=false -t=false -namespace=app -task front aad9a260 ls -laht /dev/pts
total 0
drwxr-xr-x 2 root 0    0 Sep  6 07:30 .
drwxr-xr-x 5 root 0  340 Sep  6 07:30 ..
crw-rw-rw- 1 root 0 5, 2 Sep  6 07:30 ptmx

$ nomad alloc exec -i=false -t=false -namespace=app -task front aad9a260 ls -laht /dev/ptmx
lrwxrwxrwx 1 root 0 8 Sep  6 07:30 /dev/ptmx -> pts/ptmx

2b. With the modified chroot_env:

$ nomad alloc exec -i=false -t=false -namespace=ns -task app 98a5651c ls -laht /dev/
total 4.0K
crw-rw-rw-  1 root 5 5, 0 Sep  6 08:12 tty
crw-rw-rw-  1 root 0 1, 9 Sep  6 08:12 urandom
crw-rw-rw-  1 root 0 1, 7 Sep  6 08:12 full
crw-rw-rw-  1 root 0 1, 3 Sep  6 08:12 null
crw-rw-rw-  1 root 0 1, 8 Sep  6 08:12 random
crw-rw-rw-  1 root 0 1, 5 Sep  6 08:12 zero
drwxr-xr-x  5 root 0  340 Sep  6 08:02 .
drwxrwxrwx 14 root 0 4.0K Sep  6 08:02 ..
lrwxrwxrwx  1 root 0   11 Sep  6 08:02 core -> /proc/kcore
lrwxrwxrwx  1 root 0   13 Sep  6 08:02 fd -> /proc/self/fd
lrwxrwxrwx  1 root 0    8 Sep  6 08:02 ptmx -> pts/ptmx
drwxr-xr-x  2 root 0    0 Sep  6 08:02 pts
drwxrwxrwt  2 root 0   40 Sep  6 08:02 shm
lrwxrwxrwx  1 root 0   15 Sep  6 08:02 stderr -> /proc/self/fd/2
lrwxrwxrwx  1 root 0   15 Sep  6 08:02 stdin -> /proc/self/fd/0
lrwxrwxrwx  1 root 0   15 Sep  6 08:02 stdout -> /proc/self/fd/1
drwxrwxrwt  2 root 0   40 Sep  6 08:02 mqueue

$ nomad alloc exec -i=false -t=false -namespace=ns -task app 98a5651c ls -laht /dev/pts
total 0
crw--w---- 1 nobody 5 136, 1 Sep  6 10:12 1
crw--w---- 1 nobody 5 136, 0 Sep  6 10:12 0
drwxr-xr-x 2 root   0      0 Sep  6 08:02 .
drwxr-xr-x 5 root   0    340 Sep  6 08:02 ..
crw-rw-rw- 1 root   0   5, 2 Sep  6 08:02 ptmx

$ nomad alloc exec -i=false -t=false -namespace=ns -task app 98a5651c ls -laht /dev/ptmx
lrwxrwxrwx 1 root 0 8 Sep  6 08:02 /dev/ptmx -> pts/ptmx

Strangely, this issue isn't limited to just the exec driver; I am able to replicate it with the raw_exec driver too.

But here, I get a proper error message instead of simply EOF:

nomad alloc exec -i -t -namespace=logging -task vector 75501004 ls -laht /dev
failed to exec into task: rpc error: code = Unknown desc = failed to open a tty: open /dev/ptmx: operation not permitted
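
(A side note: one way to test whether /dev/ptmx can be opened from inside the task, independent of Nomad's own tty allocation, is to run a command that allocates a pty itself. This assumes util-linux script is available inside the task filesystem:)

# script(1) opens /dev/ptmx itself to allocate a pseudo-terminal, so an
# EPERM here points at the task environment rather than Nomad's exec path:
nomad alloc exec -i=false -t=false -namespace=logging -task vector 75501004 script -qec true /dev/null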

Contents of /dev inside this raw_exec app:

nomad alloc exec -i=false -t=false -namespace=logging -task vector 75501004 ls -laht /dev
total 4.0K
crw-rw-rw-  1 root tty       5,   2 Sep  6 09:47 ptmx
drwxr-xr-x  2 root root        2.9K Aug  9 06:58 char
drwxr-xr-x 15 root root        3.2K Aug  9 06:58 .
drwxr-xr-x 18 root root         360 Aug  9 06:58 cpu
crw-rw-rw-  1 root tty       5,   0 Aug  8 17:10 tty
crw--w----  1 root tty       4,  64 Aug  8 17:08 ttyS0
drwxr-xr-x  2 root root         280 Aug  8 17:08 block
brw-rw----  1 root disk      7,   5 Aug  8 17:08 loop5
brw-rw----  1 root disk      7,   4 Aug  8 17:08 loop4
crw-rw----  1 root tty       7,   2 Aug  8 17:08 vcs2
crw-rw----  1 root tty       7,   3 Aug  8 17:08 vcs3
crw-rw----  1 root tty       7,   4 Aug  8 17:08 vcs4
crw-rw----  1 root tty       7,   5 Aug  8 17:08 vcs5
crw-rw----  1 root tty       7,   6 Aug  8 17:08 vcs6
crw-rw----  1 root tty       7, 128 Aug  8 17:08 vcsa
crw-rw----  1 root tty       7, 129 Aug  8 17:08 vcsa1
crw-rw----  1 root tty       7, 130 Aug  8 17:08 vcsa2
crw-rw----  1 root tty       7, 131 Aug  8 17:08 vcsa3
crw-rw----  1 root tty       7, 132 Aug  8 17:08 vcsa4
crw-rw----  1 root tty       7, 133 Aug  8 17:08 vcsa5
crw-rw----  1 root tty       7, 134 Aug  8 17:08 vcsa6
crw-rw----  1 root tty       7,  64 Aug  8 17:08 vcsu
crw-rw----  1 root tty       7,  65 Aug  8 17:08 vcsu1
crw-rw----  1 root tty       7,  66 Aug  8 17:08 vcsu2
crw-rw----  1 root tty       7,  67 Aug  8 17:08 vcsu3
crw-rw----  1 root tty       7,  68 Aug  8 17:08 vcsu4
crw-rw----  1 root tty       7,  69 Aug  8 17:08 vcsu5
crw-rw----  1 root tty       7,  70 Aug  8 17:08 vcsu6
brw-rw----  1 root disk    259,   2 Aug  8 17:08 nvme0n1p14
crw--w----  1 root tty       4,   0 Aug  8 17:08 tty0
crw--w----  1 root tty       4,  13 Aug  8 17:08 tty13
crw--w----  1 root tty       4,  14 Aug  8 17:08 tty14
crw--w----  1 root tty       4,  15 Aug  8 17:08 tty15
crw--w----  1 root tty       4,  16 Aug  8 17:08 tty16
crw--w----  1 root tty       4,  17 Aug  8 17:08 tty17
crw--w----  1 root tty       4,  18 Aug  8 17:08 tty18
crw--w----  1 root tty       4,  19 Aug  8 17:08 tty19
crw--w----  1 root tty       4,   2 Aug  8 17:08 tty2
crw--w----  1 root tty       4,  20 Aug  8 17:08 tty20
crw--w----  1 root tty       4,  21 Aug  8 17:08 tty21
crw--w----  1 root tty       4,  22 Aug  8 17:08 tty22
crw--w----  1 root tty       4,  23 Aug  8 17:08 tty23
crw--w----  1 root tty       4,  24 Aug  8 17:08 tty24
crw--w----  1 root tty       4,  25 Aug  8 17:08 tty25
crw--w----  1 root tty       4,  26 Aug  8 17:08 tty26
crw--w----  1 root tty       4,  27 Aug  8 17:08 tty27
crw--w----  1 root tty       4,  28 Aug  8 17:08 tty28
crw--w----  1 root tty       4,  29 Aug  8 17:08 tty29
crw--w----  1 root tty       4,   3 Aug  8 17:08 tty3
crw--w----  1 root tty       4,  30 Aug  8 17:08 tty30
crw--w----  1 root tty       4,  31 Aug  8 17:08 tty31
crw--w----  1 root tty       4,  32 Aug  8 17:08 tty32
crw--w----  1 root tty       4,  33 Aug  8 17:08 tty33
crw--w----  1 root tty       4,  34 Aug  8 17:08 tty34
crw--w----  1 root tty       4,  35 Aug  8 17:08 tty35
crw--w----  1 root tty       4,  36 Aug  8 17:08 tty36
crw--w----  1 root tty       4,  37 Aug  8 17:08 tty37
crw--w----  1 root tty       4,  38 Aug  8 17:08 tty38
crw--w----  1 root tty       4,  39 Aug  8 17:08 tty39
crw--w----  1 root tty       4,   4 Aug  8 17:08 tty4
crw--w----  1 root tty       4,  40 Aug  8 17:08 tty40
crw--w----  1 root tty       4,  41 Aug  8 17:08 tty41
crw--w----  1 root tty       4,  42 Aug  8 17:08 tty42
crw--w----  1 root tty       4,  43 Aug  8 17:08 tty43
crw--w----  1 root tty       4,  44 Aug  8 17:08 tty44
crw--w----  1 root tty       4,  45 Aug  8 17:08 tty45
crw--w----  1 root tty       4,  46 Aug  8 17:08 tty46
crw--w----  1 root tty       4,  47 Aug  8 17:08 tty47
crw--w----  1 root tty       4,  48 Aug  8 17:08 tty48
crw--w----  1 root tty       4,  49 Aug  8 17:08 tty49
crw--w----  1 root tty       4,   5 Aug  8 17:08 tty5
crw--w----  1 root tty       4,  50 Aug  8 17:08 tty50
crw--w----  1 root tty       4,  51 Aug  8 17:08 tty51
crw--w----  1 root tty       4,  52 Aug  8 17:08 tty52
crw--w----  1 root tty       4,  53 Aug  8 17:08 tty53
crw--w----  1 root tty       4,  54 Aug  8 17:08 tty54
crw--w----  1 root tty       4,  55 Aug  8 17:08 tty55
crw--w----  1 root tty       4,  56 Aug  8 17:08 tty56
crw--w----  1 root tty       4,  57 Aug  8 17:08 tty57
crw--w----  1 root tty       4,  58 Aug  8 17:08 tty58
crw--w----  1 root tty       4,  59 Aug  8 17:08 tty59
crw--w----  1 root tty       4,   6 Aug  8 17:08 tty6
crw--w----  1 root tty       4,  60 Aug  8 17:08 tty60
crw--w----  1 root tty       4,  61 Aug  8 17:08 tty61
crw--w----  1 root tty       4,  62 Aug  8 17:08 tty62
crw--w----  1 root tty       4,  63 Aug  8 17:08 tty63
crw--w----  1 root tty       4,   7 Aug  8 17:08 tty7
crw--w----  1 root tty       4,   8 Aug  8 17:08 tty8
crw--w----  1 root tty       4,   9 Aug  8 17:08 tty9
crw-------  1 root root      5,   3 Aug  8 17:08 ttyprintk
crw-rw----  1 root tty       7,   0 Aug  8 17:08 vcs
crw-rw----  1 root tty       7,   1 Aug  8 17:08 vcs1
crw-r--r--  1 root root     10, 235 Aug  8 17:08 autofs
crw--w----  1 root tty       5,   1 Aug  8 17:08 console
crw-------  1 root root     10, 124 Aug  8 17:08 cpu_dma_latency
crw-------  1 root root     10, 126 Aug  8 17:08 ecryptfs
crw-rw-rw-  1 root root     10, 229 Aug  8 17:08 fuse
crw-------  1 root root     10, 228 Aug  8 17:08 hpet
crw-------  1 root root     10, 183 Aug  8 17:08 hwrng
brw-rw----  1 root disk      7,   0 Aug  8 17:08 loop0
brw-rw----  1 root disk      7,   1 Aug  8 17:08 loop1
brw-rw----  1 root disk      7,   2 Aug  8 17:08 loop2
brw-rw----  1 root disk      7,   3 Aug  8 17:08 loop3
brw-rw----  1 root disk      7,   6 Aug  8 17:08 loop6
brw-rw----  1 root disk      7,   7 Aug  8 17:08 loop7
crw-rw----  1 root disk     10, 237 Aug  8 17:08 loop-control
crw-------  1 root root     10, 227 Aug  8 17:08 mcelog
crw-------  1 root root    108,   0 Aug  8 17:08 ppp
crw-------  1 root root     10,   1 Aug  8 17:08 psaux
crw-rw-r--  1 root root     10, 242 Aug  8 17:08 rfkill
crw-------  1 root root     10, 231 Aug  8 17:08 snapshot
crw--w----  1 root tty       4,   1 Aug  8 17:08 tty1
crw--w----  1 root tty       4,  10 Aug  8 17:08 tty10
crw--w----  1 root tty       4,  11 Aug  8 17:08 tty11
crw--w----  1 root tty       4,  12 Aug  8 17:08 tty12
crw-rw----  1 root kvm      10, 125 Aug  8 17:08 udmabuf
crw-------  1 root root     10, 223 Aug  8 17:08 uinput
crw-rw-rw-  1 root root      1,   9 Aug  8 17:08 urandom
crw-------  1 root root     10, 127 Aug  8 17:08 vga_arbiter
crw-rw-rw-  1 root root      1,   5 Aug  8 17:08 zero
crw-rw-rw-  1 root root      1,   7 Aug  8 17:08 full
crw-r--r--  1 root root      1,  11 Aug  8 17:08 kmsg
crw-r-----  1 root kmem      1,   1 Aug  8 17:08 mem
crw-rw-rw-  1 root root      1,   3 Aug  8 17:08 null
crw-r-----  1 root kmem      1,   4 Aug  8 17:08 port
crw-rw-rw-  1 root root      1,   8 Aug  8 17:08 random
brw-rw----  1 root disk    259,   1 Aug  8 17:08 nvme0n1p1
brw-rw----  1 root disk    259,   3 Aug  8 17:08 nvme0n1p15
brw-rw----  1 root disk    259,   0 Aug  8 17:08 nvme0n1
lrwxrwxrwx  1 root root           4 Aug  8 17:08 rtc -> rtc0
crw-rw----  1 root dialout   4,  65 Aug  8 17:08 ttyS1
crw-rw----  1 root dialout   4,  66 Aug  8 17:08 ttyS2
crw-rw----  1 root dialout   4,  67 Aug  8 17:08 ttyS3
crw-------  1 root root    241,   0 Aug  8 17:08 ng0n1
crw-------  1 root root    242,   0 Aug  8 17:08 nvme0
crw-------  1 root root    248,   0 Aug  8 17:08 rtc0
drwxrwxrwt  3 root root          60 Aug  8 16:57 shm
drwxr-xr-x 19 root root        4.0K Aug  8 16:57 ..
drwxr-xr-x  3 root root         180 Aug  8 16:57 input
drwxr-xr-x  7 root root         140 Aug  8 16:57 disk
crw-rw----  1 root kvm      10, 238 Aug  8 16:57 vhost-net
crw-rw----  1 root kvm      10, 241 Aug  8 16:57 vhost-vsock
crw-------  1 root root     10, 234 Aug  8 16:57 btrfs-control
crw-------  1 root root     10, 203 Aug  8 16:57 cuse
crw-------  1 root root     10, 144 Aug  8 16:57 nvram
crw-------  1 root root     10, 249 Aug  8 16:57 zfs
drwxr-xr-x  2 root root           0 Aug  8 16:57 hugepages
lrwxrwxrwx  1 root root          28 Aug  8 16:57 log -> /run/systemd/journal/dev-log
lrwxrwxrwx  1 root root          12 Aug  8 16:57 initctl -> /run/initctl
lrwxrwxrwx  1 root root          11 Aug  8 16:57 core -> /proc/kcore
lrwxrwxrwx  1 root root          13 Aug  8 16:57 fd -> /proc/self/fd
lrwxrwxrwx  1 root root          15 Aug  8 16:57 stderr -> /proc/self/fd/2
lrwxrwxrwx  1 root root          15 Aug  8 16:57 stdin -> /proc/self/fd/0
lrwxrwxrwx  1 root root          15 Aug  8 16:57 stdout -> /proc/self/fd/1
drwxr-xr-x  2 root root           0 Aug  8 16:57 pts
brw-------  1 root root    259,   1 Aug  8 16:57 root
drwxr-xr-x  2 root root          60 Aug  8 16:57 mapper
drwxr-xr-x  2 root root          60 Aug  8 16:57 vfio
drwxr-xr-x  2 root root          60 Aug  8 16:57 net
drwxr-xr-x  2 root root          60 Aug  8 16:57 dma_heap
drwxrwxrwt  2 root root          40 Aug  8 16:57 mqueue

So, I found a bit of an oddity here (not sure whether what I am seeing is correct or not): the raw_exec task doesn't have a symlink at /dev/ptmx, but the exec task does. However, alloc exec fails for both of them when -i=true -t=true.

# raw_exec
$ nomad alloc exec -i=false -t=false -namespace=logging -task vector 75501004 ls -laht /dev/ptmx
crw-rw-rw- 1 root tty 5, 2 Sep  6 10:01 /dev/ptmx

# exec
$ nomad alloc exec -i=false -t=false -namespace=ns -task app 98a5651c ls -laht /dev/ptmx
lrwxrwxrwx 1 root 0 8 Sep  6 08:02 /dev/ptmx -> pts/ptmx

NOTE: This seems to be an intermittent issue; it doesn't always happen, so it can be a bit hard to debug. I just restarted the alloc and am able to exec into it normally. These are the commands I ran from within the container (exec driver) after I restarted the alloc:

bash-5.1$ ls -laht /dev/
total 4.0K
drwxrwxrwx 17 root 0 4.0K Sep  6 11:35 ..
drwxr-xr-x  5 root 0  340 Sep  6 11:35 .
lrwxrwxrwx  1 root 0   11 Sep  6 11:35 core -> /proc/kcore
lrwxrwxrwx  1 root 0   13 Sep  6 11:35 fd -> /proc/self/fd
lrwxrwxrwx  1 root 0    8 Sep  6 11:35 ptmx -> pts/ptmx
drwxr-xr-x  2 root 0    0 Sep  6 11:35 pts
drwxrwxrwt  2 root 0   40 Sep  6 11:35 shm
lrwxrwxrwx  1 root 0   15 Sep  6 11:35 stderr -> /proc/self/fd/2
lrwxrwxrwx  1 root 0   15 Sep  6 11:35 stdin -> /proc/self/fd/0
lrwxrwxrwx  1 root 0   15 Sep  6 11:35 stdout -> /proc/self/fd/1
drwxrwxrwt  2 root 0   40 Sep  6 11:35 mqueue
crw-rw-rw-  1 root 5 5, 0 Aug  8 17:10 tty
crw-rw-rw-  1 root 0 1, 9 Aug  8 17:08 urandom
crw-rw-rw-  1 root 0 1, 5 Aug  8 17:08 zero
crw-rw-rw-  1 root 0 1, 7 Aug  8 17:08 full
crw-rw-rw-  1 root 0 1, 3 Aug  8 17:08 null
crw-rw-rw-  1 root 0 1, 8 Aug  8 17:08 random

bash-5.1$ ls -laht /dev/pts
total 0
crw--w---- 1 nobody 5 136, 0 Sep  6 11:36 0
crw-rw-rw- 1 root   0   5, 2 Sep  6 11:36 ptmx
drwxr-xr-x 2 root   0      0 Sep  6 11:35 .
drwxr-xr-x 5 root   0    340 Sep  6 11:35 ..

bash-5.1$ ls -laht /dev/ptmx
lrwxrwxrwx 1 root 0 8 Sep  6 11:35 /dev/ptmx -> pts/ptmx
bash-5.1$ 

Please let me know any additional details you'd like me to provide; I'll be happy to.

tgross reopened this Sep 6, 2022
Nomad - Community Issues Triage automation moved this from Done to Needs Triage Sep 6, 2022
tgross (Member) commented Sep 6, 2022

Re-opening, but possible dupe of #12877?

mr-karan (Contributor Author) commented Sep 6, 2022

Possibly, but as I noted above, this isn't limited to just the raw_exec driver that #12877 mentions. The error message does seem familiar, though (failed to open a tty: open /dev/ptmx: operation not permitted).

tgross (Member) commented Sep 12, 2022

I've had a look at #14372 and, after a conversation with my colleagues, I'm reasonably confident that this issue will be covered by that fix as well.

mr-karan (Contributor Author):

@tgross Awesome, good to know that! We can close this if you want :)

tgross (Member) commented Sep 13, 2022

Let's keep this open until we've verified that.

tgross added this to the 1.4.x milestone Sep 29, 2022
tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Oct 3, 2022
mr-karan (Contributor Author):

Bump. I was hoping a fix for this would arrive in 1.4.x; not being able to exec inside tasks is quite a bummer for us. Is there any more debugging information I can provide here?

Thanks!

tgross (Member) commented Nov 17, 2022

I don't think so; we're fairly certain this is blocked on #14373 and #14372, but we haven't had a chance to complete that work. I know we're in the process of updating our nightly E2E environment to include cgroups v2, so that will likely help expedite getting this resolved.
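
(For anyone following along, a standard way to check which cgroup mode a client node is running, nothing Nomad-specific:)

# Prints "cgroup2fs" on a cgroups v2 host and "tmpfs" on v1/hybrid hosts:
stat -fc %T /sys/fs/cgroup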

tgross removed this from the 1.4.x milestone Feb 15, 2023
mriddell-foundry:

Just curious, are there any updates on this issue, or is there any workaround that does not involve restarting the allocation?

tgross (Member) commented Mar 13, 2023

Hi, unfortunately no. I still haven't been able to reproduce this either, but we have a fairly strong suspicion it's related to cgroups v2 issues. We've got someone planning to dig into that as part of our next major release cycle.

tgross (Member) commented May 18, 2023

I suspect issue #17200 has revealed the problem we're running into here, although I never got a reproduction so it's hard to be sure. Unfortunately, the exec driver doesn't have options to expand the list of allowed devices. But if someone who can reliably reproduce this issue wanted to try commenting out executor_universal_linux.go#L78-L81, that would confirm it's what we're seeing here and give us a direction for a fix.
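
For anyone who'd rather not rebuild Nomad first, it may also be possible to check from the outside whether the device allowlist is the culprit. A rough sketch; the cgroup paths here are guesses, since the real path depends on the Nomad version and cgroup mode:

# cgroups v1: the allowlist is a readable file; look for "c 5:2" (ptmx):
cat /sys/fs/cgroup/devices/nomad/<alloc-id>.<task>/devices.list
# cgroups v2: the device policy is an attached eBPF program; verify one
# is present with bpftool:
sudo bpftool cgroup list /sys/fs/cgroup/nomad.slice/<alloc-id>.<task>.scope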

tgross (Member) commented Jun 15, 2023

This might be fixed by #17535, but because I was never able to repro this, I can't really be sure.
