Change in behaviour in exec driver capabilities between 1.5.6 and 1.6.0-beta #17780

Closed
sofixa opened this issue Jun 30, 2023 · 4 comments · Fixed by #17881

Comments

@sofixa
Contributor

sofixa commented Jun 30, 2023

Nomad version

Nomad v1.5.6+ent
BuildDate 2023-05-19T18:47:58Z
Revision bf13f39

vs

Nomad v1.6.0-beta.1+ent
BuildDate 2023-06-27T16:32:15Z
Revision b26558f

Operating system and Environment details

Ubuntu 22.04, same hardware, same OS, same setup.

Issue

I'm deploying Vault with the exec task driver, and on 1.5.6 it works fine; on 1.6.0-beta it fails to start due to not being able to mlock:

Error initializing core: Failed to lock memory: cannot allocate memory

This usually means that the mlock syscall is not available.
Vault uses mlock to prevent memory from being swapped to
disk. This requires root privileges as well as a machine
that supports mlock. Please enable mlock on your system or
disable Vault from using it. To disable Vault from using it,
set the `disable_mlock` configuration option in your configuration
file

Expected Result

It should work the same way on 1.5.6 and 1.6.0-beta. As far as I can tell, it should only work with cap_add = ["ipc_lock"], which isn't a default capability, yet that doesn't seem to be required on 1.5.6.

Job file (if appropriate)

Note: a Nomad Variable nomad/jobs/vault with a key of license is needed because I'm testing with Vault enterprise.

variable "vault_version" {
  type = string
  default = "1.14.0+ent"
}

job "vault" {
  datacenters = ["*"]
  type = "service"
  group "vault" {
    count = 1
    network {
      mode = "bridge"
      port "api" {
        static = 8200
        to = 8200
      }
      port "cluster" {
        static = 8201
        to = 8201
      }
    }

    task "vault.service" {
      driver = "exec"
      resources {
        cpu = 20
        memory = 512
      }
      artifact {
        source      = "https://releases.hashicorp.com/vault/${var.vault_version}/vault_${var.vault_version}_${attr.kernel.name}_${attr.cpu.arch}.zip"
        destination = "${NOMAD_ALLOC_DIR}/artifacts/"
      }
      template {
        data        = <<EOH
        ui = true
        storage "raft" {
          path = "{{ env "NOMAD_SECRETS_DIR" }}"
          node_id = "{{ env "NOMAD_ALLOC_NAME" | replaceAll "[" "" | replaceAll "]" "" | replaceAll "." "_" }}"
        }
        listener "tcp" {
          address = "0.0.0.0:8200" #"{{ env "NOMAD_ADDR_api" }}"
          tls_disable = 1
        }
        cluster_addr = "http://0.0.0.0:8201" #"http://{{ env "NOMAD_ADDR_cluster" }}"
        api_addr = "http://0.0.0.0:8200" #"http://{{ env "NOMAD_ADDR_api" }}"
        license_path = "{{ env "NOMAD_SECRETS_DIR" }}/license"
        EOH
        destination = "${NOMAD_ALLOC_DIR}/configuration/vault.hcl"
      }
      template {
        data = <<EOH
          {{ with nomadVar "nomad/jobs/vault" }}{{ .license }}{{ end }}
        EOH
        destination = "${NOMAD_SECRETS_DIR}/license"
      }

      config {
        command = "${NOMAD_ALLOC_DIR}/artifacts/vault"
        args = ["server", "-config=${NOMAD_ALLOC_DIR}/configuration/vault.hcl"]
        # needed for 1.6.0-beta
        # cap_add = ["ipc_lock"]
      }
      service {
        name = "vault-api"
        port = "api"
        provider = "nomad"
        check {
          name = "healthy"
          type = "http"
          path = "/v1/sys/health?sealedcode=210&standbycode=210&performancestandbycode=211&uninitcode=212"
          interval = "10s"
          timeout  = "5s"
        }

        check {
          name = "active"
          on_update = "ignore"
          type = "http"
          path = "/v1/sys/health"
          interval = "10s"
          timeout  = "5s"
        }
        check {
          name = "active_or_standby"
          on_update = "ignore"
          type = "http"
          path = "/v1/sys/health?perfstandbyok=true"
          interval = "10s"
          timeout  = "5s"
        }
      }
      service {
        name = "vault-cluster"
        port = "cluster"
        provider = "nomad"
        check {
          type = "tcp"
          interval = "10s"
          timeout  = "5s"
        }
      }
    }
  }
}

Plugin config in the client config file needed to be able to add the capabilities (mandatory on 1.6):

    plugin "exec" {
      config {
        allow_caps = ["audit_write", "chown", "dac_override", "fowner", "fsetid", "kill", "mknod", "net_bind_service", "setfcap", "setgid", "setpcap", "setuid", "sys_chroot", "ipc_lock"]
      }
    }
@schmichael schmichael added this to the 1.6.0 milestone Jun 30, 2023
@schmichael
Member

Using an OSS version of your jobspec (vault.nomad.hcl), I can reproduce this: it runs without modification in Nomad v1.5.6, but fails with the "Error initializing core: Failed to lock memory: cannot allocate memory" error in v1.6.0-beta.1.

There are at least 2 factors that affect mlock access: capabilities and ulimits. Without the CAP_IPC_LOCK/ipc_lock capability, a process can lock at most RLIMIT_MEMLOCK bytes; with the capability, that limit is not enforced.
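The interaction between these two factors can be sketched in Python. This is a minimal illustration of the rule described in the mlock(2) man page, not code from Nomad or Vault; `can_lock` and `current_memlock_limit` are hypothetical helper names, and the number of bytes a given call counts against the limit is treated as an input rather than computed.

```python
import resource


def can_lock(nbytes, soft_limit, has_ipc_lock):
    """Return True if the RLIMIT_MEMLOCK check would allow locking nbytes.

    Hypothetical helper: models only the limit check, not the syscall itself.
    """
    if has_ipc_lock:
        # CAP_IPC_LOCK exempts the process from RLIMIT_MEMLOCK entirely.
        return True
    if soft_limit == resource.RLIM_INFINITY:
        # "unlimited", as in the v1.5.6 prlimit output.
        return True
    return nbytes <= soft_limit


def current_memlock_limit():
    """Return the (soft, hard) RLIMIT_MEMLOCK of the current process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


if __name__ == "__main__":
    soft, hard = current_memlock_limit()
    print("soft:", soft, "hard:", hard)
```

Running this inside a task on each Nomad version would print the same soft and hard values that prlimit reports for MEMLOCK.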

A simple batch job that dumps capabilities and ulimits shows that, while neither 1.5 nor 1.6 grants the ipc_lock capability by default, Nomad v1.5 leaves RLIMIT_MEMLOCK unlimited, while v1.6 applies a finite limit.

# v1.5.6
>> /proc
CapInh:	00000000a80405fb
CapPrm:	00000000a80405fb
CapEff:	00000000a80405fb
CapBnd:	00000000a80405fb
CapAmb:	00000000a80405fb
>> getpcaps
0: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap=eip
>> prlimit
RESOURCE   DESCRIPTION                             SOFT      HARD UNITS
AS         address space limit                unlimited unlimited bytes
CORE       max core file size                         0         0 bytes
CPU        CPU time                           unlimited unlimited seconds
DATA       max data size                      unlimited unlimited bytes
FSIZE      max file size                      unlimited unlimited bytes
LOCKS      max number of file locks held      unlimited unlimited locks
MEMLOCK    max locked-in-memory address space unlimited unlimited bytes
MSGQUEUE   max bytes in POSIX mqueues            819200    819200 bytes
NICE       max nice prio allowed to raise             0         0 
NOFILE     max number of open files                1024   1048576 files
NPROC      max number of processes               255364    255364 processes
RSS        max resident set size              unlimited unlimited bytes
RTPRIO     max real-time priority                     0         0 
RTTIME     timeout for real-time tasks        unlimited unlimited microsecs
SIGPENDING max number of pending signals         255364    255364 signals
STACK      max stack size                       8388608 unlimited bytes
# 1.6.0
# ... identical except for:
MEMLOCK    max locked-in-memory address space 8384729088 8384729088 bytes

So adding the ipc_lock capability works around the newly added memlock limit.
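The getpcaps line in the dump can be cross-checked against the raw CapEff mask. The following is a minimal Python sketch, not part of the original debugging session; `decode_caps` and `CAP_NAMES` are hypothetical names, and the bit numbering follows Linux's capability.h (only bits 0 to 31 are listed here).

```python
# Capability names indexed by bit number, per linux/capability.h (bits 0-31).
CAP_NAMES = [
    "chown", "dac_override", "dac_read_search", "fowner",
    "fsetid", "kill", "setgid", "setuid",
    "setpcap", "linux_immutable", "net_bind_service", "net_broadcast",
    "net_admin", "net_raw", "ipc_lock", "ipc_owner",
    "sys_module", "sys_rawio", "sys_chroot", "sys_ptrace",
    "sys_pacct", "sys_admin", "sys_boot", "sys_nice",
    "sys_resource", "sys_time", "sys_tty_config", "mknod",
    "lease", "audit_write", "audit_control", "setfcap",
]


def decode_caps(hex_mask):
    """Decode a CapEff-style hex mask from /proc/<pid>/status into names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


if __name__ == "__main__":
    caps = decode_caps("00000000a80405fb")  # mask from the v1.5.6 dump
    print(caps)
    print("ipc_lock present:", "ipc_lock" in caps)
```

For the mask in the dump, this yields the same 13 capabilities that getpcaps reports, and ipc_lock (bit 14) is not among them.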

So where does this new memlock limit come from? Nomad doesn't set ulimits directly, and the ~8gb doesn't match the 512mb limit set in the jobspec!
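To put the mismatch in numbers, here is a small Python sketch that parses the MEMLOCK row from the prlimit output and compares it with the task's memory. `parse_limit_row` is a hypothetical helper that only handles rows with a non-empty UNITS column, and the jobspec's memory = 512 is assumed to mean MiB.

```python
def parse_limit_row(row):
    """Parse one prlimit row into (resource, soft, hard).

    Assumes the last three whitespace-separated fields are SOFT, HARD and
    UNITS, so rows with an empty UNITS column (NICE, RTPRIO) are not handled.
    "unlimited" is returned as None.
    """
    fields = row.split()

    def to_int(value):
        return None if value == "unlimited" else int(value)

    return fields[0], to_int(fields[-3]), to_int(fields[-2])


# Row copied from the 1.6.0 output in this thread.
ROW_V16 = "MEMLOCK    max locked-in-memory address space 8384729088 8384729088 bytes"
TASK_MEMORY_MIB = 512  # assumption: the jobspec value is in MiB

name, soft, hard = parse_limit_row(ROW_V16)
soft_mib = soft / 2**20
ratio = soft_mib / TASK_MEMORY_MIB
print(f"{name}: {soft_mib:.0f} MiB soft limit, {ratio:.1f}x the task's memory")
```

Under those assumptions the limit works out to roughly 7996 MiB, about 15.6 times the memory allotted to the task.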

+       github.com/opencontainers/runtime-spec v1.1.0-rc.3

We use this package in exec and while it changed substantially I cannot find anywhere it would have started setting this limit.

Nomad v1.5.6 used Go 1.20.4 while v1.6 uses 1.20.5, but I see nothing in Go's changelog to suggest it would have started setting memlock: https://github.com/golang/go/issues?q=milestone%3AGo1.20.5+label%3ACherryPickApproved

This is quite the mystery. I'll ask the team and keep digging.

@tgross
Member

tgross commented Jul 10, 2023

I've run git bisect between 1.5.6 and the current head of main and landed on the following commit, which is from #17535:

$ git bisect bad
a1a52416063626c04f5411d86ac58e19f3838ef5 is the first bad commit
commit a1a52416063626c04f5411d86ac58e19f3838ef5
Author: Patric Stout <github@truebrain.nl>
Date:   Thu Jun 15 16:39:36 2023 +0200

    Fix DevicesSets being removed when cpusets are reloaded with cgroup v2 (#17535)

    * Fix DevicesSets being removed when cpusets are reloaded with cgroup v2

    This meant that if any allocation was created or removed, all
    active DevicesSets were removed from all cgroups of all tasks.

    This was most noticeable with "exec" and "raw_exec", as it meant
    they no longer had access to /dev files.

    * e2e: add test for verifying cgroups do not interfere with access to devices

    ---------

    Co-authored-by: Seth Hoenig <shoenig@duck.com>

 .changelog/17535.txt                   |  3 ++
 client/lib/cgutil/cpuset_manager_v2.go |  3 +-
 e2e/isolation/devices_test.go          | 54 ++++++++++++++++++++++++++++++++++
 e2e/isolation/input/cgroup_devices.hcl | 41 ++++++++++++++++++++++++++
 4 files changed, 100 insertions(+), 1 deletion(-)
 create mode 100644 .changelog/17535.txt
 create mode 100644 e2e/isolation/devices_test.go
 create mode 100644 e2e/isolation/input/cgroup_devices.hcl

@schmichael
Member

Thank you @tgross! That led right to where the behavior changed and Nomad broke mlock without explicitly allowing the ipc_lock capability:

Prior to #17535 libcontainer unset the memlock limit to work around an eBPF issue. You can actually see the infinite memlock limit in #12877 (excellent debugging by that reporter!).

There is a lot I still don't understand here though:

  1. Is there any benefit to setting the memlock limit? At 8gb on my system, which was well over the 512mb allotted to the task, it seems completely arbitrary and unrelated to Nomad's memory guarantees.
  2. Why does libcontainer claim this limit is not inherited into the container when, at least in Nomad's case, it clearly is?
  3. Similarly: why does libcontainer claim it is impossible to start a container which has this flag set, with regard to the SkipDevices parameter set in #17535? Clearly Nomad is able to start containers with that flag set.

There are 2 possible explanations for the strange libcontainer comments above (2 and 3):

  1. The comments are wrong.
  2. The comments are right, but not for the way Nomad uses libcontainer. The comments may be in reference to the specific ways runc uses libcontainer.

Either way, I think "does the memlock limit matter?" (1 above) is the only question that needs answering for Nomad v1.6.0.

The Docker driver maintains the 8gb default max locked memory limit, so mlock, and therefore Vault, would fail under the Docker driver unless the ipc_lock capability is explicitly allowed as an escape hatch.

@schmichael
Member

After chatting with @tgross and @shoenig I feel fairly confident that "The comments are right, but not for the way Nomad uses libcontainer" (2 above) applies: Docker, which uses this code as well, has limits set and therefore requires explicitly adding the ipc_lock capability to allow Vault to call mlock.

Since all of Vault's docs point users toward ipc_lock (and ignore the max locked memory limit alternative), I think that's what Nomad and the exec driver should do as well.

This means that the exec driver in v1.6.0 will impose a stricter restriction on tasks than before (RLIMIT_MEMLOCK will be maintained instead of being unset). I will document this backward incompatibility in the upgrade guide.
