
nomad 1.3.0: node drain stuck at csi controller #12835

Closed
iSchluff opened this issue May 2, 2022 · 5 comments · Fixed by #12846

iSchluff commented May 2, 2022

Nomad version

Nomad v1.3.0-beta.1 (2eba643)

Operating system and Environment details

Ubuntu 20.04.4 LTS
Linux 5.4.0-109-generic #123-Ubuntu SMP Fri Apr 8 09:10:54 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Issue

I don't know whether this is related to #12324, but it seems that Nomad 1.3.0 is not trying to stop the controller allocation at all.

Reproduction steps

Run the CSI controller as a service job and the CSI node plugins as a system job, then issue a single-node drain for the node running the controller.
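
A minimal command sequence for the steps above, assuming the job files attached below are saved as ceph-csi-plugin-controller.nomad and ceph-csi-plugin-nodes.nomad and the variables are supplied via a var-file (filenames and variable values are illustrative):

# nomad job run -var-file=ceph.vars ceph-csi-plugin-controller.nomad
# nomad job run -var-file=ceph.vars ceph-csi-plugin-nodes.nomad

Then, on the node running the controller allocation:

# nomad node drain -self -enable -yes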

Expected Result

The drain should continue and complete.

Actual Result

The drain gets stuck, with both the controller and the CSI node allocation left running.

Job file (if appropriate)

csi controller jobfile

variable "version" {
  type        = string
  description = "Ceph csi container version"
}

variable "cluster_id" {
  type        = string
  description = "cluster ID for the Ceph monitor"
}

variable "monitor_nodes" {
  type        = list(string)
  description = "Ceph monitor node addresses"
}

job "ceph-csi-plugin-controller" {
  datacenters = ["dc1"]
  type        = "service"

  constraint {
    operator = "distinct_hosts"
    value    = true
  }

  group "controller" {
    task "ceph-controller" {
      driver = "docker"
      config {
        image        = "quay.io/cephcsi/cephcsi:${var.version}"
        network_mode = "host"
        args = [
          "--type=cephfs",
          "--controllerserver=true",
          "--drivername=cephfs.csi.ceph.com",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-controller",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5",
        ]
        volumes = [
          "./local/config.json:/etc/ceph-csi-config/config.json"
        ]
        mount {
          type   = "tmpfs"
          target = "/tmp/csi/keys"
          tmpfs_options {
            size = 1000000
          }
        }
      }

      resources {
        cpu    = 1024
        memory = 512
      }

      template {
        data = jsonencode([{
          "clusterID" = var.cluster_id,
          "monitors"  = var.monitor_nodes
        }])
        destination = "local/config.json"
      }

      csi_plugin {
        id        = "ceph-csi"
        type      = "controller"
        mount_dir = "/csi"
      }
    }
  }
}

csi node jobfile

variable "version" {
  type        = string
  description = "Ceph csi container version"
}

variable "cluster_id" {
  type        = string
  description = "cluster ID for the Ceph monitor"
}

variable "monitor_nodes" {
  type        = list(string)
  description = "Ceph monitor node addresses"
}

job "ceph-csi-plugin-nodes" {
  priority    = 94
  datacenters = ["dc1"]
  type        = "system"
  group "nodes" {
    task "ceph-node" {
      driver = "docker"
      config {
        image        = "quay.io/cephcsi/cephcsi:${var.version}"
        network_mode = "host"
        privileged   = true
        args = [
          "--type=cephfs",
          "--drivername=cephfs.csi.ceph.com",
          "--nodeserver=true",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-nodes",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5",
        ]
        volumes = [
          "./local/config.json:/etc/ceph-csi-config/config.json",
          "/lib/modules:/lib/modules"
        ]
        mount {
          type   = "tmpfs"
          target = "/tmp/csi/keys"
          tmpfs_options {
            size = 1000000
          }
        }
      }

      resources {
        cpu    = 256
        memory = 256
      }

      template {
        data = jsonencode([{
          "clusterID" = var.cluster_id,
          "monitors"  = var.monitor_nodes
        }])
        destination = "local/config.json"
      }

      csi_plugin {
        id        = "ceph-csi"
        type      = "node"
        mount_dir = "/csi"
      }
    }
  }
}

# nomad node drain -self -enable -yes
2022-05-02T12:15:48Z: Ctrl-C to stop monitoring: will not cancel the node drain
2022-05-02T12:15:48Z: Node "4841bc38-3fb4-90b3-dce4-a2191031e8fe" drain strategy set
2022-05-02T12:15:49Z: Alloc "34d5356a-f59f-da4e-28e8-91296a502f6f" marked for migration
2022-05-02T12:15:49Z: Alloc "65d8341c-37aa-cc03-ea6d-5de00261b4b8" marked for migration
2022-05-02T12:15:49Z: Alloc "4fd17396-573d-aaa6-f7b8-562161c1d84c" marked for migration
2022-05-02T12:15:49Z: Alloc "2823d704-0c9d-ecb9-f4dc-28afb8201d1d" marked for migration
2022-05-02T12:15:49Z: Alloc "61686de3-37b9-9da5-56fa-0a8ea11d561f" marked for migration
2022-05-02T12:15:49Z: Alloc "5b64e570-6681-0fbf-3087-df6e00a50541" marked for migration
2022-05-02T12:15:49Z: Alloc "2823d704-0c9d-ecb9-f4dc-28afb8201d1d" draining
2022-05-02T12:15:49Z: Alloc "61686de3-37b9-9da5-56fa-0a8ea11d561f" draining
2022-05-02T12:15:49Z: Alloc "34d5356a-f59f-da4e-28e8-91296a502f6f" draining
2022-05-02T12:15:49Z: Alloc "5b64e570-6681-0fbf-3087-df6e00a50541" draining
2022-05-02T12:15:49Z: Alloc "65d8341c-37aa-cc03-ea6d-5de00261b4b8" draining
2022-05-02T12:15:49Z: Alloc "4fd17396-573d-aaa6-f7b8-562161c1d84c" draining
2022-05-02T12:15:49Z: Alloc "34d5356a-f59f-da4e-28e8-91296a502f6f" status running -> complete
2022-05-02T12:15:49Z: Alloc "61686de3-37b9-9da5-56fa-0a8ea11d561f" status running -> complete
2022-05-02T12:15:50Z: Alloc "c3483899-6686-7a18-632c-5bb8c19a9b3e" marked for migration
2022-05-02T12:15:50Z: Alloc "c3483899-6686-7a18-632c-5bb8c19a9b3e" draining
2022-05-02T12:15:54Z: Alloc "2823d704-0c9d-ecb9-f4dc-28afb8201d1d" status running -> complete
2022-05-02T12:15:54Z: Alloc "5b64e570-6681-0fbf-3087-df6e00a50541" status running -> complete
2022-05-02T12:15:55Z: Alloc "65d8341c-37aa-cc03-ea6d-5de00261b4b8" status running -> complete
2022-05-02T12:15:55Z: Alloc "4fd17396-573d-aaa6-f7b8-562161c1d84c" status running -> complete
2022-05-02T12:15:55Z: Alloc "c3483899-6686-7a18-632c-5bb8c19a9b3e" status running -> complete
2022-05-02T12:15:56Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" status running -> pending
2022-05-02T12:15:56Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" status pending -> running
2022-05-02T12:16:02Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" status running -> pending
2022-05-02T12:16:02Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" status pending -> running
2022-05-02T12:17:17Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" marked for migration
2022-05-02T12:17:18Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" draining
2022-05-02T12:17:23Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" status running -> complete
# nomad job status ceph-csi-plugin-controller
ID            = ceph-csi-plugin-controller
Name          = ceph-csi-plugin-controller
Submit Date   = 2022-04-27T15:38:35Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
controller  0       0         1        0       2         0     0

Latest Deployment
ID          = 8b15810f
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
controller  1        1       1        0          2022-04-27T15:48:46Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
ac91f0a2  4841bc38  controller  55       run      running  4d21h ago  4d21h ago

# nomad alloc status ac91f0a2
ID                  = ac91f0a2-91c0-e0b8-6f5b-fb81a85afc59
Eval ID             = c5069fe2
Name                = ceph-csi-plugin-controller.controller[0]
Node ID             = 4841bc38
Node Name           = nomad2
Job ID              = ceph-csi-plugin-controller
Job Version         = 55
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 4d21h ago
Modified            = 4d21h ago
Deployment ID       = 8b15810f
Deployment Health   = healthy

Task "ceph-controller" is "running"
Task Resources
CPU         Memory          Disk     Addresses
0/1024 MHz  16 MiB/512 MiB  300 MiB

Task Events:
Started At     = 2022-04-27T15:38:36Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type                   Description
2022-04-27T15:38:36Z  Plugin became healthy  plugin: ceph-csi
2022-04-27T15:38:36Z  Started                Task started by client
2022-04-27T15:38:36Z  Task Setup             Building Task Directory
2022-04-27T15:38:36Z  Received               Task received by client

Nomad Server logs (if appropriate)

May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.422Z [INFO]  client.driver_mgr.docker: stopped container: container_id=c2f30774a3b6dd3e165e6051518f326d8fca619320ec5dff09f9e37d5141df00 driver=docker
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.463Z [INFO]  agent: (runner) stopping
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.463Z [INFO]  agent: (runner) received finish
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.495Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/terraform/local/main.tf"
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.541Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/terraform/local/main.tf"
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.549Z [INFO]  client.driver_mgr.docker: stopped container: container_id=b1bb97dccd3486f75762b091751f77f9d63bb3b0549a51ae273efdeb76fcbf81 driver=docker
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.587Z [INFO]  client.driver_mgr.docker: stopped container: container_id=f01d34080ce70be3a639d347f8e4e29906436bb1b1ac7c199001d3f913decc49 driver=docker
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.594Z [INFO]  client.gc: marking allocation for GC: alloc_id=61686de3-37b9-9da5-56fa-0a8ea11d561f
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.596Z [INFO]  agent: (runner) stopping
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.596Z [INFO]  agent: (runner) received finish
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.630Z [INFO]  client.driver_mgr.docker: stopped container: container_id=5e425cc3043670bfc37836630bd9b959f0c63c71bc1fb74546519f42358816e6 driver=docker
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.633Z [INFO]  client.gc: marking allocation for GC: alloc_id=34d5356a-f59f-da4e-28e8-91296a502f6f
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.635Z [INFO]  agent: (runner) stopping
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.635Z [INFO]  agent: (runner) received finish
May 02 12:15:50 nomad2 nomad[1065]:     2022-05-02T12:15:50.421Z [INFO]  client.driver_mgr.docker: stopped container: container_id=74d89052531d5635e92fb5ad738ffdd228e347b2942c81c12f6adc6d042c88ba driver=docker
May 02 12:15:50 nomad2 nomad[1065]:     2022-05-02T12:15:50.713Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/terraform/local/main.tf"
May 02 12:15:50 nomad2 nomad[1065]:     2022-05-02T12:15:50.831Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/terraform/local/main.tf"
May 02 12:15:51 nomad2 nomad[1065]:     2022-05-02T12:15:51.472Z [INFO]  client.driver_mgr.docker: stopped container: container_id=6449d398e83532afa59a99259ad7b1e8d88deace0c1f05ade36186e1427e6fb3 driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.240Z [INFO]  client.driver_mgr.docker: stopped container: container_id=f28b66393f9d90b2f8c88714efa189bb603864d621422fbc1aeb893a1b72827f driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.415Z [INFO]  client.driver_mgr.docker: stopped container: container_id=cf57b5569798138b4662ae1738b10468633ac3ef846620581e5688b76f7ae5d8 driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.432Z [INFO]  client.alloc_runner: killing task: alloc_id=2823d704-0c9d-ecb9-f4dc-28afb8201d1d task=connect-proxy-registry
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.432Z [INFO]  client.gc: marking allocation for GC: alloc_id=2823d704-0c9d-ecb9-f4dc-28afb8201d1d
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.590Z [INFO]  client.driver_mgr.docker: stopped container: container_id=78bd6c19b8aadb30e761fc8fda5c1e94ae497276d882b0f092b7c269b6f905da driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.607Z [INFO]  client.alloc_runner: killing task: alloc_id=5b64e570-6681-0fbf-3087-df6e00a50541 task=connect-proxy-echo
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.607Z [INFO]  client.gc: marking allocation for GC: alloc_id=5b64e570-6681-0fbf-3087-df6e00a50541
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.729Z [INFO]  client.driver_mgr.docker: stopped container: container_id=4f62ab8d8f847a43955365b364bee6b2e24652ccb95a4997c675c682896967d1 driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.753Z [INFO]  client.alloc_runner: killing task: alloc_id=65d8341c-37aa-cc03-ea6d-5de00261b4b8 task=connect-proxy-hawkbit-mysql
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.753Z [INFO]  client.gc: marking allocation for GC: alloc_id=65d8341c-37aa-cc03-ea6d-5de00261b4b8
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.834Z [INFO]  client.driver_mgr.docker: stopped container: container_id=b62960cb61b94fd6e74aaef9f3de6ea216cb76ed19d64021638ae47c414331ed driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.837Z [INFO]  client.driver_mgr.docker: stopped container: container_id=f698ae9b2b09f1a07bc04e9e9352ff412179108dc2e56d2e2112741d53fd222b driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.881Z [INFO]  client.alloc_runner: killing task: alloc_id=4fd17396-573d-aaa6-f7b8-562161c1d84c task=connect-proxy-rabbitmq-amqp
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.881Z [INFO]  client.alloc_runner: killing task: alloc_id=4fd17396-573d-aaa6-f7b8-562161c1d84c task=connect-proxy-rabbitmq-mgmt
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.881Z [INFO]  client.gc: marking allocation for GC: alloc_id=4fd17396-573d-aaa6-f7b8-562161c1d84c
May 02 12:15:55 nomad2 nomad[1065]:     2022-05-02T12:15:55.424Z [INFO]  client.driver_mgr.docker: stopped container: container_id=a6bc09498ca835bf7e0e59f494dd8d0fde14569fa055aadf9fd14f77769865a6 driver=docker
May 02 12:15:55 nomad2 nomad[1065]:     2022-05-02T12:15:55.447Z [INFO]  client.alloc_runner: killing task: alloc_id=c3483899-6686-7a18-632c-5bb8c19a9b3e task=connect-proxy-echo
May 02 12:15:55 nomad2 nomad[1065]:     2022-05-02T12:15:55.447Z [INFO]  client.gc: marking allocation for GC: alloc_id=c3483899-6686-7a18-632c-5bb8c19a9b3e
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.025Z [INFO]  client.driver_mgr.docker: stopped container: container_id=8d3f0593f75146cb2521182440472aae424baf2d23ab07ef13bd0302fea56432 driver=docker
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.036Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd task=terraform reason="" delay=0s
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.039Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd task=terraform @module=logmon path=/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/alloc/logs/.terraform.stdout>
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.040Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd task=terraform path=/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/alloc/logs/.terraform.stderr.fifo @module=l>
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.068Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=efc49426570c12243673fc8185333e0f653b7f8b72ae0fa1d28580d9ed4a1ba0
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.175Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=efc49426570c12243673fc8185333e0f653b7f8b72ae0fa1d28580d9ed4a1ba0
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.268Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/terraform/local/main.tf"
May 02 12:15:57 nomad2 nomad[1065]:     2022-05-02T12:15:57.873Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/terraform/local/main.tf"
May 02 12:16:01 nomad2 nomad[1065]:     2022-05-02T12:16:01.948Z [INFO]  client.driver_mgr.docker: stopped container: container_id=efc49426570c12243673fc8185333e0f653b7f8b72ae0fa1d28580d9ed4a1ba0 driver=docker
May 02 12:16:01 nomad2 nomad[1065]:     2022-05-02T12:16:01.959Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd task=terraform reason="" delay=0s
May 02 12:16:01 nomad2 nomad[1065]:     2022-05-02T12:16:01.962Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd task=terraform @module=logmon path=/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/alloc/logs/.terraform.stdout>
May 02 12:16:01 nomad2 nomad[1065]:     2022-05-02T12:16:01.962Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd task=terraform @module=logmon path=/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/alloc/logs/.terraform.stderr>
May 02 12:16:02 nomad2 nomad[1065]:     2022-05-02T12:16:02.012Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=9a1fea0a4a8f22f9e9aff3a55ee62f6858f6bb06f4000e4a68ab3978d7c97b4d
May 02 12:16:02 nomad2 nomad[1065]:     2022-05-02T12:16:02.126Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=9a1fea0a4a8f22f9e9aff3a55ee62f6858f6bb06f4000e4a68ab3978d7c97b4d
May 02 12:17:23 nomad2 nomad[1065]:     2022-05-02T12:17:23.167Z [INFO]  client.driver_mgr.docker: stopped container: container_id=9a1fea0a4a8f22f9e9aff3a55ee62f6858f6bb06f4000e4a68ab3978d7c97b4d driver=docker
May 02 12:17:23 nomad2 nomad[1065]:     2022-05-02T12:17:23.188Z [INFO]  client.gc: marking allocation for GC: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd
May 02 12:17:23 nomad2 nomad[1065]:     2022-05-02T12:17:23.190Z [INFO]  agent: (runner) stopping
May 02 12:17:23 nomad2 nomad[1065]:     2022-05-02T12:17:23.190Z [INFO]  agent: (runner) received finish

Nomad Client logs (if appropriate)

iSchluff changed the title from "nomad 1.3.0: unable to drain csi controller" to "nomad 1.3.0: node drain stuck at csi controller" on May 2, 2022

tgross commented May 2, 2022

@iSchluff has the -deadline time expired? The default is quite long and as the docs say:

By default system jobs (and CSI plugins) are stopped last, after the deadline time has expired.
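
If waiting that long isn't desirable, the deadline can also be shortened explicitly when issuing the drain, along the lines of the following (the 5m value is just an illustration):

# nomad node drain -self -enable -deadline 5m -yes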


iSchluff commented May 2, 2022

@iSchluff has the -deadline time expired? The default is quite long and as the docs say:

By default system jobs (and CSI plugins) are stopped last, after the deadline time has expired.

Hi @tgross, yeah, I read that just 10 minutes ago and also saw your update to the docs in the PR. However, I am quite surprised by this, because previously with 1.2 I could drain system jobs immediately without issue.

Looking at the code, I think the system jobs are supposed to be drained once all "normal" allocations are gone, not only when the deadline expires.

I think the fix would be for the isDone check to ignore plugin jobs as well:

// System jobs are only stopped after a node is done draining
// everything else, so ignore them here.
if alloc.Job.Type == structs.JobTypeSystem {
    continue
}

Otherwise func (n *NodeDrainer) handleMigratedAllocs(allocs []*structs.Allocation) will never declare the node done.

I guess isDone could generally just be replaced by len(DrainingJobs()) == 0?
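
For illustration, here is a self-contained sketch of the first suggestion, i.e. skipping plugin allocations in the done-check the same way system allocations are skipped. The types are reduced stand-ins for Nomad's structs package and HasCSIPlugin is a hypothetical helper; this is not the actual drainer code.

// isdone_sketch.go: simplified illustration, not Nomad source.
package main

import "fmt"

const (
    JobTypeSystem  = "system"
    JobTypeService = "service"
)

type CSIPluginConfig struct {
    ID   string
    Type string // "controller" or "node"
}

type Task struct {
    CSIPluginConfig *CSIPluginConfig // non-nil when the task registers a CSI plugin
}

type TaskGroup struct {
    Tasks []*Task
}

type Job struct {
    Type       string
    TaskGroups []*TaskGroup
}

// HasCSIPlugin reports whether any task in the job registers a CSI plugin.
func (j *Job) HasCSIPlugin() bool {
    for _, tg := range j.TaskGroups {
        for _, t := range tg.Tasks {
            if t.CSIPluginConfig != nil {
                return true
            }
        }
    }
    return false
}

type Allocation struct {
    ID  string
    Job *Job
}

// isDone returns true once no remaining allocation on the node still needs to
// be migrated. System jobs and CSI plugin jobs are stopped only after
// everything else has drained, so both are ignored here.
func isDone(allocs []*Allocation) bool {
    for _, alloc := range allocs {
        if alloc.Job.Type == JobTypeSystem || alloc.Job.HasCSIPlugin() {
            continue
        }
        return false
    }
    return true
}

func main() {
    plugin := &Allocation{ID: "ctrl", Job: &Job{
        Type:       JobTypeService,
        TaskGroups: []*TaskGroup{{Tasks: []*Task{{CSIPluginConfig: &CSIPluginConfig{ID: "ceph-csi", Type: "controller"}}}}},
    }}
    webapp := &Allocation{ID: "web", Job: &Job{
        Type:       JobTypeService,
        TaskGroups: []*TaskGroup{{Tasks: []*Task{{}}}},
    }}

    fmt.Println(isDone([]*Allocation{plugin, webapp})) // false: webapp still has to migrate
    fmt.Println(isDone([]*Allocation{plugin}))         // true: only the plugin alloc is left
}

With a check like this, the node would be considered done draining as soon as only system and plugin allocations remain, instead of waiting for the deadline.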


tgross commented May 2, 2022

Oh shoot, you're right. Good catch on the IsDone(). I'll get that fixed.


tgross commented May 2, 2022

I've opened #12846 with that fix.


github-actions bot commented Oct 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
