
nomad 1.3.0: node drain stuck at csi controller #12835

Closed
iSchluff opened this issue May 2, 2022 · 5 comments · Fixed by #12846

iSchluff commented May 2, 2022

Nomad version

Nomad v1.3.0-beta.1 (2eba643)

Operating system and Environment details

Ubuntu 20.04.4 LTS
Linux 5.4.0-109-generic #123-Ubuntu SMP Fri Apr 8 09:10:54 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Issue

I don't know whether this is related to #12324, but it seems that Nomad 1.3.0 is not trying to stop the controller allocation at all.

Reproduction steps

Run the CSI controller as a service job and the CSI node plugins as a system job, then issue a single-node drain for the node running the controller.
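
A minimal command sequence for the steps above, assuming the job files attached below are saved as ceph-csi-plugin-controller.nomad and ceph-csi-plugin-nodes.nomad and the variables are supplied via a var-file (filenames and variable values are illustrative):

# nomad job run -var-file=ceph.vars ceph-csi-plugin-controller.nomad
# nomad job run -var-file=ceph.vars ceph-csi-plugin-nodes.nomad

Then, on the node running the controller allocation:

# nomad node drain -self -enable -yes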

Expected Result

The drain should continue and complete.

Actual Result

The drain gets stuck, with both the controller and the CSI node allocation left running.

Job file (if appropriate)

csi controller jobfile

variable "version" {
  type        = string
  description = "Ceph csi container version"
}

variable "cluster_id" {
  type        = string
  description = "cluster ID for the Ceph monitor"
}

variable "monitor_nodes" {
  type        = list(string)
  description = "Ceph monitor node addresses"
}

job "ceph-csi-plugin-controller" {
  datacenters = ["dc1"]
  type        = "service"

  constraint {
    operator = "distinct_hosts"
    value    = true
  }

  group "controller" {
    task "ceph-controller" {
      driver = "docker"
      config {
        image        = "quay.io/cephcsi/cephcsi:${var.version}"
        network_mode = "host"
        args = [
          "--type=cephfs",
          "--controllerserver=true",
          "--drivername=cephfs.csi.ceph.com",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-controller",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5",
        ]
        volumes = [
          "./local/config.json:/etc/ceph-csi-config/config.json"
        ]
        mount {
          type   = "tmpfs"
          target = "/tmp/csi/keys"
          tmpfs_options {
            size = 1000000
          }
        }
      }

      resources {
        cpu    = 1024
        memory = 512
      }

      template {
        data = jsonencode([{
          "clusterID" = var.cluster_id,
          "monitors"  = var.monitor_nodes
        }])
        destination = "local/config.json"
      }

      csi_plugin {
        id        = "ceph-csi"
        type      = "controller"
        mount_dir = "/csi"
      }
    }
  }
}

csi node jobfile

variable "version" {
  type        = string
  description = "Ceph csi container version"
}

variable "cluster_id" {
  type        = string
  description = "cluster ID for the Ceph monitor"
}

variable "monitor_nodes" {
  type        = list(string)
  description = "Ceph monitor node addresses"
}

job "ceph-csi-plugin-nodes" {
  priority    = 94
  datacenters = ["dc1"]
  type        = "system"
  group "nodes" {
    task "ceph-node" {
      driver = "docker"
      config {
        image        = "quay.io/cephcsi/cephcsi:${var.version}"
        network_mode = "host"
        privileged   = true
        args = [
          "--type=cephfs",
          "--drivername=cephfs.csi.ceph.com",
          "--nodeserver=true",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-nodes",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5",
        ]
        volumes = [
          "./local/config.json:/etc/ceph-csi-config/config.json",
          "/lib/modules:/lib/modules"
        ]
        mount {
          type   = "tmpfs"
          target = "/tmp/csi/keys"
          tmpfs_options {
            size = 1000000
          }
        }
      }

      resources {
        cpu    = 256
        memory = 256
      }

      template {
        data = jsonencode([{
          "clusterID" = var.cluster_id,
          "monitors"  = var.monitor_nodes
        }])
        destination = "local/config.json"
      }

      csi_plugin {
        id        = "ceph-csi"
        type      = "node"
        mount_dir = "/csi"
      }
    }
  }
}

# nomad node drain -self -enable -yes
2022-05-02T12:15:48Z: Ctrl-C to stop monitoring: will not cancel the node drain
2022-05-02T12:15:48Z: Node "4841bc38-3fb4-90b3-dce4-a2191031e8fe" drain strategy set
2022-05-02T12:15:49Z: Alloc "34d5356a-f59f-da4e-28e8-91296a502f6f" marked for migration
2022-05-02T12:15:49Z: Alloc "65d8341c-37aa-cc03-ea6d-5de00261b4b8" marked for migration
2022-05-02T12:15:49Z: Alloc "4fd17396-573d-aaa6-f7b8-562161c1d84c" marked for migration
2022-05-02T12:15:49Z: Alloc "2823d704-0c9d-ecb9-f4dc-28afb8201d1d" marked for migration
2022-05-02T12:15:49Z: Alloc "61686de3-37b9-9da5-56fa-0a8ea11d561f" marked for migration
2022-05-02T12:15:49Z: Alloc "5b64e570-6681-0fbf-3087-df6e00a50541" marked for migration
2022-05-02T12:15:49Z: Alloc "2823d704-0c9d-ecb9-f4dc-28afb8201d1d" draining
2022-05-02T12:15:49Z: Alloc "61686de3-37b9-9da5-56fa-0a8ea11d561f" draining
2022-05-02T12:15:49Z: Alloc "34d5356a-f59f-da4e-28e8-91296a502f6f" draining
2022-05-02T12:15:49Z: Alloc "5b64e570-6681-0fbf-3087-df6e00a50541" draining
2022-05-02T12:15:49Z: Alloc "65d8341c-37aa-cc03-ea6d-5de00261b4b8" draining
2022-05-02T12:15:49Z: Alloc "4fd17396-573d-aaa6-f7b8-562161c1d84c" draining
2022-05-02T12:15:49Z: Alloc "34d5356a-f59f-da4e-28e8-91296a502f6f" status running -> complete
2022-05-02T12:15:49Z: Alloc "61686de3-37b9-9da5-56fa-0a8ea11d561f" status running -> complete
2022-05-02T12:15:50Z: Alloc "c3483899-6686-7a18-632c-5bb8c19a9b3e" marked for migration
2022-05-02T12:15:50Z: Alloc "c3483899-6686-7a18-632c-5bb8c19a9b3e" draining
2022-05-02T12:15:54Z: Alloc "2823d704-0c9d-ecb9-f4dc-28afb8201d1d" status running -> complete
2022-05-02T12:15:54Z: Alloc "5b64e570-6681-0fbf-3087-df6e00a50541" status running -> complete
2022-05-02T12:15:55Z: Alloc "65d8341c-37aa-cc03-ea6d-5de00261b4b8" status running -> complete
2022-05-02T12:15:55Z: Alloc "4fd17396-573d-aaa6-f7b8-562161c1d84c" status running -> complete
2022-05-02T12:15:55Z: Alloc "c3483899-6686-7a18-632c-5bb8c19a9b3e" status running -> complete
2022-05-02T12:15:56Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" status running -> pending
2022-05-02T12:15:56Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" status pending -> running
2022-05-02T12:16:02Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" status running -> pending
2022-05-02T12:16:02Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" status pending -> running
2022-05-02T12:17:17Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" marked for migration
2022-05-02T12:17:18Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" draining
2022-05-02T12:17:23Z: Alloc "9ac154e8-27ff-5e69-df03-7cd67a605bfd" status running -> complete
# nomad job status ceph-csi-plugin-controller
ID            = ceph-csi-plugin-controller
Name          = ceph-csi-plugin-controller
Submit Date   = 2022-04-27T15:38:35Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
controller  0       0         1        0       2         0     0

Latest Deployment
ID          = 8b15810f
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
controller  1        1       1        0          2022-04-27T15:48:46Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
ac91f0a2  4841bc38  controller  55       run      running  4d21h ago  4d21h ago

# nomad alloc status ac91f0a2
ID                  = ac91f0a2-91c0-e0b8-6f5b-fb81a85afc59
Eval ID             = c5069fe2
Name                = ceph-csi-plugin-controller.controller[0]
Node ID             = 4841bc38
Node Name           = nomad2
Job ID              = ceph-csi-plugin-controller
Job Version         = 55
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 4d21h ago
Modified            = 4d21h ago
Deployment ID       = 8b15810f
Deployment Health   = healthy

Task "ceph-controller" is "running"
Task Resources
CPU         Memory          Disk     Addresses
0/1024 MHz  16 MiB/512 MiB  300 MiB

Task Events:
Started At     = 2022-04-27T15:38:36Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type                   Description
2022-04-27T15:38:36Z  Plugin became healthy  plugin: ceph-csi
2022-04-27T15:38:36Z  Started                Task started by client
2022-04-27T15:38:36Z  Task Setup             Building Task Directory
2022-04-27T15:38:36Z  Received               Task received by client

Nomad Server logs (if appropriate)

May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.422Z [INFO]  client.driver_mgr.docker: stopped container: container_id=c2f30774a3b6dd3e165e6051518f326d8fca619320ec5dff09f9e37d5141df00 driver=docker
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.463Z [INFO]  agent: (runner) stopping
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.463Z [INFO]  agent: (runner) received finish
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.495Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/terraform/local/main.tf"
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.541Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/terraform/local/main.tf"
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.549Z [INFO]  client.driver_mgr.docker: stopped container: container_id=b1bb97dccd3486f75762b091751f77f9d63bb3b0549a51ae273efdeb76fcbf81 driver=docker
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.587Z [INFO]  client.driver_mgr.docker: stopped container: container_id=f01d34080ce70be3a639d347f8e4e29906436bb1b1ac7c199001d3f913decc49 driver=docker
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.594Z [INFO]  client.gc: marking allocation for GC: alloc_id=61686de3-37b9-9da5-56fa-0a8ea11d561f
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.596Z [INFO]  agent: (runner) stopping
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.596Z [INFO]  agent: (runner) received finish
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.630Z [INFO]  client.driver_mgr.docker: stopped container: container_id=5e425cc3043670bfc37836630bd9b959f0c63c71bc1fb74546519f42358816e6 driver=docker
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.633Z [INFO]  client.gc: marking allocation for GC: alloc_id=34d5356a-f59f-da4e-28e8-91296a502f6f
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.635Z [INFO]  agent: (runner) stopping
May 02 12:15:49 nomad2 nomad[1065]:     2022-05-02T12:15:49.635Z [INFO]  agent: (runner) received finish
May 02 12:15:50 nomad2 nomad[1065]:     2022-05-02T12:15:50.421Z [INFO]  client.driver_mgr.docker: stopped container: container_id=74d89052531d5635e92fb5ad738ffdd228e347b2942c81c12f6adc6d042c88ba driver=docker
May 02 12:15:50 nomad2 nomad[1065]:     2022-05-02T12:15:50.713Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/terraform/local/main.tf"
May 02 12:15:50 nomad2 nomad[1065]:     2022-05-02T12:15:50.831Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/terraform/local/main.tf"
May 02 12:15:51 nomad2 nomad[1065]:     2022-05-02T12:15:51.472Z [INFO]  client.driver_mgr.docker: stopped container: container_id=6449d398e83532afa59a99259ad7b1e8d88deace0c1f05ade36186e1427e6fb3 driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.240Z [INFO]  client.driver_mgr.docker: stopped container: container_id=f28b66393f9d90b2f8c88714efa189bb603864d621422fbc1aeb893a1b72827f driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.415Z [INFO]  client.driver_mgr.docker: stopped container: container_id=cf57b5569798138b4662ae1738b10468633ac3ef846620581e5688b76f7ae5d8 driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.432Z [INFO]  client.alloc_runner: killing task: alloc_id=2823d704-0c9d-ecb9-f4dc-28afb8201d1d task=connect-proxy-registry
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.432Z [INFO]  client.gc: marking allocation for GC: alloc_id=2823d704-0c9d-ecb9-f4dc-28afb8201d1d
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.590Z [INFO]  client.driver_mgr.docker: stopped container: container_id=78bd6c19b8aadb30e761fc8fda5c1e94ae497276d882b0f092b7c269b6f905da driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.607Z [INFO]  client.alloc_runner: killing task: alloc_id=5b64e570-6681-0fbf-3087-df6e00a50541 task=connect-proxy-echo
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.607Z [INFO]  client.gc: marking allocation for GC: alloc_id=5b64e570-6681-0fbf-3087-df6e00a50541
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.729Z [INFO]  client.driver_mgr.docker: stopped container: container_id=4f62ab8d8f847a43955365b364bee6b2e24652ccb95a4997c675c682896967d1 driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.753Z [INFO]  client.alloc_runner: killing task: alloc_id=65d8341c-37aa-cc03-ea6d-5de00261b4b8 task=connect-proxy-hawkbit-mysql
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.753Z [INFO]  client.gc: marking allocation for GC: alloc_id=65d8341c-37aa-cc03-ea6d-5de00261b4b8
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.834Z [INFO]  client.driver_mgr.docker: stopped container: container_id=b62960cb61b94fd6e74aaef9f3de6ea216cb76ed19d64021638ae47c414331ed driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.837Z [INFO]  client.driver_mgr.docker: stopped container: container_id=f698ae9b2b09f1a07bc04e9e9352ff412179108dc2e56d2e2112741d53fd222b driver=docker
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.881Z [INFO]  client.alloc_runner: killing task: alloc_id=4fd17396-573d-aaa6-f7b8-562161c1d84c task=connect-proxy-rabbitmq-amqp
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.881Z [INFO]  client.alloc_runner: killing task: alloc_id=4fd17396-573d-aaa6-f7b8-562161c1d84c task=connect-proxy-rabbitmq-mgmt
May 02 12:15:54 nomad2 nomad[1065]:     2022-05-02T12:15:54.881Z [INFO]  client.gc: marking allocation for GC: alloc_id=4fd17396-573d-aaa6-f7b8-562161c1d84c
May 02 12:15:55 nomad2 nomad[1065]:     2022-05-02T12:15:55.424Z [INFO]  client.driver_mgr.docker: stopped container: container_id=a6bc09498ca835bf7e0e59f494dd8d0fde14569fa055aadf9fd14f77769865a6 driver=docker
May 02 12:15:55 nomad2 nomad[1065]:     2022-05-02T12:15:55.447Z [INFO]  client.alloc_runner: killing task: alloc_id=c3483899-6686-7a18-632c-5bb8c19a9b3e task=connect-proxy-echo
May 02 12:15:55 nomad2 nomad[1065]:     2022-05-02T12:15:55.447Z [INFO]  client.gc: marking allocation for GC: alloc_id=c3483899-6686-7a18-632c-5bb8c19a9b3e
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.025Z [INFO]  client.driver_mgr.docker: stopped container: container_id=8d3f0593f75146cb2521182440472aae424baf2d23ab07ef13bd0302fea56432 driver=docker
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.036Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd task=terraform reason="" delay=0s
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.039Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd task=terraform @module=logmon path=/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/alloc/logs/.terraform.stdout>
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.040Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd task=terraform path=/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/alloc/logs/.terraform.stderr.fifo @module=l>
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.068Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=efc49426570c12243673fc8185333e0f653b7f8b72ae0fa1d28580d9ed4a1ba0
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.175Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=efc49426570c12243673fc8185333e0f653b7f8b72ae0fa1d28580d9ed4a1ba0
May 02 12:15:56 nomad2 nomad[1065]:     2022-05-02T12:15:56.268Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/terraform/local/main.tf"
May 02 12:15:57 nomad2 nomad[1065]:     2022-05-02T12:15:57.873Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/terraform/local/main.tf"
May 02 12:16:01 nomad2 nomad[1065]:     2022-05-02T12:16:01.948Z [INFO]  client.driver_mgr.docker: stopped container: container_id=efc49426570c12243673fc8185333e0f653b7f8b72ae0fa1d28580d9ed4a1ba0 driver=docker
May 02 12:16:01 nomad2 nomad[1065]:     2022-05-02T12:16:01.959Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd task=terraform reason="" delay=0s
May 02 12:16:01 nomad2 nomad[1065]:     2022-05-02T12:16:01.962Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd task=terraform @module=logmon path=/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/alloc/logs/.terraform.stdout>
May 02 12:16:01 nomad2 nomad[1065]:     2022-05-02T12:16:01.962Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd task=terraform @module=logmon path=/opt/nomad/data/alloc/9ac154e8-27ff-5e69-df03-7cd67a605bfd/alloc/logs/.terraform.stderr>
May 02 12:16:02 nomad2 nomad[1065]:     2022-05-02T12:16:02.012Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=9a1fea0a4a8f22f9e9aff3a55ee62f6858f6bb06f4000e4a68ab3978d7c97b4d
May 02 12:16:02 nomad2 nomad[1065]:     2022-05-02T12:16:02.126Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=9a1fea0a4a8f22f9e9aff3a55ee62f6858f6bb06f4000e4a68ab3978d7c97b4d
May 02 12:17:23 nomad2 nomad[1065]:     2022-05-02T12:17:23.167Z [INFO]  client.driver_mgr.docker: stopped container: container_id=9a1fea0a4a8f22f9e9aff3a55ee62f6858f6bb06f4000e4a68ab3978d7c97b4d driver=docker
May 02 12:17:23 nomad2 nomad[1065]:     2022-05-02T12:17:23.188Z [INFO]  client.gc: marking allocation for GC: alloc_id=9ac154e8-27ff-5e69-df03-7cd67a605bfd
May 02 12:17:23 nomad2 nomad[1065]:     2022-05-02T12:17:23.190Z [INFO]  agent: (runner) stopping
May 02 12:17:23 nomad2 nomad[1065]:     2022-05-02T12:17:23.190Z [INFO]  agent: (runner) received finish

Nomad Client logs (if appropriate)

iSchluff changed the title from "nomad 1.3.0: unable to drain csi controller" to "nomad 1.3.0: node drain stuck at csi controller" on May 2, 2022

tgross commented May 2, 2022

@iSchluff has the -deadline time expired? The default is quite long and as the docs say:

By default system jobs (and CSI plugins) are stopped last, after the deadline time has expired.
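
If waiting that long isn't desirable, the deadline can also be shortened explicitly when issuing the drain, along the lines of the following (the 5m value is just an illustration):

# nomad node drain -self -enable -deadline 5m -yes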


iSchluff commented May 2, 2022

@iSchluff has the -deadline time expired? The default is quite long and as the docs say:

By default system jobs (and CSI plugins) are stopped last, after the deadline time has expired.

Hi @tgross, yeah, I read that just 10 minutes ago and also saw your update to the docs in the PR. However, I am quite surprised by this, because previously with 1.2 I could drain system jobs immediately without issue.

Looking at the code, I think the system jobs are supposed to be drained once all "normal" allocations are gone, not only when the deadline expires.

I think the fix would be for the isDone check to ignore plugin jobs as well:

// System jobs are only stopped after a node is done draining
// everything else, so ignore them here.
if alloc.Job.Type == structs.JobTypeSystem {
    continue
}

Otherwise func (n *NodeDrainer) handleMigratedAllocs(allocs []*structs.Allocation) will never declare the node done.

I guess isDone could generally just be replaced by len(DrainingJobs()) == 0?
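
For illustration, here is a self-contained sketch of the first suggestion, i.e. skipping plugin allocations in the done-check the same way system allocations are skipped. The types are reduced stand-ins for Nomad's structs package and HasCSIPlugin is a hypothetical helper; this is not the actual drainer code.

// isdone_sketch.go: simplified illustration, not Nomad source.
package main

import "fmt"

const (
    JobTypeSystem  = "system"
    JobTypeService = "service"
)

type CSIPluginConfig struct {
    ID   string
    Type string // "controller" or "node"
}

type Task struct {
    CSIPluginConfig *CSIPluginConfig // non-nil when the task registers a CSI plugin
}

type TaskGroup struct {
    Tasks []*Task
}

type Job struct {
    Type       string
    TaskGroups []*TaskGroup
}

// HasCSIPlugin reports whether any task in the job registers a CSI plugin.
func (j *Job) HasCSIPlugin() bool {
    for _, tg := range j.TaskGroups {
        for _, t := range tg.Tasks {
            if t.CSIPluginConfig != nil {
                return true
            }
        }
    }
    return false
}

type Allocation struct {
    ID  string
    Job *Job
}

// isDone returns true once no remaining allocation on the node still needs to
// be migrated. System jobs and CSI plugin jobs are stopped only after
// everything else has drained, so both are ignored here.
func isDone(allocs []*Allocation) bool {
    for _, alloc := range allocs {
        if alloc.Job.Type == JobTypeSystem || alloc.Job.HasCSIPlugin() {
            continue
        }
        return false
    }
    return true
}

func main() {
    plugin := &Allocation{ID: "ctrl", Job: &Job{
        Type:       JobTypeService,
        TaskGroups: []*TaskGroup{{Tasks: []*Task{{CSIPluginConfig: &CSIPluginConfig{ID: "ceph-csi", Type: "controller"}}}}},
    }}
    webapp := &Allocation{ID: "web", Job: &Job{
        Type:       JobTypeService,
        TaskGroups: []*TaskGroup{{Tasks: []*Task{{}}}},
    }}

    fmt.Println(isDone([]*Allocation{plugin, webapp})) // false: webapp still has to migrate
    fmt.Println(isDone([]*Allocation{plugin}))         // true: only the plugin alloc is left
}

With a check like this, the node would be considered done draining as soon as only system and plugin allocations remain, instead of waiting for the deadline.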


tgross commented May 2, 2022

Oh shoot, you're right. Good catch on the IsDone(). I'll get that fixed.


tgross commented May 2, 2022

I've opened #12846 with that fix.


github-actions bot commented Oct 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
