
Jobs Unable to attach to EBS CSI Volumes When Plugin Status Reports Incorrect Controller Healthy/Expected Count #7817

Closed
tydomitrovich opened this issue Apr 28, 2020 · 5 comments

@tydomitrovich

Nomad version

Nomad Server and Clients are both running the following build:

Nomad v0.11.1 (b43457070037800fcc8442c8ff095ff4005dab33)

Operating system and Environment details

Amazon Linux 2:

4.14.173-137.229.amzn2.x86_64

Issue

While running the EBS CSI plugin, I have noticed that Nomad expects plugin tasks that have completed to still report as healthy:

$ nomad plugin status aws-ebs4
ID                   = aws-ebs4
Provider             = ebs.csi.aws.com
Version              = v0.6.0-dirty
Controllers Healthy  = 1
Controllers Expected = 2
Nodes Healthy        = 3
Nodes Expected       = 4

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
738adb4b  46e6db9e  controller  4        run      running   33m59s ago  31m16s ago
a999a840  4470dc51  controller  3        stop     complete  32m4s ago   31m25s ago
9290e85e  46e6db9e  nodes       0        run      running   42m23s ago  42m16s ago
eed3459a  ec4c06b3  nodes       0        stop     complete  42m23s ago  35m8s ago
d9ecfc6b  4470dc51  nodes       0        run      running   42m23s ago  42m8s ago
ad2698aa  eaac2f32  nodes       0        run      running   37m49s ago  37m31s ago

This seems unusual: once a CSI plugin allocation has completed, it should no longer be expected to be running and healthy. When this mismatch between healthy and expected plugin task counts occurs, all tasks that need to attach a CSI volume using the plugin in question are unable to do so. Instead of successfully mounting the volume, the following error occurs:

failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = InvalidArgument desc = Device path not provided

The plugin returns this error when information is missing from the PublishContext passed to a NodePublishVolume/NodeStageVolume RPC, as seen here.
The PublishContext is returned by a ControllerPublishVolume RPC; however, after checking the logs of my controller plugin, it turns out ControllerPublishVolume is never called.

Again, this only occurs when there is a mismatch between healthy and expected counts. Otherwise ControllerPublishVolume is called when a task requesting a CSI volume is scheduled and the volume is successfully attached.

Reproduction steps

The easiest way to create a healthy/expected value mismatch is to increase the number of controller tasks to 2 and then decrease back to 1.

  1. Run the CSI controller plugin job:
job "plugin-aws-ebs-controller" {
  datacenters = ["dc1"]

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:latest"

        args = [
          "controller",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "aws-ebs0"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully 
      kill_timeout = "2m"
    }
  }
}

  2. Run the CSI node plugin job:

job "plugin-aws-ebs-nodes" {
  datacenters = ["dc1"]

  # you can run node plugins as service jobs as well, but this ensures
  # that all nodes in the DC have a copy.
  type = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:latest"

        args = [
          "node",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

      csi_plugin {
        id        = "aws-ebs0"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully 
      kill_timeout = "2m"
    }
  }
}

  3. Create and register an EBS volume with Nomad, e.g. https://learn.hashicorp.com/nomad/stateful-workloads/csi-volumes
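For reference, a registration file for this step looks roughly like the sketch below. The `external_id` is a placeholder to replace with your own EBS volume ID, and `plugin_id` must match the `csi_plugin` id used in the plugin jobs; double-check the exact field names against the volume specification docs for your Nomad version.

```hcl
# volume.hcl -- illustrative sketch, not verified against every
# Nomad 0.11.x release; register with `nomad volume register volume.hcl`
id          = "mysql"
name        = "mysql"
type        = "csi"
external_id = "vol-0123456789abcdef0"  # placeholder: your EBS volume ID
plugin_id   = "aws-ebs0"               # must match the csi_plugin id

access_mode     = "single-node-writer"
attachment_mode = "file-system"
```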

  4. Optionally, run the example MySQL job to verify that volumes can be attached successfully. Be sure to use constraints to run the task using the volume in the same availability zone as your EBS volume.

job "mysql-server" {
  datacenters = ["dc1"]
  type        = "service"

  group "mysql-server" {
    count = 1

    volume "mysql" {
      type      = "csi"
      read_only = false
      source    = "mysql"
    }

    restart {
      attempts = 10
      interval = "5m"
      delay    = "25s"
      mode     = "delay"
    }

    task "mysql-server" {
      driver = "docker"

      volume_mount {
        volume      = "mysql"
        destination = "/srv"
        read_only   = false
      }

      env = {
        "MYSQL_ROOT_PASSWORD" = "password"
      }

      config {
        image = "hashicorp/mysql-portworx-demo:latest"
        args = ["--datadir", "/srv/mysql"]

        port_map {
          db = 3306
        }
      }

      resources {
        cpu    = 500
        memory = 1024

        network {
          port "db" {
            static = 3306
          }
        }
      }

      service {
        name = "mysql-server"
        port = "db"

        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

  5. Increment the count for the number of controller plugin tasks to 2 and wait for the new task to become healthy. Then scale down to 1 task and wait for the other task to complete.
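Concretely, this step can be done by editing the `count` on the controller job's group and re-submitting the job file with `nomad job run` (a sketch; re-submitting the edited job file is the mechanism assumed here):

```hcl
# In plugin-aws-ebs-controller.nomad, set an explicit count on the
# group, then submit the job again with `nomad job run`:
group "controller" {
  count = 2  # scale up; wait for the second allocation to become healthy

  # ... task "plugin" { ... } unchanged ...
}

# Afterwards, set count back to 1 and run `nomad job run` once more;
# the stopped allocation completes but is still counted as "expected".
```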

  6. Run nomad plugin status. You should see mismatched healthy/expected values for the controller plugin, e.g.

Container Storage Interface
ID        Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs4  ebs.csi.aws.com  1/2                           3/4

  7. Run the MySQL job. You should now see the "Device path not provided" error.

Additional Notes:

I am also seeing issues where plugins with no running jobs are not being garbage collected as described here:

#7743

$ nomad plugin status
Container Storage Interface
ID        Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs0  ebs.csi.aws.com  0/3                           0/29
aws-ebs2  ebs.csi.aws.com  0/2                           0/25
aws-ebs3  ebs.csi.aws.com  0/2                           0/3
aws-ebs4  ebs.csi.aws.com  1/2                           3/4

Not sure if this is related, but I figured it was worth mentioning.


tgross commented Apr 28, 2020

Hi @tydomitrovich! Thanks for the thorough reproduction!

This seems unusual since if a CSI plugin has completed it should no longer be expected to be running and healthy. When this mismatch between healthy and expected plugin task counts occurs, all tasks that need to attach a CSI volume using the plugin in question are unable to do so.

Yeah, agreed that this is totally a bug. That'll impact updates to plugins too, I think. I don't have a good workaround for you at the moment but I'll dig in and see if I can come up with a fix shortly.

@tgross tgross added this to the 0.11.2 milestone Apr 28, 2020
@tydomitrovich (Author)

Hello @tgross, thanks for taking a look! I will be monitoring as I am really excited about using the new CSI features.


tgross commented Apr 30, 2020

I'm working up a PR, #7844, which should clear this up. I need to check a few more things, but I'm making good progress on it.


tgross commented May 5, 2020

#7844 has been merged. This will ship in 0.11.2.


github-actions bot commented Nov 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 8, 2022