
Jobs Unable to attach to EBS CSI Volumes When Plugin Status Reports Incorrect Controller Healthy/Expected Count #7817

Closed
tydomitrovich opened this issue Apr 28, 2020 · 5 comments

@tydomitrovich

Nomad version

Nomad Server and Clients are both running the following build:

Nomad v0.11.1 (b43457070037800fcc8442c8ff095ff4005dab33)

Operating system and Environment details

Amazon Linux 2:

4.14.173-137.229.amzn2.x86_64

Issue

While running the EBS CSI plugin, I have noticed that Nomad expects plugin tasks that have completed to still report as healthy:

$ nomad plugin status aws-ebs4
ID                   = aws-ebs4
Provider             = ebs.csi.aws.com
Version              = v0.6.0-dirty
Controllers Healthy  = 1
Controllers Expected = 2
Nodes Healthy        = 3
Nodes Expected       = 4

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
738adb4b  46e6db9e  controller  4        run      running   33m59s ago  31m16s ago
a999a840  4470dc51  controller  3        stop     complete  32m4s ago   31m25s ago
9290e85e  46e6db9e  nodes       0        run      running   42m23s ago  42m16s ago
eed3459a  ec4c06b3  nodes       0        stop     complete  42m23s ago  35m8s ago
d9ecfc6b  4470dc51  nodes       0        run      running   42m23s ago  42m8s ago
ad2698aa  eaac2f32  nodes       0        run      running   37m49s ago  37m31s ago

This seems unusual: once a CSI plugin allocation has completed, it should no longer be expected to be running and healthy. When this mismatch between healthy and expected plugin task counts occurs, all tasks that need to attach a CSI volume using the plugin in question are unable to do so. Instead of successfully mounting the volume, the following error occurs:

failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = InvalidArgument desc = Device path not provided

The plugin returns this error when information is missing from the PublishContext passed to a NodePublishVolume/NodeStageVolume RPC, as seen here.
The PublishContext is returned by a ControllerPublishVolume RPC; however, after checking the logs of my controller plugin, it turns out ControllerPublishVolume is never called.

Again, this only occurs when there is a mismatch between healthy and expected counts. Otherwise ControllerPublishVolume is called when a task requesting a CSI volume is scheduled and the volume is successfully attached.

Reproduction steps

The easiest way to create a healthy/expected value mismatch is to increase the number of controller tasks to 2 and then decrease back to 1.

  1. Run the CSI controller plugin job:
job "plugin-aws-ebs-controller" {
  datacenters = ["dc1"]

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:latest"

        args = [
          "controller",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "aws-ebs0"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully 
      kill_timeout = "2m"
    }
  }
}

  2. Run the CSI node plugin job:

job "plugin-aws-ebs-nodes" {
  datacenters = ["dc1"]

  # you can run node plugins as service jobs as well, but this ensures
  # that all nodes in the DC have a copy.
  type = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:latest"

        args = [
          "node",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

      csi_plugin {
        id        = "aws-ebs0"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully 
      kill_timeout = "2m"
    }
  }
}

  3. Create and register an EBS volume with Nomad, e.g. https://learn.hashicorp.com/nomad/stateful-workloads/csi-volumes
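For reference, a registration file for this step looks roughly like the sketch below. The `external_id` is a placeholder to replace with your own EBS volume ID, and `plugin_id` must match the `csi_plugin` id used in the plugin jobs; double-check the exact field names against the volume specification docs for your Nomad version.

```hcl
# volume.hcl -- illustrative sketch, not verified against every
# Nomad 0.11.x release; register with `nomad volume register volume.hcl`
id          = "mysql"
name        = "mysql"
type        = "csi"
external_id = "vol-0123456789abcdef0"  # placeholder: your EBS volume ID
plugin_id   = "aws-ebs0"               # must match the csi_plugin id

access_mode     = "single-node-writer"
attachment_mode = "file-system"
```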

  4. Optionally, run the example MySQL job to verify that volumes can be attached successfully. Be sure to use constraints to run the task using the volume in the same availability zone as your EBS volume.

job "mysql-server" {
  datacenters = ["dc1"]
  type        = "service"

  group "mysql-server" {
    count = 1

    volume "mysql" {
      type      = "csi"
      read_only = false
      source    = "mysql"
    }

    restart {
      attempts = 10
      interval = "5m"
      delay    = "25s"
      mode     = "delay"
    }

    task "mysql-server" {
      driver = "docker"

      volume_mount {
        volume      = "mysql"
        destination = "/srv"
        read_only   = false
      }

      env = {
        "MYSQL_ROOT_PASSWORD" = "password"
      }

      config {
        image = "hashicorp/mysql-portworx-demo:latest"
        args = ["--datadir", "/srv/mysql"]

        port_map {
          db = 3306
        }
      }

      resources {
        cpu    = 500
        memory = 1024

        network {
          port "db" {
            static = 3306
          }
        }
      }

      service {
        name = "mysql-server"
        port = "db"

        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

  5. Increment the count for the number of controller plugin tasks to 2 and wait for the new task to become healthy. Then scale down to 1 task and wait for the other task to complete.
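Concretely, this step can be done by editing the `count` on the controller job's group and re-submitting the job file with `nomad job run` (a sketch; re-submitting the edited job file is the mechanism assumed here):

```hcl
# In plugin-aws-ebs-controller.nomad, set an explicit count on the
# group, then submit the job again with `nomad job run`:
group "controller" {
  count = 2  # scale up; wait for the second allocation to become healthy

  # ... task "plugin" { ... } unchanged ...
}

# Afterwards, set count back to 1 and run `nomad job run` once more;
# the stopped allocation completes but is still counted as "expected".
```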

  6. Run nomad plugin status. You should see mismatched healthy/expected values for the controller plugin, e.g.

Container Storage Interface
ID        Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs4  ebs.csi.aws.com  1/2                           3/4

  7. Run the MySQL job. You should now see the "Device path not provided" error.

Additional Notes:

I am also seeing issues where plugins with no running jobs are not being garbage collected as described here:

#7743

$ nomad plugin status
Container Storage Interface
ID        Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs0  ebs.csi.aws.com  0/3                           0/29
aws-ebs2  ebs.csi.aws.com  0/2                           0/25
aws-ebs3  ebs.csi.aws.com  0/2                           0/3
aws-ebs4  ebs.csi.aws.com  1/2                           3/4

Not sure if this is related, but I figured it was worth mentioning.


tgross commented Apr 28, 2020

Hi @tydomitrovich! Thanks for the thorough reproduction!

This seems unusual since if a CSI plugin has completed it should no longer be expected to be running and healthy. When this mismatch between healthy and expected plugin task counts occurs, all tasks that need to attach a CSI volume using the plugin in question are unable to do so.

Yeah, agreed that this is totally a bug. That'll impact updates to plugins too, I think. I don't have a good workaround for you at the moment but I'll dig in and see if I can come up with a fix shortly.

@tgross tgross added this to the 0.11.2 milestone Apr 28, 2020
@tydomitrovich (Author)

Hello @tgross, thanks for taking a look! I will be monitoring as I am really excited about using the new CSI features.


tgross commented Apr 30, 2020

I'm working up a PR, #7844, which should clear this up. I need to check a few more things, but I'm making good progress on it.


tgross commented May 5, 2020

#7844 has been merged. This will ship in 0.11.2.


github-actions bot commented Nov 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 8, 2022