CSI: Migrating Allocation w/ CSI Volume to New Node Fails #8080

Closed

kainoaseto opened this issue May 30, 2020 · 8 comments

Comments

@kainoaseto

Nomad version

Nomad servers and clients are both running this version:
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)

Operating system and Environment details

Amazon Linux 2:
4.14.173-137.229.amzn2.x86_64

EBS CSI Plugin:

ID                   = aws-ebs1
Provider             = ebs.csi.aws.com
Version              = v0.6.0-dirty
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 4
Nodes Expected       = 4

Issue

What I've observed is that when a Nomad client is drained along with its system jobs (either because the client is being shut down, or because the node is marked ineligible and both system and service jobs are drained), the CSI volume gets into a bad state.

When the job that owns the CSI volume is being removed from the node, the ControllerUnpublishVolume RPC call is executed but is canceled via context:

ebs-csi-controller-plugin-service E0528 00:06:23.578472 1 driver.go:109] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-0c17xxxxxxxxxxxxx" from node "i-0e58xxxxxxxxxxxxx": RequestCanceled: request context canceled
ebs-csi-controller-plugin-service I0528 00:06:06.271365 1 controller.go:275] ControllerUnpublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0e58xxxxxxxxxxxxx Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Because this call never completes, the EBS volume remains attached to the old node/instance. When the job is scheduled onto another node, the controller gets a ControllerPublishVolume call, but it fails with an AlreadyExists error:

ebs-csi-controller-plugin-service I0528 00:07:42.336709 1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0c17xxxxxxxxxxxxx
ebs-csi-controller-plugin-service E0528 00:07:42.336732 1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
ebs-csi-controller-plugin-service I0528 00:07:41.953571 1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0394xxxxxxxxxxxxx VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

I looked into this error a bit. According to the CSI spec, ALREADY_EXISTS should "Indicate that a volume corresponding to the specified volume_id has already been published at the node corresponding to the specified node_id but is incompatible with the specified volume_capability or readonly flag."
That is not what is happening here, so the EBS CSI plugin appears to implement this incorrectly. It seems it should be returning FAILED_PRECONDITION instead, since under the CSI spec that more accurately reflects a volume that is still attached to a different node.

I've filed an issue on the EBS CSI plugin to get their input, since that behavior seems to be out of spec. The relevant line of code also shows why this error is being thrown (the volume is simply still attached to the other node).

Anyway, once that was cleared up, looking back at the CSI controller logs I could see that Nomad calls ControllerUnpublishVolume but also issues ControllerPublishVolume before the unpublish is able to complete and free the volume. This results in a sometimes endless loop in which ControllerUnpublishVolume tries to complete before ControllerPublishVolume executes and appears to cancel it. In some cases this never resolves; in others it resolves after a few tries. Here is a small snippet of logs showing this:

ebs-csi-controller-plugin-service E0528 00:09:29.862416 1 driver.go:109] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-0c17xxxxxxxxxxxxx" from node "i-0e58xxxxxxxxxxxxx": RequestCanceled: request context canceled caused by: context canceled
ebs-csi-controller-plugin-service I0528 00:09:27.636101 1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
ebs-csi-controller-plugin-service E0528 00:09:13.690648 1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
ebs-csi-controller-plugin-service I0528 00:09:13.690622 1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0c17xxxxxxxxxxxxx
ebs-csi-controller-plugin-service I0528 00:09:13.279797 1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0394xxxxxxxxxxxxx VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
ebs-csi-controller-plugin-service I0528 00:09:12.449520 1 controller.go:275] ControllerUnpublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0e58xxxxxxxxxxxxx Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Just to reiterate, this issue only happens when the volume is still attached to another node (EC2 instance) and the controller isn't able to detach it because the request is being cancelled. The migration will sometimes complete when the volume can be detached successfully before the ControllerPublishVolume call goes through, but from first observations it seems somewhat random whether that happens.

Reproduction steps

The easiest way I've found to reproduce this issue somewhat consistently is to schedule a job that uses a CSI volume and then do a non-forced, 3-minute drain (including system jobs) on the node. When the job that needs the CSI volume is rescheduled to a new node in the same AZ, this issue can be observed.

  1. Start the CSI controller plugin
job "plugin-aws-ebs-controller" {
  datacenters = ["sandbox"]

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:latest"

        args = [
          "controller",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "aws-ebs1"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully 
      kill_timeout = "2m"
    }
  }
}
  2. Start the CSI node job
job "plugin-aws-ebs-nodes" {
  datacenters = ["dc1"]

  # you can run node plugins as service jobs as well, but this ensures
  # that all nodes in the DC have a copy.
  type = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:latest"

        args = [
          "node",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

      csi_plugin {
        id        = "aws-ebs1"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully 
      kill_timeout = "2m"
    }
  }
}
  3. Register an EBS volume in the same region/AZ. I followed the steps here, using us-west-2b-a as my volume ID/name (a rough sketch of the registration spec is shown below):
    https://learn.hashicorp.com/nomad/stateful-workloads/csi-volumes
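This is roughly what the registration spec looks like; the external_id is a placeholder for the real EBS volume ID, and the access/attachment modes are the ones from the guide. It gets registered with nomad volume register:

id              = "us-west-2b-a"
name            = "us-west-2b-a"
type            = "csi"
# placeholder: the real EBS volume ID goes here
external_id     = "vol-0c17xxxxxxxxxxxxx"
plugin_id       = "aws-ebs1"
access_mode     = "single-node-writer"
attachment_mode = "file-system"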

  4. Register the mysql job:

job "mysql-server" {
  datacenters = ["sandbox"]
  type        = "service"
  group "mysql-server" {
    constraint {
      attribute = "${node.class}"
      operator  = "="
      value     = "experiment"
    }
    # filtering task group to the AZ our EBS volume is located in
    constraint {
      attribute = "${attr.platform.aws.placement.availability-zone}"
      operator  = "="
      value     = "us-west-2b"
    }
    count = 1
    volume "mysql" {
      type      = "csi"
      read_only = false
      source    = "us-west-2b-a"
    }
    restart {
      attempts = 10
      interval = "5m"
      delay    = "25s"
      mode     = "delay"
    }
    task "mysql-server" {
      driver = "docker"
      volume_mount {
        volume      = "mysql"
        destination = "/srv"
        read_only   = false
      }
      env = {
        "MYSQL_ROOT_PASSWORD" = "password"
      }
      config {
        image = "hashicorp/mysql-portworx-demo:latest"
        args = ["--datadir", "/srv/mysql"]
        port_map {
          db = 3306
        }
      }
      resources {
        cpu    = 500
        memory = 1024
        network {
          port "db" {
            static = 3306
          }
        }
      }
      service {
        name = "mysql-server"
        port = "db"
        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
  5. Mark the client that the job with the CSI volume is running on as ineligible, then drain the node with a 3m deadline including system jobs (e.g. nomad node drain -enable -deadline 3m <node-id>). The node should have events something like this when complete:
May 25, '20 19:28:06 -0700      CSI     failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
May 25, '20 19:27:51 -0700      Storage Unmount volume
May 25, '20 19:27:46 -0700      CSI     failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
May 25, '20 19:27:46 -0700      Drain   Node drain complete
May 25, '20 19:24:31 -0700      CSI     failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
May 25, '20 19:24:29 -0700      Drain   Node drain strategy set
May 25, '20 19:23:58 -0700      Cluster Node marked as ineligible for scheduling
May 25, '20 19:22:45 -0700      Storage Mount volume
  6. When the job is rescheduled to a new node, you should be able to observe the scenario above in the ebs-csi-controller logs: ControllerPublishVolume and ControllerUnpublishVolume being called and stepping on each other. It may take a few tries, since the results can differ each run. Sometimes it will self-resolve if the unpublish is called early enough, or if the volume is successfully detached while the node is draining.

Nomad Task Event Log

Setup Failure failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: rpc error: controller publish: attach volume: rpc error: controller attach volume: rpc error: code = AlreadyExists desc = Resource already exists
 Received Task received by client
@tgross
Member

tgross commented Jun 22, 2020

Hi @kainoaseto! I'm looking into this!

Possibly related: #8232

@tgross
Member

tgross commented Jun 23, 2020

After doing a little bit more testing, it looks like the other failure mode we can see here is that the volumewatcher runs asynchronously, so it can try to NodeUnpublish after the node plugin has been removed from the node, which results in an error like:

2020-06-23T13:08:37.919Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=ebs-vol0 error="1 error occurred:
        * could not detach from node: rpc error: node detach volume: plugin aws-ebs0 for type csi-node not found

"

@tgross
Member

tgross commented Jun 23, 2020

@kainoaseto I'm investigating how to fix this, but in the meantime I do want to give you what I think will work as a workaround for this class of problem.

  • Run multiple controller plugins, spread across the cluster, so that a node that drains always has a controller to talk to (see the sketch after this list).
  • Run the node plugins as system jobs, as you're doing.
  • Use the -ignore-system flag while draining the node so that the node plugin stays on the host even after it's done draining, to avoid the race with the volumewatcher.
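For example, here's a rough sketch of the controller job from the reproduction steps above, adjusted to run two controller instances on distinct clients (the count of 2 and the distinct_hosts constraint are illustrative, not required values):

job "plugin-aws-ebs-controller" {
  datacenters = ["sandbox"]

  group "controller" {
    # run more than one controller, forced onto different clients, so a
    # draining node always leaves another controller reachable
    count = 2

    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:latest"

        args = [
          "controller",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "aws-ebs1"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      kill_timeout = "2m"
    }
  }
}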

@kainoaseto
Author

Hi @tgross, thank you for looking into this! Ah, that's interesting; I believe I've run into that one at some point, with mixed results reproducing it consistently. I'm interested to see how that one will be resolved; I'm guessing some kind of synchronous call, or waiting for all the volumes to be unpublished before completing the drain, would need to happen.

Thank you for the workaround! I will test that as a possible solution in the meantime. One problem I might still run into is that the Nomad clients used for this cluster are spot instances (provisioned through spot.io/spotinst). We get about a 5-minute drain window before the instance is terminated, and I believe that forces system jobs to drain off as well, which is another way I've been able to reproduce this issue. I'll see if it's possible to modify the drain to ignore system jobs for spot instances to help facilitate this fix, but I'm not sure whether that's possible.

I have tried running multiple controllers, but if one of them gets shut down (for example, the node it was on drained and it started up on a new node), I've run into this bug: #8034. It does look like there is a workaround of restarting the controllers if they get into a bad state, though.

@tgross
Member

tgross commented Aug 7, 2020

There's a status update on this issue here: #8232 (comment)

@tgross
Member

tgross commented Aug 10, 2020

For the sake of our planning, I'm going to close this issue and we'll continue to track progress in #8232.

@tgross tgross closed this as completed Aug 10, 2020
@thatsk

thatsk commented Jul 20, 2022

@tgross #13860

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 18, 2022