CSI: Migrating Allocation w/ CSI Volume to New Node Fails #8080

Closed

kainoaseto opened this issue May 30, 2020 · 8 comments

Comments

@kainoaseto

Nomad version

Nomad servers and clients are both running this version:
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)

Operating system and Environment details

Amazon Linux 2:
4.14.173-137.229.amzn2.x86_64

EBS CSI Plugin:

ID                   = aws-ebs1
Provider             = ebs.csi.aws.com
Version              = v0.6.0-dirty
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 4
Nodes Expected       = 4

Issue

What I've observed is that when a Nomad client is drained along with its system jobs (either because the client is being shut down, or because the node is marked ineligible and both system and service jobs are drained), the CSI volume gets into a bad state.

When the job that owns the CSI volume is being removed from the node, the ControllerUnpublishVolume RPC call is executed but is canceled via context:

ebs-csi-controller-plugin-service E0528 00:06:23.578472 1 driver.go:109] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-0c17xxxxxxxxxxxxx" from node "i-0e58xxxxxxxxxxxxx": RequestCanceled: request context canceled
ebs-csi-controller-plugin-service I0528 00:06:06.271365 1 controller.go:275] ControllerUnpublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0e58xxxxxxxxxxxxx Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Because this call never completes, the EBS volume remains attached to the old node/instance. When the job is scheduled onto another node, the controller gets a ControllerPublishVolume call, but it fails with an AlreadyExists error:

ebs-csi-controller-plugin-service I0528 00:07:42.336709 1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0c17xxxxxxxxxxxxx
ebs-csi-controller-plugin-service E0528 00:07:42.336732 1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
ebs-csi-controller-plugin-service I0528 00:07:41.953571 1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0394xxxxxxxxxxxxx VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

I looked into this error a bit. According to the CSI spec, ALREADY_EXISTS should "Indicate that a volume corresponding to the specified volume_id has already been published at the node corresponding to the specified node_id but is incompatible with the specified volume_capability or readonly flag."
That is not what is happening here, so the EBS CSI plugin appears to implement this incorrectly. It seems it should be returning FAILED_PRECONDITION instead, since under the CSI spec that more accurately reflects a volume that is still attached to a different node.

I've filed an issue on the EBS CSI plugin to get their input, since that behavior seems to be out of spec. The relevant line of code also shows why this error is being thrown (the volume is simply still attached to the other node).

Anyway, once that was cleared up, looking back at the CSI controller logs I could see that Nomad calls ControllerUnpublishVolume but also issues ControllerPublishVolume before the unpublish is able to complete and free the volume. This results in a sometimes endless loop in which ControllerUnpublishVolume tries to complete before ControllerPublishVolume executes and appears to cancel it. In some cases this never resolves; in others it resolves after a few tries. Here is a small snippet of logs showing this:

ebs-csi-controller-plugin-service E0528 00:09:29.862416 1 driver.go:109] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-0c17xxxxxxxxxxxxx" from node "i-0e58xxxxxxxxxxxxx": RequestCanceled: request context canceled caused by: context canceled
ebs-csi-controller-plugin-service I0528 00:09:27.636101 1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
ebs-csi-controller-plugin-service E0528 00:09:13.690648 1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
ebs-csi-controller-plugin-service I0528 00:09:13.690622 1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0c17xxxxxxxxxxxxx
ebs-csi-controller-plugin-service I0528 00:09:13.279797 1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0394xxxxxxxxxxxxx VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
ebs-csi-controller-plugin-service I0528 00:09:12.449520 1 controller.go:275] ControllerUnpublishVolume: called with args {VolumeId:vol-0c17xxxxxxxxxxxxx NodeId:i-0e58xxxxxxxxxxxxx Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Just to reiterate, this issue only happens when the volume is still attached to another node (EC2 instance) and the controller isn't able to detach it because the request is being cancelled. The migration will sometimes complete when the volume can be detached successfully before the ControllerPublishVolume call goes through, but from first observations it seems somewhat random whether that happens.

Reproduction steps

The easiest way I've found to reproduce this issue somewhat consistently is to schedule a job that uses a CSI volume and then do a non-forced, 3-minute drain (including system jobs) on the node. When the job that needs the CSI volume is rescheduled to a new node in the same AZ, this issue can be observed.

  1. Start the CSI controller plugin
job "plugin-aws-ebs-controller" {
  datacenters = ["sandbox"]

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:latest"

        args = [
          "controller",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "aws-ebs1"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully 
      kill_timeout = "2m"
    }
  }
}
  2. Start the CSI node job
job "plugin-aws-ebs-nodes" {
  datacenters = ["dc1"]

  # you can run node plugins as service jobs as well, but this ensures
  # that all nodes in the DC have a copy.
  type = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:latest"

        args = [
          "node",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

      csi_plugin {
        id        = "aws-ebs1"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      # ensuring the plugin has time to shut down gracefully 
      kill_timeout = "2m"
    }
  }
}
  3. Register an EBS volume in the same region/AZ. I followed the steps here, using us-west-2b-a as my volume ID/name (a rough sketch of the registration spec is shown below):
    https://learn.hashicorp.com/nomad/stateful-workloads/csi-volumes
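This is roughly what the registration spec looks like; the external_id is a placeholder for the real EBS volume ID, and the access/attachment modes are the ones from the guide. It gets registered with nomad volume register:

id              = "us-west-2b-a"
name            = "us-west-2b-a"
type            = "csi"
# placeholder: the real EBS volume ID goes here
external_id     = "vol-0c17xxxxxxxxxxxxx"
plugin_id       = "aws-ebs1"
access_mode     = "single-node-writer"
attachment_mode = "file-system"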

  4. Register the mysql job:

job "mysql-server" {
  datacenters = ["sandbox"]
  type        = "service"
  group "mysql-server" {
    constraint {
      attribute = "${node.class}"
      operator  = "="
      value     = "experiment"
    }
    # filtering task group to the AZ our EBS volume is located in
    constraint {
      attribute = "${attr.platform.aws.placement.availability-zone}"
      operator  = "="
      value     = "us-west-2b"
    }
    count = 1
    volume "mysql" {
      type      = "csi"
      read_only = false
      source    = "us-west-2b-a"
    }
    restart {
      attempts = 10
      interval = "5m"
      delay    = "25s"
      mode     = "delay"
    }
    task "mysql-server" {
      driver = "docker"
      volume_mount {
        volume      = "mysql"
        destination = "/srv"
        read_only   = false
      }
      env = {
        "MYSQL_ROOT_PASSWORD" = "password"
      }
      config {
        image = "hashicorp/mysql-portworx-demo:latest"
        args = ["--datadir", "/srv/mysql"]
        port_map {
          db = 3306
        }
      }
      resources {
        cpu    = 500
        memory = 1024
        network {
          port "db" {
            static = 3306
          }
        }
      }
      service {
        name = "mysql-server"
        port = "db"
        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
  5. Mark the client that the job with the CSI volume is running on as ineligible, then drain the node with a 3m deadline including system jobs (e.g. nomad node drain -enable -deadline 3m <node-id>). The node should have events something like this when complete:
May 25, '20 19:28:06 -0700      CSI     failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
May 25, '20 19:27:51 -0700      Storage Unmount volume
May 25, '20 19:27:46 -0700      CSI     failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
May 25, '20 19:27:46 -0700      Drain   Node drain complete
May 25, '20 19:24:31 -0700      CSI     failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
May 25, '20 19:24:29 -0700      Drain   Node drain strategy set
May 25, '20 19:23:58 -0700      Cluster Node marked as ineligible for scheduling
May 25, '20 19:22:45 -0700      Storage Mount volume
  6. When the job is rescheduled to a new node, you should be able to observe the scenario above in the ebs-csi-controller logs: ControllerPublishVolume and ControllerUnpublishVolume being called and stepping on each other. It may take a few tries, since the results can differ each run. Sometimes it will self-resolve if the unpublish is called early enough, or if the volume is successfully detached while the node is draining.

Nomad Task Event Log

Setup Failure failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: rpc error: controller publish: attach volume: rpc error: controller attach volume: rpc error: code = AlreadyExists desc = Resource already exists
 Received Task received by client
@tgross
Member

tgross commented Jun 22, 2020

Hi @kainoaseto! I'm looking into this!

Possibly related: #8232

@tgross
Member

tgross commented Jun 23, 2020

After doing a little bit more testing, it looks like the other failure mode we can see here is that the volumewatcher runs asynchronously, so it can try to NodeUnpublish after the node plugin has been removed from the node, which results in an error like:

2020-06-23T13:08:37.919Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=ebs-vol0 error="1 error occurred:
        * could not detach from node: rpc error: node detach volume: plugin aws-ebs0 for type csi-node not found

"

@tgross
Member

tgross commented Jun 23, 2020

@kainoaseto I'm investigating how to fix this, but in the meantime I do want to give you what I think will work as a workaround for this class of problem.

  • Run multiple controller plugins, spread across the cluster, so that a node that drains always has a controller to talk to (see the sketch after this list).
  • Run the node plugins as system jobs, as you're doing.
  • Use the -ignore-system flag while draining the node so that the node plugin stays on the host even after it's done draining, to avoid the race with the volumewatcher.
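For example, here's a rough sketch of the controller job from the reproduction steps above, adjusted to run two controller instances on distinct clients (the count of 2 and the distinct_hosts constraint are illustrative, not required values):

job "plugin-aws-ebs-controller" {
  datacenters = ["sandbox"]

  group "controller" {
    # run more than one controller, forced onto different clients, so a
    # draining node always leaves another controller reachable
    count = 2

    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-ebs-csi-driver:latest"

        args = [
          "controller",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "aws-ebs1"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

      kill_timeout = "2m"
    }
  }
}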

@kainoaseto
Author

Hi @tgross, thank you for looking into this! Ah, that's interesting; I believe I've run into that one at some point, with mixed results reproducing it consistently. I'm interested to see how that one will be resolved; I'm guessing some kind of synchronous call, or waiting for all the volumes to be unpublished before completing the drain, would need to happen.

Thank you for the workaround! I will test that as a possible solution in the meantime. One problem I might still run into is that the Nomad clients used for this cluster are spot instances (provisioned through spot.io/spotinst). We get about a 5-minute drain window before the instance is terminated, and I believe that forces system jobs to drain off as well, which is another way I've been able to reproduce this issue. I'll see if it's possible to modify the drain to ignore system jobs for spot instances to help facilitate this fix, but I'm not sure whether that's possible.

I have tried running multiple controllers, but if one of them gets shut down (for example, the node it was on drained and it started up on a new node), I've run into this bug: #8034. It does look like there is a workaround of restarting the controllers if they get into a bad state, though.

@tgross
Member

tgross commented Aug 7, 2020

There's a status update on this issue here: #8232 (comment)

@tgross
Member

tgross commented Aug 10, 2020

For the sake of our planning, I'm going to close this issue and we'll continue to track progress in #8232.

@tgross tgross closed this as completed Aug 10, 2020
@thatsk

thatsk commented Jul 20, 2022

@tgross #13860

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 18, 2022