
Fail to detach ceph csi volume from a down node and migrate to another #13450

Closed
enaftali2 opened this issue Jun 21, 2022 · 4 comments

@enaftali2

Nomad version

Nomad v1.3.0

Operating system and Environment details

Ubuntu 18.04.6 LTS

Issue

Hi,
We are testing Ceph storage with the Nomad CSI volume plugin. For the POC I created 3 VMs on GCP running a Ceph cluster and a Nomad cluster, with both the client and server roles on all 3 VMs. The CSI plugin, volume creation, and attachment all work very well. I'm running a MySQL job, and when I restart the node running the SQL task I can see the job migrate to another node with the volume and its data.
FYI - to run the CSI plugin, the SQL job, and the volume creation I used the guide in the Ceph documentation - https://docs.ceph.com/en/latest/rbd/rbd-nomad/

The issue starts when I perform shutdown -h now on the node running the SQL task. After about 10 minutes the allocation is marked as Lost and a new allocation tries to start, but it gets stuck in Pending status forever.
As you can see in the logs below, Nomad fails to detach the volume from the node that is currently down.
I also want to mention that even though Ceph also lost 1 node in this test, it still seems to be working and accessible, and there are no errors in the CSI plugin.

Reproduction steps

Shut down the machine running a job that uses a Ceph volume created through the CSI plugin.

Expected Result

The job with the external volume migrates to another node with the same volume attached.

Actual Result

The job tries to migrate and gets stuck in Pending.

CSI Job files

ceph-csi-plugin-controller.nomad

job "ceph-csi-plugin-controller" {
  datacenters = ["dc1"]
  group "controller" {
    network {
      port "metrics" {}
    }
    task "ceph-controller" {
      template {
        data        = <<EOF
[{
    "clusterID": "b9127830-b0cc-4e34-aa47-9d1a2e9949a8",
    "monitors": [
        "10.155.0.16",
        "10.155.0.54",
        "10.155.0.59"
    ]
}]
EOF
        destination = "local/config.json"
        change_mode = "restart"
      }
      driver = "docker"
      config {
        image = "quay.io/cephcsi/cephcsi:v3.3.1"
        volumes = [
          "./local/config.json:/etc/ceph-csi-config/config.json"
        ]
        mounts = [
          {
            type     = "tmpfs"
            target   = "/tmp/csi/keys"
            readonly = false
            tmpfs_options = {
              size = 1000000 # size in bytes
            }
          }
        ]
        args = [
          "--type=rbd",
          "--controllerserver=true",
          "--drivername=rbd.csi.ceph.com",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-controller",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5",
          "--metricsport=$${NOMAD_PORT_metrics}"
        ]
      }
      resources {
        cpu    = 500
        memory = 256
      }
      service {
        name = "ceph-csi-controller"
        port = "metrics"
        tags = [ "prometheus" ]
      }
      csi_plugin {
        id        = "ceph-csi"
        type      = "controller"
        mount_dir = "/csi"
      }
    }
  }
}

ceph-csi-plugin-nodes.nomad

job "ceph-csi-plugin-nodes" {
  datacenters = ["dc1"]
  type        = "system"
  group "nodes" {
    network {
      port "metrics" {}
    }
    task "ceph-node" {
      driver = "docker"
      template {
        data        = <<EOF
[{
    "clusterID": "b9127830-b0cc-4e34-aa47-9d1a2e9949a8",
    "monitors": [
        "10.155.0.16",
        "10.155.0.54",
        "10.155.0.59"
    ]
}]
EOF
        destination = "local/config.json"
        change_mode = "restart"
      }
      config {
        image = "quay.io/cephcsi/cephcsi:v3.3.1"
        volumes = [
          "./local/config.json:/etc/ceph-csi-config/config.json"
        ]
        mounts = [
          {
            type     = "tmpfs"
            target   = "/tmp/csi/keys"
            readonly = false
            tmpfs_options = {
              size = 1000000 # size in bytes
            }
          }
        ]
        args = [
          "--type=rbd",
          "--drivername=rbd.csi.ceph.com",
          "--nodeserver=true",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--instanceid=${node.unique.name}-nodes",
          "--pidlimit=-1",
          "--logtostderr=true",
          "--v=5",
          "--metricsport=$${NOMAD_PORT_metrics}"
        ]
        privileged = true
      }
      resources {
        cpu    = 500
        memory = 256
      }
      service {
        name = "ceph-csi-nodes"
        port = "metrics"
        tags = [ "prometheus" ]
      }
      csi_plugin {
        id        = "ceph-csi"
        type      = "node"
        mount_dir = "/csi"
      }
    }
  }
}

Volume file

ceph-volume.hcl

id = "ceph-mysql2"
name = "ceph-mysql2"
external_id = "ceph-mysql2"
type = "csi"
plugin_id = "ceph-csi"
capacity_max = "5G"
capacity_min = "2G"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

secrets {
  userID  = "nomad"
  userKey = "AQAlh9Rgg2vrDxAARy25T7KHabs6iskSHpAEAQ=="
}

context {
  clusterID = "b9127830-b0cc-4e34-aa47-9d1a2e9949a8"
  pool = "nomad"
  imageFeatures = "layering"
}

parameters {
  clusterID = "b9127830-b0cc-4e34-aa47-9d1a2e9949a8"
  pool = "nomad"
  imageFeatures = "layering"
}

Mysql job file

mysql.nomad

job "mysql-server2" {
  datacenters = ["dc1"]
  type        = "service"
  group "mysql-server" {
    count = 1
    volume "ceph-mysql2" {
      type      = "csi"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"
      read_only = false
      source    = "ceph-mysql2"
    }
    network {
      port "db" {
        to = 3306
      }
    }
    restart {
      attempts = 10
      interval = "5m"
      delay    = "25s"
      mode     = "delay"
    }
    task "mysql-server" {
      driver = "docker"
      volume_mount {
        volume      = "ceph-mysql2"
        destination = "/srv"
        read_only   = false
      }
      env {
        MYSQL_ROOT_PASSWORD = "password"
      }
      config {
        image = "hashicorp/mysql-portworx-demo:latest"
        args  = ["--datadir", "/srv/mysql"]
        ports = ["db"]
      }
      resources {
        cpu    = 500
        memory = 1024
      }
      service {
        name = "mysql-server"
        port = "db"
        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Nomad logs

2022-06-21T15:35:00.035Z [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
2022-06-21T15:35:00.036Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.155.0.16:4647
2022-06-21T15:35:00.036Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.155.0.16:4647
2022-06-21T15:35:57.747Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=ceph-mysql2
  error=
  | 1 error occurred:
  | 	* could not detach from node: No path to node
  |
@tgross
Member

tgross commented Jun 21, 2022

Hi @enaftali2! This issue has been fixed in #13301 and will ship in Nomad 1.3.2. Essentially the problem is that there's no way for the server to send a node unpublish command to the node plugin that's running on a down node without violating the CSI spec. We've decided to break strict compliance in order to make non-graceful shutdown work.

In the meantime, you can avoid this condition by draining a node before shutting it down.

@tgross tgross closed this as completed Jun 21, 2022
@tgross tgross added this to the 1.3.2 milestone Jun 21, 2022
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jun 21, 2022
@tgross tgross moved this from Needs Triage to Done in Nomad - Community Issues Triage Jun 21, 2022
@enaftali2
Author

enaftali2 commented Jun 26, 2022

Hi @tgross, thanks, you were very helpful. Since I saw the fix was merged to master, I built a new binary from master with the fix I need, and the issue is fixed; the cluster behaves as expected.
I have a question: after I ungracefully shut down the machine, the alloc very quickly shows up in the Lost state, but it then takes about 5-6 minutes for the alloc to be in the Running state again; it's in Pending all that time.
Is there a way to make this time shorter, or will there be in the future?
This is the log I see in nomad monitor:
2022-06-26T14:33:58.529Z [ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"
Is there a way to increase the max claims for the volume?

@tgross
Member

tgross commented Jun 27, 2022

I have a question: after I ungracefully shut down the machine, the alloc very quickly shows up in the Lost state, but it then takes about 5-6 minutes for the alloc to be in the Running state again; it's in Pending all that time.
Is there a way to make this time shorter, or will there be in the future?

That timeout is governed by the client heartbeat timeout, which isn't currently configurable. You can also force your jobs to stop immediately by setting a stop_after_client_disconnect timeout, but note that this will impact cases where the client agent has simply restarted (which is normally fine). The best way to handle this case is to drain nodes before stopping them whenever possible.
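
For reference, a minimal sketch of how that could look in the mysql-server2 job above, assuming the group-level stop_after_client_disconnect attribute behaves as described (check the documentation for your Nomad version):

group "mysql-server" {
  # Assumed attribute: stops the allocation once the client has been
  # disconnected for this long, so the scheduler can replace it (and the
  # volume claim can be released) sooner than the default lost-node handling.
  stop_after_client_disconnect = "1m"

  # ... volume, network, restart, and task blocks unchanged ...
}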

Is there a way to increase the max claims for the volume?

Use a non-single-node access_mode, but note that this may not be available for the kind of volume you're using (this is controlled by the third-party storage provider) and may be unsafe for the consuming application as well if it doesn't have a way of coordinating multi-writer access.
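
As an illustration only, a multi-writer capability in the volume spec might look like the sketch below; whether the ceph-csi RBD driver actually accepts this mode for a file-system volume is an assumption you would need to verify:

capability {
  # Assumed mode: requires the storage provider to support multi-node access,
  # and the consuming application to coordinate concurrent writers safely.
  access_mode     = "multi-node-multi-writer"
  attachment_mode = "file-system"
}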

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 26, 2022