cannot restart jobs with attached CSI volumes #17756

Closed · josh-m-sharpe opened this issue Jun 28, 2023 · 28 comments

Comments

@josh-m-sharpe commented Jun 28, 2023

Nomad version

Output from nomad version
1.5.6

Operating system and Environment details

amazonlinux

Issue

I started a job that had attached CSI volumes. It started successfully, did its thing for a long while, and was generally OK. Then I started testing our process for updating the Nomad client host machines, which is roughly: double the number of client nodes, mark the old nodes ineligible, nomad job restart the relevant jobs, then drain and shut down the old nodes.
This failed at the restart step because the CSI volumes were never (correctly?) released and made available to the new nodes.
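
For reference, the rotation workflow boils down to roughly these commands (a rough sketch; the node ID is a placeholder pulled from later in this thread, and our real tooling wraps these steps):

# bring up the replacement client nodes first (e.g. by doubling the ASG), then:

# 1. mark each old client ineligible for new placements
nomad node eligibility -disable 01c25044

# 2. reschedule the job's allocations onto the new clients
NOMAD_VAR_image_tag=$(latest_tag) nomad job restart -reschedule -namespace=elasticsearch elasticsearch

# 3. once everything is running on the new clients, drain and shut down the old ones
nomad node drain -enable -yes 01c25044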

From the original host's /var/log/messages:

"rpc error: could not detach from controller: controller detach volume: failed to find clients running controller plugin "aws-ebs0"" rpc=CSIVolume.Unpublish server=10.20.31.119:4647

So then I made a variety of attempts to forcibly detach the volume:

$ nomad volume status --namespace=elasticsearch elasticsearch_data[0]
ID                   = elasticsearch_data[0]
Name                 = elasticsearch_data[0]
External ID          = vol-0f05174864a181762
Plugin ID            = aws-ebs0
Provider             = ebs.csi.aws.com
Version              = v1.19.0
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 6
Nodes Expected       = 6
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = fs_type: ext4
Namespace            = elasticsearch

Topologies
Topology  Segments
00        topology.ebs.csi.aws.com/zone=us-west-2a

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
d0fa56b2  01c25044  data        5        stop     complete  43m ago  29m29s ago
$ nomad volume detach --namespace=elasticsearch elasticsearch_data[0] d0fa56b2
Error detaching volume: Unexpected response code: 500 (rpc error: Unknown allocation)
$ nomad volume detach --namespace=elasticsearch elasticsearch_data[0] 01c25044
Error detaching volume: Unexpected response code: 500 (Unknown allocation)
$ nomad volume detach --namespace=elasticsearch elasticsearch_data\[0\] d0fa56b2
Error detaching volume: Unexpected response code: 500 (rpc error: Unknown allocation)
$ nomad volume detach --namespace=elasticsearch elasticsearch_data\[0\] 01c25044
Error detaching volume: Unexpected response code: 500 (Unknown allocation)

Along the way, after trying to start the job again, one of the new hosts looks like it's allocating a container:
[Screenshot: Cluster Topology - Nomad]

[Screenshot: Client ip-10-20-11-222.us-west-2.compute.internal - Nomad]

But this is not remotely accurate. Docker isn't doing anything, there's seemingly no active process running on the host, or even attempting to run. Allocation logs aren't showing any movement.

So... then I drained that node, and when that didn't clear the pending allocation, I force drained it, and it's still there. It's a liar because it says it's done and it's not:

[Screenshot: Client ip-10-20-11-222.us-west-2.compute.internal - Nomad]

I've filed a similar bug before, which was said to be a UI issue. It seems to me there's a missing piece of logic somewhere that asserts there are in fact no allocations running before a drain is considered done. This seems like more than a UI bug.

Reproduction steps

Expected Result

Actual Result

Job file (if appropriate)

variable "image_tag" {
  type = string
}

job "elasticsearch" {
  datacenters = ["es1"]
  type = "service"

  migrate {
    health_check     = "checks"
    healthy_deadline = "3m"
    max_parallel     = 1
    min_healthy_time = "5s"
  }

  update {
    health_check     = "checks"
    auto_revert      = true
    healthy_deadline = "3m"
    max_parallel     = 1
    min_healthy_time = "5s"
  }

  group "master" {
    count = 3

    constraint {
      distinct_hosts = true
    }

    spread {
      attribute = "$${attr.platform.aws.placement.availability-zone}"
    }

    network {
      port "http" { }
      port "transport" { }
    }

    volume "master" {
      type            = "csi"
      source          = "elasticsearch_master"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"
      per_alloc       = true

      mount_options {
        fs_type     = "ext4"
      }
    }

    task "elasticsearch_master_init" {
      lifecycle {
        hook = "prestart"
        sidecar = false
      }

      driver = "docker"
      config {
        image   = "${ account_id }.dkr.ecr.${ region }.amazonaws.com/${ env }-elasticsearch:$${ var.image_tag }"
        command = "chown"
        args = ["-R", "1000:1000", "/usr/share/elasticsearch/data"]
      }

      volume_mount {
        volume      = "master"
        destination = "/usr/share/elasticsearch/data"
      }

      resources {
        cpu = 250
        memory = 250
        # cpu = 500
        # memory = 2048
      }
    }

    task "elasticsearch_master" {
      driver = "docker"

      config {
        # image = "nginx:latest"
        # auth {
        #   server_address = "http://www.dockerhub.com"
        # }

        image = "535528833958.dkr.ecr.us-west-2.amazonaws.com/ninja-elasticsearch:7.10.2"
        ports = ["http", "transport"]
        args = [
          "elasticsearch",
          "-Ecluster.initial_master_nodes=elasticsearch-master-0,elasticsearch-master-1,elasticsearch-master-2",
          "-Ecluster.name=my-es-cluster",
          "-Enode.name=elasticsearch-master-$${NOMAD_ALLOC_INDEX}",
          "-Enode.roles=master",
          "-Ediscovery.seed_providers=file",
          "-Enetwork.host=0.0.0.0",
          "-Enetwork.publish_host=$${NOMAD_IP_http}",
          "-Ehttp.publish_port=$${NOMAD_HOST_PORT_http}",
          "-Ehttp.port=$${NOMAD_PORT_http}",
          "-Etransport.publish_port=$${NOMAD_HOST_PORT_transport}",
          "-Etransport.tcp.port=$${NOMAD_PORT_transport}",
        ]
      }

      volume_mount {
        volume      = "master"
        destination = "/usr/share/elasticsearch/data/"
      }

      resources {
        cpu = 1500
        memory = 1500
        # cpu = 500
        # memory = 2048
      }
      template {
        data = <<EOF
{{- range service "elasticsearch-master" -}}
{{ .Address }}:{{ .Port }}
{{ end }}
EOF
        destination = "local/unicast_hosts.txt"
        change_mode = "noop"
      }

      service {
        name = "elasticsearch-master"
        port = "transport"

        check {
          type = "tcp"
          port = "transport"
          interval = "10s"
          timeout = "2s"
        }
      }
    }
  }
  group "data" {
    count = 3

     spread {
       attribute = "$${attr.platform.aws.placement.availability-zone}"
     }

     network {
       port "http" {
         static = 9200
       }
       port "transport" { }
     }

     volume "data" {
       type            = "csi"
       source          = "elasticsearch_data"
       attachment_mode = "file-system"
       access_mode     = "single-node-writer"
       per_alloc       = true

       mount_options {
         fs_type     = "ext4"
       }
     }

     task "elasticsearch_data_init" {
       lifecycle {
         hook = "prestart"
         sidecar = false
       }

       driver = "docker"
       config {
         image = "535528833958.dkr.ecr.us-west-2.amazonaws.com/ninja-elasticsearch:7.10.2"
         command = "chown"
         args = ["-R", "1000:1000", "/usr/share/elasticsearch/data"]

       }

       volume_mount {
         volume      = "data"
         destination = "/usr/share/elasticsearch/data"
       }

       resources {
         cpu = 500
         memory = 128
       }
     }

    task "elasticsearch" {
      driver = "docker"

      # constraint {
      #   distinct_hosts = true
      # }

      config {
        # image = "nginx:latest"
        # auth {
        #   server_address = "http://www.dockerhub.com"
        # }
        image = "535528833958.dkr.ecr.us-west-2.amazonaws.com/ninja-elasticsearch:7.10.2"
        ports = ["http", "transport"]
        # TODO: initial_master_nodes should be removed after the initial bootstrap
        args = [
          "elasticsearch",
          "-Ecluster.initial_master_nodes=elasticsearch-master-0,elasticsearch-master-1,elasticsearch-master-2",
          "-Ecluster.name=my-es-cluster",
          "-Enode.name=elasticsearch-data-$${NOMAD_ALLOC_INDEX}",
          "-Enode.roles=data",
          "-Ediscovery.seed_providers=file",
          "-Enetwork.host=0.0.0.0",
          "-Enetwork.publish_host=$${NOMAD_IP_http}",
          "-Ehttp.publish_port=$${NOMAD_HOST_PORT_http}",
          "-Ehttp.port=$${NOMAD_PORT_http}",
          "-Etransport.publish_port=$${NOMAD_HOST_PORT_transport}",
          "-Etransport.tcp.port=$${NOMAD_PORT_transport}",
        ]
      }

      volume_mount {
        volume      = "data"
        destination = "/usr/share/elasticsearch/data"
      }

      resources {
        cpu = 1500
        memory = 1500
      }

      template {
        data = <<EOF
{{- range service "elasticsearch-master" -}}
{{ .Address }}:{{ .Port }}
{{ end }}
EOF
        destination = "local/unicast_hosts.txt"
        change_mode = "noop"
      }

      service {
        name = "elasticsearch-data"
        port = "transport"

        check {
          type = "tcp"
          port = "transport"
          interval = "10s"
          timeout = "2s"
        }
      }
    }
  }
}

@josh-m-sharpe (Author)

Okay, since the OP, I've terminated all my Elasticsearch instances and brought up new ones - this is obviously bad.
The job came up successfully after doing that.

I attempted to reschedule the job again:

  1. Doubled the number of nodes, and allowed them to join the cluster
  2. Marked the old nodes as ineligible
  3. Ran: NOMAD_VAR_image_tag=$(latest_tag) nomad job restart -reschedule -namespace=elasticsearch elasticsearch

To which, I see this:

==> 2023-06-28T14:45:42Z: Restarting 6 allocations
    2023-06-28T14:45:42Z: Rescheduling allocation "159ecf99" for group "master"
==> 2023-06-28T14:45:42Z: 1 error occurred while restarting job:
                               * Error restarting allocation "159ecf99": Evaluation "14cb8fdb" has placement failures for group "master":
                                   * Constraint "CSI volume elasticsearch_master[1] is unschedulable or is read-only": 1 nodes excluded by filter
                                   * Constraint "did not meet topology requirement": 2 nodes excluded by filter
==> 2023-06-28T14:45:42Z: The errors above are likely to happen again.
                               Ignore them anyway and proceed with the next batch? [y/N] n
==> 2023-06-28T14:46:19Z: Job restart canceled

1 error occurred while restarting job:
* Error restarting allocation "159ecf99": Evaluation "14cb8fdb" has placement failures for group "master":
    * Constraint "CSI volume elasticsearch_master[1] is unschedulable or is read-only": 1 nodes excluded by filter
    * Constraint "did not meet topology requirement": 2 nodes excluded by filter

Last time I pressed "Y". This time I pressed "n".

Initially it seemed like it did nothing, but it seems that one instance did allocate successfully on a new node.

Uh... Yay?

Why does this "failure" prompt not attempt to roll the job back? Now I'm in a pretty weird state... the job is sorta half deployed. Two of the three instances are still on their old host and not migrated.

@josh-m-sharpe (Author)

After all of that, I re-ran the exact same command and it rotated the other 5 instances, seemingly just fine:

./bin/elasticsearch.sh restart
==> 2023-06-28T14:51:44Z: Restarting 6 allocations
    2023-06-28T14:51:44Z: Rescheduling allocation "fe7d7d0d" for group "master"
    2023-06-28T14:51:45Z: Allocation "fe7d7d0d" replaced by "025ffada", waiting for "025ffada" to start running
    2023-06-28T14:52:02Z: Allocation "025ffada" is "running"
    2023-06-28T14:52:02Z: Rescheduling allocation "41578030" for group "master"
    2023-06-28T14:52:03Z: Allocation "41578030" replaced by "c13072ca", waiting for "c13072ca" to start running
    2023-06-28T14:52:43Z: Allocation "c13072ca" is "running"
    2023-06-28T14:52:43Z: Rescheduling allocation "6e1cb0c4" for group "master"
    2023-06-28T14:52:44Z: Allocation "6e1cb0c4" replaced by "31f708e8", waiting for "31f708e8" to start running
    2023-06-28T14:53:17Z: Allocation "31f708e8" is "running"
    2023-06-28T14:53:17Z: Rescheduling allocation "7c9e4a69" for group "data"
    2023-06-28T14:53:18Z: Allocation "7c9e4a69" replaced by "dd40414f", waiting for "dd40414f" to start running
    2023-06-28T14:53:35Z: Allocation "dd40414f" is "running"
    2023-06-28T14:53:35Z: Rescheduling allocation "a9994f41" for group "data"
    2023-06-28T14:53:35Z: Allocation "a9994f41" replaced by "9410a656", waiting for "9410a656" to start running
    2023-06-28T14:53:52Z: Allocation "9410a656" is "running"
    2023-06-28T14:53:52Z: Rescheduling allocation "f9eb270a" for group "data"
    2023-06-28T14:53:53Z: Allocation "f9eb270a" replaced by "5cd5cadb", waiting for "5cd5cadb" to start running
    2023-06-28T14:54:10Z: Allocation "5cd5cadb" is "running"
==> 2023-06-28T14:54:10Z: Job restart finished

Job restarted successfully!

@josh-m-sharpe (Author) commented Jun 29, 2023

Seems like restart -reschedule is specifically to blame here.

I've determined that if I individually nomad alloc stop <allocID>, it will "reschedule" itself to an eligible node. Works just fine. I can do that for all allocations, effectively re-implementing what restart -reschedule is supposed to be doing.

EDIT: this might not be true. It seemed to work once or twice, but now I see it hanging again.
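
For reference, the manual workaround above amounts to a loop like this (a sketch; the allocation IDs are placeholders lifted from elsewhere in this thread, and you'd normally pull the current ones from nomad job status):

# stop each allocation one at a time; the scheduler then replaces it on an eligible node
for alloc in d0fa56b2 159ecf99 fe7d7d0d; do
  nomad alloc stop -namespace=elasticsearch "$alloc"
done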

@josh-m-sharpe (Author)

@tgross not sure if I'm experiencing a similar issue to these, so possibly a regression?

#8080
#8232
#8609

@josh-m-sharpe josh-m-sharpe changed the title Cannot restart jobs with attached CSI volumes ; also Drain, and Force Drain are liars. cannot restart jobs with attached CSI volumes Jun 29, 2023
@lgfa29 (Contributor) commented Jul 13, 2023

Thanks for the report @josh-m-sharpe.

I'm still not sure what's going on, but given that you once had success re-running the command, I wonder if something in the allocation stop procedure is not properly waiting for the volume to detach before rescheduling.

Do you happen to have any server or client logs from around the time this problem happened? If they contain sensitive data, you can email them to nomad-oss-debug@hashicorp.com with the issue ID in the subject.

@josh-m-sharpe (Author)

@lgfa29 In retrospect, I suspect I never had success running restart -reschedule when a host change would have occurred. I think it only worked when the allocation would have been restarted on the same host (which is more or less the same as a vanilla restart).

I can get plenty of logs for you, but I think the important thing is the lack of logs indicating that Nomad is attempting to unmount the volume. Those logs don't exist. I can find logs all day long of the volumes attaching when they should, but I only see errors on the new hosts about unavailable volumes, and nothing about even attempting to unmount the volumes from the old hosts. Those errors are symptoms of the real problem.

So, I suspect nomad isn't even trying to unmount the volumes in this case.

I'd be more than happy to pair on this if you'd be down. I work ET hours usually.

@lgfa29 (Contributor) commented Jul 13, 2023

The volumes are unmounted and unpublished here:

for _, result := range c.volumeResults {
    wg.Add(1)
    // CSI RPCs can potentially take a long time. Split the work
    // into goroutines so that operators could potentially reuse
    // one of a set of volumes
    go func(result *volumePublishResult) {
        defer wg.Done()
        err := c.unmountImpl(result)
        if err != nil {
            // we can recover an unmount failure if the operator
            // brings the plugin back up, so retry every few minutes
            // but eventually give up. Don't block shutdown so that
            // we don't block shutting down the client in -dev mode
            go func(result *volumePublishResult) {
                err := c.unmountWithRetry(result)
                if err != nil {
                    c.logger.Error("volume could not be unmounted")
                }
                err = c.unpublish(result)
                if err != nil {
                    c.logger.Error("volume could not be unpublished")
                }
            }(result)
        }
        // we can't recover from this RPC error client-side; the
        // volume claim GC job will have to clean up for us once
        // the allocation is marked terminal
        errs <- c.unpublish(result)
    }(result)
}

There isn't a lot to log here, but you can see these messages at the TRACE level with something like nomad monitor -log-level=TRACE:

if ar.logger.IsTrace() {
    start := time.Now()
    ar.logger.Trace("running post-run hooks", "start", start)
    defer func() {
        end := time.Now()
        ar.logger.Trace("finished post-run hooks", "end", end, "duration", end.Sub(start))
    }()
}
for _, hook := range ar.runnerHooks {
    post, ok := hook.(interfaces.RunnerPostrunHook)
    if !ok {
        continue
    }
    name := post.Name()
    var start time.Time
    if ar.logger.IsTrace() {
        start = time.Now()
        ar.logger.Trace("running post-run hook", "name", name, "start", start)
    }
    if err := post.Postrun(); err != nil {
        return fmt.Errorf("hook %q failed: %v", name, err)
    }
    if ar.logger.IsTrace() {
        end := time.Now()
        ar.logger.Trace("finished post-run hooks", "name", name, "end", end, "duration", end.Sub(start))
    }
}
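
A quick way to capture just these trace lines while reproducing the restart (a sketch; the grep pattern is only a guess at useful keywords):

# stream agent logs at TRACE level and keep only the CSI / post-run hook lines
nomad monitor -log-level=TRACE | grep -Ei 'csi|post-run' | tee csi-trace.log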

Your CSI plugin may also have some log information, from both the node and controller plugins, but each CSI plugin behaves a little differently.

@lgfa29 lgfa29 added this to Needs Triage in Nomad - Community Issues Triage via automation Jul 13, 2023
@lgfa29 lgfa29 moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Jul 13, 2023
@josh-m-sharpe (Author)

Okay, maybe found the issue (full trace below)

Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.156Z [TRACE] nomad.csi_volume: controller unpublish: vol=elasticsearch_data[2]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.156Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=elasticsearch_data[2]
Jul 14 01:02:35 ip-10-20-11-253 nomad: error=
Jul 14 01:02:35 ip-10-20-11-253 nomad: | 1 error occurred:
Jul 14 01:02:35 ip-10-20-11-253 nomad: | \t* could not detach from controller: controller detach volume: failed to find clients running controller plugin "aws-ebs0"

...but I don't think that's true? I have a public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.19.0 allocation running on each node. They have this:

csi_plugin {
  id = "aws-ebs0"
  type = "node"
  mount_dir = "/csi"
}

Context is:

  • 3 nodes up and running
  • aws-ebs-controller job running (single allocation)
  • aws-ebs-nodes job running as system job. (3 allocations, 1 per node)
  • elasticsearch job deployed. volumes mounted. service working.
  • 3 new nodes started - identical to the first three. They join the cluster.
    • aws-ebs-nodes system job starts successfully on each of these
  • 3 old nodes marked as ineligible (manually, in the UI)
  • nomad job restart --reschedule elasticsearch

State of the cluster: this all looks correct to me. The allocation was moved to the correct new node, and the aws-ebs-nodes are all up and running.

[Screenshot: Cluster Topology - Nomad]

Full output from moment of restart --reschedule


Jul 14 01:02:34 ip-10-20-11-253 nomad: results=
Jul 14 01:02:34 ip-10-20-11-253 nomad: | Total changes: (place 1) (destructive 0) (inplace 0) (stop 1) (disconnect 0) (reconnect 0)
Jul 14 01:02:34 ip-10-20-11-253 nomad: | Desired Changes for "master": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 3) (canary 0)
Jul 14 01:02:34 ip-10-20-11-253 nomad: | Desired Changes for "data": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 1) (ignore 2) (canary 0)
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.261Z [TRACE] nomad: evaluating plan: plan="(eval c6caa965, job elasticsearch, NodeUpdates: (node[032dc10d] (e8a69b3f stop/evict)), NodeAllocations: (node[d0875000] (71ef8e63 elasticsearc
h.data[2] run)))"
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.270Z [DEBUG] http: request complete: method=GET path=/v1/job/elasticsearch/allocations?index=4277143 duration=29.658297072s
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.279Z [DEBUG] worker: submitted plan for evaluation: worker_id=22869eb0-daaf-b993-9e1c-0a2e84fb6bc7 eval_id=c6caa965-d767-6b0c-8e6e-7f04605004ca
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.279Z [DEBUG] worker.service_sched: setting eval status: eval_id=c6caa965-d767-6b0c-8e6e-7f04605004ca job_id=elasticsearch namespace=default worker_id=22869eb0-daaf-b993-9e1c-0a2e84fb6bc7
 status=complete
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.280Z [DEBUG] http: request complete: method=GET path=/v1/job/elasticsearch/summary?index=4277121 duration=59.121638416s
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.282Z [DEBUG] http: request complete: method=GET path="/v1/job/elasticsearch/evaluations?index=4277148&namespace=default" duration=11.60186ms
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.283Z [DEBUG] http: request complete: method=GET path="/v1/jobs?meta=true&namespace=%2A&index=4277143" duration=38.673864728s
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.283Z [DEBUG] http: request complete: method=GET path=/v1/job/elasticsearch/summary?index=4277121 duration=31.557499126s
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.287Z [DEBUG] worker: updated evaluation: worker_id=22869eb0-daaf-b993-9e1c-0a2e84fb6bc7 eval="<Eval \"c6caa965-d767-6b0c-8e6e-7f04605004ca\" JobID: \"elasticsearch\" Namespace: \"default
\">"
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.287Z [DEBUG] worker: ack evaluation: worker_id=22869eb0-daaf-b993-9e1c-0a2e84fb6bc7 eval_id=c6caa965-d767-6b0c-8e6e-7f04605004ca type=service namespace=default job_id=elasticsearch node_
id="" triggered_by=alloc-stop
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.287Z [TRACE] worker: changed workload status: worker_id=22869eb0-daaf-b993-9e1c-0a2e84fb6bc7 from=Scheduling to=WaitingToDequeue
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.291Z [DEBUG] http: request complete: method=GET path="/v1/job/elasticsearch/evaluations?index=4277149&namespace=default" duration=3.618303ms
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.628Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="271.922µs"
Jul 14 01:02:34 ip-10-20-11-253 nomad: 2023-07-14T01:02:34.874Z [DEBUG] nomad: memberlist: Stream connection from=10.20.31.119:42936
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.132Z [TRACE] nomad.csi_volume: checkpointing claim
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.138Z [TRACE] nomad.csi_volume: node unpublish: vol=elasticsearch_data[2]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.138Z [TRACE] nomad.csi_volume: checkpointing claim
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.139Z [TRACE] nomad.volumes_watcher: releasing unused volume claims: namespace=default volume_id=elasticsearch_master[2]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.139Z [TRACE] nomad.volumes_watcher: releasing unused volume claims: namespace=default volume_id=elasticsearch_data[2]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.139Z [TRACE] nomad.volumes_watcher: unpublishing volume: namespace=default volume_id=elasticsearch_data[2] alloc=e8a69b3f-f543-aafc-1379-bee1f73395c6
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.139Z [TRACE] nomad.volumes_watcher: releasing unused volume claims: namespace=default volume_id=elasticsearch_data[0]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.139Z [TRACE] nomad.volumes_watcher: releasing unused volume claims: namespace=default volume_id=elasticsearch_data[1]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.139Z [TRACE] nomad.volumes_watcher: releasing unused volume claims: namespace=default volume_id=elasticsearch_master[0]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.139Z [TRACE] nomad.volumes_watcher: releasing unused volume claims: namespace=default volume_id=elasticsearch_master[1]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.140Z [TRACE] nomad.csi_volume: checkpointing claim
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.145Z [TRACE] nomad.csi_volume: controller unpublish: vol=elasticsearch_data[2]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.146Z [TRACE] nomad.volumes_watcher: releasing unused volume claims: namespace=default volume_id=elasticsearch_master[2]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.146Z [TRACE] nomad.volumes_watcher: releasing unused volume claims: namespace=default volume_id=elasticsearch_data[1]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.146Z [TRACE] nomad.volumes_watcher: releasing unused volume claims: namespace=default volume_id=elasticsearch_data[0]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.146Z [TRACE] nomad.volumes_watcher: releasing unused volume claims: namespace=default volume_id=elasticsearch_master[0]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.146Z [TRACE] nomad.volumes_watcher: releasing unused volume claims: namespace=default volume_id=elasticsearch_master[1]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.149Z [TRACE] nomad.csi_volume: node unpublish: vol=elasticsearch_data[2]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.149Z [TRACE] nomad.csi_volume: checkpointing claim
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.150Z [TRACE] nomad.volumes_watcher: releasing unused volume claims: namespace=default volume_id=elasticsearch_data[1]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.150Z [TRACE] nomad.volumes_watcher: releasing unused volume claims: namespace=default volume_id=elasticsearch_data[0]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.156Z [TRACE] nomad.csi_volume: controller unpublish: vol=elasticsearch_data[2]
Jul 14 01:02:35 ip-10-20-11-253 nomad: 2023-07-14T01:02:35.156Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=elasticsearch_data[2]
Jul 14 01:02:35 ip-10-20-11-253 nomad: error=
Jul 14 01:02:35 ip-10-20-11-253 nomad: | 1 error occurred:
Jul 14 01:02:35 ip-10-20-11-253 nomad: | \t* could not detach from controller: controller detach volume: failed to find clients running controller plugin "aws-ebs0"

@lgfa29 (Contributor) commented Jul 14, 2023

Thanks for the extra info @josh-m-sharpe.

So it seems like the Nomad client is not able to communicate with the CSI controller plugin.

CSI has lots of moving parts, so let me try to explain what appears to be going on.

When the allocation is stopped, the Nomad client that is running it asks the CSI controller to detach the volume from that client, meaning that the EBS volume that was in use becomes available to be attached to another Nomad client.

The CSI node plugin (which is what you showed in the plugin snippet and the Topology view screenshot) is only responsible for unmounting the disk within the Nomad client filesystem, and that seems to be happening just fine.

From your logs, the problem is that the Nomad client is failing to communicate with the CSI controller plugin, so the EBS volume stays attached to the current client, and when the new allocation tries to run on a different client it fails to attach the volume.

So a good next step would be to make sure your CSI controller is running properly and that all your Nomad clients are able to communicate with it.

You can check the logs for the CSI controller allocation and look for calls to Probe/GetCapabilities, for example, which would indicate the clients are able to heartbeat.
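
A couple of commands that might help confirm the controller side (a sketch; the plugin ID is the one from this thread, and the controller allocation ID is a placeholder to fill in):

# check whether Nomad currently sees a healthy controller for this plugin
nomad plugin status -type=csi aws-ebs0

# follow the controller allocation's logs and watch for heartbeat calls
# (add -stderr if your driver logs there instead of stdout)
nomad alloc logs -f <controller-alloc-id> | grep -E 'Probe|GetCapabilities'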

@josh-m-sharpe (Author)

@lgfa29 Unfortunately, I'm not seeing any errors from the controller (or node) allocations. I've got a lot for you here:

Four log dumps down below, a quick summary:

  1. Controller logs when the job is initially started
  • this shows a lot (there are 6 allocations, and each has a volume)
  • looks correct to me
  2. Controller logs when restart is issued
  • seems all correct to me, each allocation gracefully restarted
  • looks like the volume was detached then reattached even though the allocation didn't move hosts. unnecessary, but fine
  3. Controller logs when I launched 3 new Nomad Clients, with system ebs-node allocations. So, we see ControllerGetCapabilities log lines. Is this the heartbeat you were referring to?
  4. Logs when restart -reschedule was issued
  • I only saw activity in the ebs-node allocation on the same host as the deallocated elasticsearch allocation (that's a mouthful).
  • The controller didn't show anything at all.
  • This looks fine to me. The volume has been detached. Great.
  • I did some poking around the directories in /mnt/nomad/client/csi/node/aws-ebs0/per-alloc/<allocid>. Looks like the volume has actually been detached. (I'm not sure how to systematically list these mounts; one way is sketched below.) Other hosts show real elasticsearch data in there indicating the volumes are still attached (this is expected).
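
One way to list these CSI mounts systematically on a client (a sketch using standard Linux tooling; the path is the per-alloc directory quoted above):

# list anything currently mounted under the Nomad client's CSI per-alloc path
mount | grep /mnt/nomad/client/csi

# or with findmnt, which also shows the source device and filesystem type
findmnt -l | grep /mnt/nomad/client/csi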

So, the ebs-node reflects that the volume has been detached. It actually has been detached. Yet, the controller never got the message? No logs, nothing. :(

1. controller logs when job is started

I0714 18:09:07.865079       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:09:37.866728       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:09:40.796202       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-044fc34fc407eebc0","node_id":"i-0a10195e48ad25a06","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:09:40.797508       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-0f05174864a181762","node_id":"i-0a10195e48ad25a06","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:09:40.806623       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-03d6988dc1ff6d092","node_id":"i-0cfb3d2a879c5585a","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:09:40.807165       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-0761082735e8a6b3c","node_id":"i-042cef0670c6068a3","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:09:40.809608       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-0b3e474666422c1e0","node_id":"i-042cef0670c6068a3","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:09:40.813503       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-08a4b1dc453dae9d0","node_id":"i-0cfb3d2a879c5585a","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:09:47.651105       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-03d6988dc1ff6d092" nodeID="i-0cfb3d2a879c5585a" resp=<
        {
          AttachTime: 2023-07-14 18:09:47.626 +0000 UTC,
          Device: "/dev/xvdab",
          InstanceId: "i-0cfb3d2a879c5585a",
          State: "attaching",
          VolumeId: "vol-03d6988dc1ff6d092"
        }
 >
I0714 18:09:47.658794       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-044fc34fc407eebc0" nodeID="i-0a10195e48ad25a06" resp=<
        {
          AttachTime: 2023-07-14 18:09:47.638 +0000 UTC,
          Device: "/dev/xvdab",
          InstanceId: "i-0a10195e48ad25a06",
          State: "attaching",
          VolumeId: "vol-044fc34fc407eebc0"
        }
 >
I0714 18:09:47.661191       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-0b3e474666422c1e0" nodeID="i-042cef0670c6068a3" resp=<
        {
          AttachTime: 2023-07-14 18:09:47.64 +0000 UTC,
          Device: "/dev/xvdaa",
          InstanceId: "i-042cef0670c6068a3",
          State: "attaching",
          VolumeId: "vol-0b3e474666422c1e0"
        }
 >
I0714 18:09:47.668082       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-0f05174864a181762" nodeID="i-0a10195e48ad25a06" resp=<
        {
          AttachTime: 2023-07-14 18:09:47.646 +0000 UTC,
          Device: "/dev/xvdaa",
          InstanceId: "i-0a10195e48ad25a06",
          State: "attaching",
          VolumeId: "vol-0f05174864a181762"
        }
 >
I0714 18:09:47.689215       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-0761082735e8a6b3c" nodeID="i-042cef0670c6068a3" resp=<
        {
          AttachTime: 2023-07-14 18:09:47.668 +0000 UTC,
          Device: "/dev/xvdab",
          InstanceId: "i-042cef0670c6068a3",
          State: "attaching",
          VolumeId: "vol-0761082735e8a6b3c"
        }
 >
I0714 18:09:47.709309       1 cloud.go:713] "Waiting for volume state" volumeID="vol-03d6988dc1ff6d092" actual="attaching" desired="attached"
I0714 18:09:47.711421       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-08a4b1dc453dae9d0" nodeID="i-0cfb3d2a879c5585a" resp=<
        {
          AttachTime: 2023-07-14 18:09:47.69 +0000 UTC,
          Device: "/dev/xvdaa",
          InstanceId: "i-0cfb3d2a879c5585a",
          State: "attaching",
          VolumeId: "vol-08a4b1dc453dae9d0"
        }
 >
I0714 18:09:47.720761       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0b3e474666422c1e0" actual="attaching" desired="attached"
I0714 18:09:47.727059       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0f05174864a181762" actual="attaching" desired="attached"
I0714 18:09:47.737448       1 cloud.go:713] "Waiting for volume state" volumeID="vol-044fc34fc407eebc0" actual="attaching" desired="attached"
I0714 18:09:47.744425       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0761082735e8a6b3c" actual="attaching" desired="attached"
I0714 18:09:47.767886       1 cloud.go:713] "Waiting for volume state" volumeID="vol-08a4b1dc453dae9d0" actual="attaching" desired="attached"
I0714 18:09:48.769086       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdab" volume="vol-03d6988dc1ff6d092"
I0714 18:09:48.769110       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-03d6988dc1ff6d092" nodeID="i-0cfb3d2a879c5585a" devicePath="/dev/xvdab"
I0714 18:09:48.780733       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdaa" volume="vol-0b3e474666422c1e0"
I0714 18:09:48.780758       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-0b3e474666422c1e0" nodeID="i-042cef0670c6068a3" devicePath="/dev/xvdaa"
I0714 18:09:48.812582       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdaa" volume="vol-0f05174864a181762"
I0714 18:09:48.812618       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-0f05174864a181762" nodeID="i-0a10195e48ad25a06" devicePath="/dev/xvdaa"
I0714 18:09:48.820840       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdab" volume="vol-044fc34fc407eebc0"
I0714 18:09:48.820862       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-044fc34fc407eebc0" nodeID="i-0a10195e48ad25a06" devicePath="/dev/xvdab"
I0714 18:09:48.828449       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdaa" volume="vol-08a4b1dc453dae9d0"
I0714 18:09:48.828470       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-08a4b1dc453dae9d0" nodeID="i-0cfb3d2a879c5585a" devicePath="/dev/xvdaa"
I0714 18:09:48.842723       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdab" volume="vol-0761082735e8a6b3c"
I0714 18:09:48.842748       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-0761082735e8a6b3c" nodeID="i-042cef0670c6068a3" devicePath="/dev/xvdab"
I0714 18:10:07.868349       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:10:37.870284       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 17:20:17.661042       1 driver.go:75] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.19.0"
I0714 17:20:17.661095       1 controller.go:85] "[Debug] Retrieving region from metadata service"
I0714 17:20:17.661109       1 metadata.go:85] "retrieving instance data from ec2 metadata"
I0714 17:20:20.790069       1 metadata.go:92] "ec2 metadata is available"
I0714 17:20:23.933777       1 metadata_ec2.go:25] "Retrieving EC2 instance identity metadata" regionFromSession=""
I0714 17:20:33.362740       1 driver.go:147] "Listening for connections" address="/csi/csi.sock"
I0714 17:20:37.741994       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 17:21:07.743250       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 17:21:37.743955       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 17:22:07.744814       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 17:22:37.745997       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 17:23:07.747114       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 17:23:37.748247       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 17:24:07.749027       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 17:24:37.750169       1 controller.go:446] "ControllerGetCapabilities: called" args={}

I0714 18:01:37.842664       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:02:07.843619       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:02:37.844510       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:03:07.851791       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:03:37.852799       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:04:07.853400       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:04:37.854771       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:05:07.855823       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:05:37.857439       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:06:07.858652       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:06:37.859354       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:07:07.860611       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:07:37.861831       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:08:07.863077       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:08:37.864056       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:09:07.865079       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:09:37.866728       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:09:40.796202       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-044fc34fc407eebc0","node_id":"i-0a10195e48ad25a06","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:09:40.797508       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-0f05174864a181762","node_id":"i-0a10195e48ad25a06","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:09:40.806623       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-03d6988dc1ff6d092","node_id":"i-0cfb3d2a879c5585a","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:09:40.807165       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-0761082735e8a6b3c","node_id":"i-042cef0670c6068a3","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:09:40.809608       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-0b3e474666422c1e0","node_id":"i-042cef0670c6068a3","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:09:40.813503       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-08a4b1dc453dae9d0","node_id":"i-0cfb3d2a879c5585a","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:09:47.651105       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-03d6988dc1ff6d092" nodeID="i-0cfb3d2a879c5585a" resp=<
        {
          AttachTime: 2023-07-14 18:09:47.626 +0000 UTC,
          Device: "/dev/xvdab",
          InstanceId: "i-0cfb3d2a879c5585a",
          State: "attaching",
          VolumeId: "vol-03d6988dc1ff6d092"
        }
 >
I0714 18:09:47.658794       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-044fc34fc407eebc0" nodeID="i-0a10195e48ad25a06" resp=<
        {
          AttachTime: 2023-07-14 18:09:47.638 +0000 UTC,
          Device: "/dev/xvdab",
          InstanceId: "i-0a10195e48ad25a06",
          State: "attaching",
          VolumeId: "vol-044fc34fc407eebc0"
        }
 >
I0714 18:09:47.661191       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-0b3e474666422c1e0" nodeID="i-042cef0670c6068a3" resp=<
        {
          AttachTime: 2023-07-14 18:09:47.64 +0000 UTC,
          Device: "/dev/xvdaa",
          InstanceId: "i-042cef0670c6068a3",
          State: "attaching",
          VolumeId: "vol-0b3e474666422c1e0"
        }
 >
I0714 18:09:47.668082       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-0f05174864a181762" nodeID="i-0a10195e48ad25a06" resp=<
        {
          AttachTime: 2023-07-14 18:09:47.646 +0000 UTC,
          Device: "/dev/xvdaa",
          InstanceId: "i-0a10195e48ad25a06",
          State: "attaching",
          VolumeId: "vol-0f05174864a181762"
        }
 >
I0714 18:09:47.689215       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-0761082735e8a6b3c" nodeID="i-042cef0670c6068a3" resp=<
        {
          AttachTime: 2023-07-14 18:09:47.668 +0000 UTC,
          Device: "/dev/xvdab",
          InstanceId: "i-042cef0670c6068a3",
          State: "attaching",
          VolumeId: "vol-0761082735e8a6b3c"
        }
 >
I0714 18:09:47.709309       1 cloud.go:713] "Waiting for volume state" volumeID="vol-03d6988dc1ff6d092" actual="attaching" desired="attached"
I0714 18:09:47.711421       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-08a4b1dc453dae9d0" nodeID="i-0cfb3d2a879c5585a" resp=<
        {
          AttachTime: 2023-07-14 18:09:47.69 +0000 UTC,
          Device: "/dev/xvdaa",
          InstanceId: "i-0cfb3d2a879c5585a",
          State: "attaching",
          VolumeId: "vol-08a4b1dc453dae9d0"
        }
 >
I0714 18:09:47.720761       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0b3e474666422c1e0" actual="attaching" desired="attached"
I0714 18:09:47.727059       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0f05174864a181762" actual="attaching" desired="attached"
I0714 18:09:47.737448       1 cloud.go:713] "Waiting for volume state" volumeID="vol-044fc34fc407eebc0" actual="attaching" desired="attached"
I0714 18:09:47.744425       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0761082735e8a6b3c" actual="attaching" desired="attached"
I0714 18:09:47.767886       1 cloud.go:713] "Waiting for volume state" volumeID="vol-08a4b1dc453dae9d0" actual="attaching" desired="attached"
I0714 18:09:48.769086       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdab" volume="vol-03d6988dc1ff6d092"
I0714 18:09:48.769110       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-03d6988dc1ff6d092" nodeID="i-0cfb3d2a879c5585a" devicePath="/dev/xvdab"
I0714 18:09:48.780733       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdaa" volume="vol-0b3e474666422c1e0"
I0714 18:09:48.780758       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-0b3e474666422c1e0" nodeID="i-042cef0670c6068a3" devicePath="/dev/xvdaa"
I0714 18:09:48.812582       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdaa" volume="vol-0f05174864a181762"
I0714 18:09:48.812618       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-0f05174864a181762" nodeID="i-0a10195e48ad25a06" devicePath="/dev/xvdaa"
I0714 18:09:48.820840       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdab" volume="vol-044fc34fc407eebc0"
I0714 18:09:48.820862       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-044fc34fc407eebc0" nodeID="i-0a10195e48ad25a06" devicePath="/dev/xvdab"
I0714 18:09:48.828449       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdaa" volume="vol-08a4b1dc453dae9d0"
I0714 18:09:48.828470       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-08a4b1dc453dae9d0" nodeID="i-0cfb3d2a879c5585a" devicePath="/dev/xvdaa"
I0714 18:09:48.842723       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdab" volume="vol-0761082735e8a6b3c"
I0714 18:09:48.842748       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-0761082735e8a6b3c" nodeID="i-042cef0670c6068a3" devicePath="/dev/xvdab"
I0714 18:10:07.868349       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:10:37.870284       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:11:07.878248       1 controller.go:446] "ControllerGetCapabilities: called" args={}

2. restart -reschedule (only 3 nodes present)

I0714 18:14:15.515063       1 controller.go:414] "ControllerUnpublishVolume: called" args={"volume_id":"vol-03d6988dc1ff6d092","node_id":"i-0cfb3d2a879c5585a"}
I0714 18:14:15.526981       1 controller.go:414] "ControllerUnpublishVolume: called" args={"volume_id":"vol-03d6988dc1ff6d092","node_id":"i-0cfb3d2a879c5585a"}
I0714 18:14:15.899704       1 cloud.go:713] "Waiting for volume state" volumeID="vol-03d6988dc1ff6d092" actual="detaching" desired="detached"
I0714 18:14:15.946905       1 cloud.go:713] "Waiting for volume state" volumeID="vol-03d6988dc1ff6d092" actual="detaching" desired="detached"
I0714 18:14:16.960525       1 cloud.go:713] "Waiting for volume state" volumeID="vol-03d6988dc1ff6d092" actual="detaching" desired="detached"
I0714 18:14:16.992994       1 cloud.go:713] "Waiting for volume state" volumeID="vol-03d6988dc1ff6d092" actual="detaching" desired="detached"
I0714 18:14:18.804885       1 cloud.go:713] "Waiting for volume state" volumeID="vol-03d6988dc1ff6d092" actual="detaching" desired="detached"
I0714 18:14:18.840825       1 cloud.go:713] "Waiting for volume state" volumeID="vol-03d6988dc1ff6d092" actual="detaching" desired="detached"
I0714 18:14:22.081861       1 cloud.go:713] "Waiting for volume state" volumeID="vol-03d6988dc1ff6d092" actual="detaching" desired="detached"
I0714 18:14:22.126833       1 cloud.go:713] "Waiting for volume state" volumeID="vol-03d6988dc1ff6d092" actual="detaching" desired="detached"
I0714 18:14:27.973383       1 controller.go:428] "[Debug] ControllerUnpublishVolume: detached from node" volumeID="vol-03d6988dc1ff6d092" nodeID="i-0cfb3d2a879c5585a"
I0714 18:14:28.019949       1 controller.go:428] "[Debug] ControllerUnpublishVolume: detached from node" volumeID="vol-03d6988dc1ff6d092" nodeID="i-0cfb3d2a879c5585a"
I0714 18:14:29.453578       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-03d6988dc1ff6d092","node_id":"i-0cfb3d2a879c5585a","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:14:29.938968       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-03d6988dc1ff6d092" nodeID="i-0cfb3d2a879c5585a" resp=<
        {
          AttachTime: 2023-07-14 18:14:29.915 +0000 UTC,
          Device: "/dev/xvdab",
          InstanceId: "i-0cfb3d2a879c5585a",
          State: "attaching",
          VolumeId: "vol-03d6988dc1ff6d092"
        }
 >
I0714 18:14:29.997816       1 cloud.go:713] "Waiting for volume state" volumeID="vol-03d6988dc1ff6d092" actual="attaching" desired="attached"
I0714 18:14:31.056486       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdab" volume="vol-03d6988dc1ff6d092"
I0714 18:14:31.056514       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-03d6988dc1ff6d092" nodeID="i-0cfb3d2a879c5585a" devicePath="/dev/xvdab"
I0714 18:14:32.235210       1 controller.go:414] "ControllerUnpublishVolume: called" args={"volume_id":"vol-0b3e474666422c1e0","node_id":"i-042cef0670c6068a3"}
I0714 18:14:32.242653       1 controller.go:414] "ControllerUnpublishVolume: called" args={"volume_id":"vol-0b3e474666422c1e0","node_id":"i-042cef0670c6068a3"}
I0714 18:14:32.611847       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0b3e474666422c1e0" actual="detaching" desired="detached"
I0714 18:14:32.653593       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0b3e474666422c1e0" actual="detaching" desired="detached"
I0714 18:14:33.675114       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0b3e474666422c1e0" actual="detaching" desired="detached"
I0714 18:14:33.701305       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0b3e474666422c1e0" actual="detaching" desired="detached"
I0714 18:14:35.519182       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0b3e474666422c1e0" actual="detaching" desired="detached"
I0714 18:14:35.547289       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0b3e474666422c1e0" actual="detaching" desired="detached"
I0714 18:14:37.886143       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:14:38.797404       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0b3e474666422c1e0" actual="detaching" desired="detached"
I0714 18:14:38.835765       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0b3e474666422c1e0" actual="detaching" desired="detached"
I0714 18:14:44.687510       1 controller.go:428] "[Debug] ControllerUnpublishVolume: detached from node" volumeID="vol-0b3e474666422c1e0" nodeID="i-042cef0670c6068a3"
I0714 18:14:44.718503       1 controller.go:428] "[Debug] ControllerUnpublishVolume: detached from node" volumeID="vol-0b3e474666422c1e0" nodeID="i-042cef0670c6068a3"
I0714 18:14:46.241479       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-0b3e474666422c1e0","node_id":"i-042cef0670c6068a3","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:14:46.738277       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-0b3e474666422c1e0" nodeID="i-042cef0670c6068a3" resp=<
        {
          AttachTime: 2023-07-14 18:14:46.715 +0000 UTC,
          Device: "/dev/xvdaa",
          InstanceId: "i-042cef0670c6068a3",
          State: "attaching",
          VolumeId: "vol-0b3e474666422c1e0"
        }
 >
I0714 18:14:46.804616       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0b3e474666422c1e0" actual="attaching" desired="attached"
I0714 18:14:47.873946       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdaa" volume="vol-0b3e474666422c1e0"
I0714 18:14:47.873971       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-0b3e474666422c1e0" nodeID="i-042cef0670c6068a3" devicePath="/dev/xvdaa"
I0714 18:14:49.484777       1 controller.go:414] "ControllerUnpublishVolume: called" args={"volume_id":"vol-08a4b1dc453dae9d0","node_id":"i-0cfb3d2a879c5585a"}
I0714 18:14:49.492555       1 controller.go:414] "ControllerUnpublishVolume: called" args={"volume_id":"vol-08a4b1dc453dae9d0","node_id":"i-0cfb3d2a879c5585a"}
I0714 18:14:49.813809       1 cloud.go:713] "Waiting for volume state" volumeID="vol-08a4b1dc453dae9d0" actual="detaching" desired="detached"
I0714 18:14:49.902282       1 cloud.go:713] "Waiting for volume state" volumeID="vol-08a4b1dc453dae9d0" actual="detaching" desired="detached"
I0714 18:14:50.871397       1 cloud.go:713] "Waiting for volume state" volumeID="vol-08a4b1dc453dae9d0" actual="detaching" desired="detached"
I0714 18:14:50.946581       1 cloud.go:713] "Waiting for volume state" volumeID="vol-08a4b1dc453dae9d0" actual="detaching" desired="detached"
I0714 18:14:52.712559       1 cloud.go:713] "Waiting for volume state" volumeID="vol-08a4b1dc453dae9d0" actual="detaching" desired="detached"
I0714 18:14:52.800590       1 cloud.go:713] "Waiting for volume state" volumeID="vol-08a4b1dc453dae9d0" actual="detaching" desired="detached"
I0714 18:14:56.014085       1 controller.go:428] "[Debug] ControllerUnpublishVolume: detached from node" volumeID="vol-08a4b1dc453dae9d0" nodeID="i-0cfb3d2a879c5585a"
I0714 18:14:56.088853       1 controller.go:428] "[Debug] ControllerUnpublishVolume: detached from node" volumeID="vol-08a4b1dc453dae9d0" nodeID="i-0cfb3d2a879c5585a"
I0714 18:15:03.483225       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-08a4b1dc453dae9d0","node_id":"i-0cfb3d2a879c5585a","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:15:04.038485       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-08a4b1dc453dae9d0" nodeID="i-0cfb3d2a879c5585a" resp=<
        {
          AttachTime: 2023-07-14 18:15:04.015 +0000 UTC,
          Device: "/dev/xvdaa",
          InstanceId: "i-0cfb3d2a879c5585a",
          State: "attaching",
          VolumeId: "vol-08a4b1dc453dae9d0"
        }
 >
I0714 18:15:04.101011       1 cloud.go:713] "Waiting for volume state" volumeID="vol-08a4b1dc453dae9d0" actual="attaching" desired="attached"
I0714 18:15:05.148816       1 cloud.go:713] "Waiting for volume state" volumeID="vol-08a4b1dc453dae9d0" actual="attaching" desired="attached"
I0714 18:15:07.004118       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdaa" volume="vol-08a4b1dc453dae9d0"
I0714 18:15:07.004390       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-08a4b1dc453dae9d0" nodeID="i-0cfb3d2a879c5585a" devicePath="/dev/xvdaa"
I0714 18:15:07.887498       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:15:09.118037       1 controller.go:414] "ControllerUnpublishVolume: called" args={"volume_id":"vol-0f05174864a181762","node_id":"i-0a10195e48ad25a06"}
I0714 18:15:09.130762       1 controller.go:414] "ControllerUnpublishVolume: called" args={"volume_id":"vol-0f05174864a181762","node_id":"i-0a10195e48ad25a06"}
I0714 18:15:09.520714       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0f05174864a181762" actual="detaching" desired="detached"
I0714 18:15:12.447611       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0f05174864a181762" actual="detaching" desired="detached"
I0714 18:15:15.751453       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0f05174864a181762" actual="detaching" desired="detached"
I0714 18:15:15.754763       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0f05174864a181762" actual="detaching" desired="detached"
I0714 18:15:21.632498       1 controller.go:428] "[Debug] ControllerUnpublishVolume: detached from node" volumeID="vol-0f05174864a181762" nodeID="i-0a10195e48ad25a06"
I0714 18:15:21.641099       1 controller.go:428] "[Debug] ControllerUnpublishVolume: detached from node" volumeID="vol-0f05174864a181762" nodeID="i-0a10195e48ad25a06"
I0714 18:15:23.117330       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-0f05174864a181762","node_id":"i-0a10195e48ad25a06","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:15:23.603643       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-0f05174864a181762" nodeID="i-0a10195e48ad25a06" resp=<
        {
          AttachTime: 2023-07-14 18:15:23.579 +0000 UTC,
          Device: "/dev/xvdaa",
          InstanceId: "i-0a10195e48ad25a06",
          State: "attaching",
          VolumeId: "vol-0f05174864a181762"
        }
 >
I0714 18:15:23.662401       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0f05174864a181762" actual="attaching" desired="attached"
I0714 18:15:24.722703       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdaa" volume="vol-0f05174864a181762"
I0714 18:15:24.722734       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-0f05174864a181762" nodeID="i-0a10195e48ad25a06" devicePath="/dev/xvdaa"
I0714 18:15:27.512148       1 controller.go:414] "ControllerUnpublishVolume: called" args={"volume_id":"vol-0761082735e8a6b3c","node_id":"i-042cef0670c6068a3"}
I0714 18:15:27.521194       1 controller.go:414] "ControllerUnpublishVolume: called" args={"volume_id":"vol-0761082735e8a6b3c","node_id":"i-042cef0670c6068a3"}
I0714 18:15:27.835589       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0761082735e8a6b3c" actual="detaching" desired="detached"
I0714 18:15:27.835639       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0761082735e8a6b3c" actual="detaching" desired="detached"
I0714 18:15:28.903324       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0761082735e8a6b3c" actual="detaching" desired="detached"
I0714 18:15:28.905555       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0761082735e8a6b3c" actual="detaching" desired="detached"
I0714 18:15:30.744804       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0761082735e8a6b3c" actual="detaching" desired="detached"
I0714 18:15:30.745481       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0761082735e8a6b3c" actual="detaching" desired="detached"
I0714 18:15:34.041999       1 controller.go:428] "[Debug] ControllerUnpublishVolume: detached from node" volumeID="vol-0761082735e8a6b3c" nodeID="i-042cef0670c6068a3"
I0714 18:15:34.047138       1 controller.go:428] "[Debug] ControllerUnpublishVolume: detached from node" volumeID="vol-0761082735e8a6b3c" nodeID="i-042cef0670c6068a3"
I0714 18:15:37.889214       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:15:41.510941       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-0761082735e8a6b3c","node_id":"i-042cef0670c6068a3","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:15:41.992936       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-0761082735e8a6b3c" nodeID="i-042cef0670c6068a3" resp=<
        {
          AttachTime: 2023-07-14 18:15:41.967 +0000 UTC,
          Device: "/dev/xvdab",
          InstanceId: "i-042cef0670c6068a3",
          State: "attaching",
          VolumeId: "vol-0761082735e8a6b3c"
        }
 >
I0714 18:15:42.052156       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0761082735e8a6b3c" actual="attaching" desired="attached"
I0714 18:15:43.114639       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdab" volume="vol-0761082735e8a6b3c"
I0714 18:15:43.114664       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-0761082735e8a6b3c" nodeID="i-042cef0670c6068a3" devicePath="/dev/xvdab"
I0714 18:15:44.549970       1 controller.go:414] "ControllerUnpublishVolume: called" args={"volume_id":"vol-044fc34fc407eebc0","node_id":"i-0a10195e48ad25a06"}
I0714 18:15:44.560409       1 controller.go:414] "ControllerUnpublishVolume: called" args={"volume_id":"vol-044fc34fc407eebc0","node_id":"i-0a10195e48ad25a06"}
I0714 18:15:44.877240       1 cloud.go:713] "Waiting for volume state" volumeID="vol-044fc34fc407eebc0" actual="detaching" desired="detached"
I0714 18:15:45.173582       1 cloud.go:713] "Waiting for volume state" volumeID="vol-044fc34fc407eebc0" actual="detaching" desired="detached"
I0714 18:15:45.928814       1 cloud.go:713] "Waiting for volume state" volumeID="vol-044fc34fc407eebc0" actual="detaching" desired="detached"
I0714 18:15:46.219850       1 cloud.go:713] "Waiting for volume state" volumeID="vol-044fc34fc407eebc0" actual="detaching" desired="detached"
I0714 18:15:47.773864       1 cloud.go:713] "Waiting for volume state" volumeID="vol-044fc34fc407eebc0" actual="detaching" desired="detached"
I0714 18:15:48.082675       1 cloud.go:713] "Waiting for volume state" volumeID="vol-044fc34fc407eebc0" actual="detaching" desired="detached"
I0714 18:15:51.064791       1 cloud.go:713] "Waiting for volume state" volumeID="vol-044fc34fc407eebc0" actual="detaching" desired="detached"
I0714 18:15:51.370422       1 cloud.go:713] "Waiting for volume state" volumeID="vol-044fc34fc407eebc0" actual="detaching" desired="detached"
I0714 18:15:56.961179       1 controller.go:428] "[Debug] ControllerUnpublishVolume: detached from node" volumeID="vol-044fc34fc407eebc0" nodeID="i-0a10195e48ad25a06"
I0714 18:15:57.240499       1 controller.go:428] "[Debug] ControllerUnpublishVolume: detached from node" volumeID="vol-044fc34fc407eebc0" nodeID="i-0a10195e48ad25a06"
I0714 18:15:58.547068       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-044fc34fc407eebc0","node_id":"i-0a10195e48ad25a06","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0714 18:15:59.007870       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-044fc34fc407eebc0" nodeID="i-0a10195e48ad25a06" resp=<
        {
          AttachTime: 2023-07-14 18:15:58.986 +0000 UTC,
          Device: "/dev/xvdab",
          InstanceId: "i-0a10195e48ad25a06",
          State: "attaching",
          VolumeId: "vol-044fc34fc407eebc0"
        }
 >
I0714 18:15:59.069494       1 cloud.go:713] "Waiting for volume state" volumeID="vol-044fc34fc407eebc0" actual="attaching" desired="attached"
I0714 18:16:00.127603       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdab" volume="vol-044fc34fc407eebc0"
I0714 18:16:00.128231       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-044fc34fc407eebc0" nodeID="i-0a10195e48ad25a06" devicePath="/dev/xvdab"
I0714 18:16:07.890742       1 controller.go:446] "ControllerGetCapabilities: called" args={}

3. Launching 3 replacement nodes
(starts 3 new ebs-node services -- which should "heartbeat" (???))

I0714 18:18:07.895836       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:18:37.897272       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0714 18:19:07.898347       1 controller.go:446] "ControllerGetCapabilities: called" args={}

4. restart -reschedule logs from ebs-node allocation on host where ES allocation was removed:
Nothing to report on from the controller

I0714 18:26:02.086456       1 node.go:447] "NodeUnpublishVolume: called" args={"volume_id":"vol-044fc34fc407eebc0","target_path":"/local/csi/per-alloc/5965243f-16ed-026b-a160-40b0a7797536/elasticsearch_master[0]/rw-file-system-single-node-writer"}
I0714 18:26:02.086491       1 node.go:465] "NodeUnpublishVolume: unmounting" target="/local/csi/per-alloc/5965243f-16ed-026b-a160-40b0a7797536/elasticsearch_master[0]/rw-file-system-single-node-writer"
I0714 18:26:02.086503       1 mount_linux.go:362] Unmounting /local/csi/per-alloc/5965243f-16ed-026b-a160-40b0a7797536/elasticsearch_master[0]/rw-file-system-single-node-writer
I0714 18:26:02.088403       1 node.go:461] "NodeUnPublishVolume: volume operation finished" volumeId="vol-044fc34fc407eebc0"
I0714 18:26:02.088435       1 inflight.go:74] "Node Service: volume operation finished" key="vol-044fc34fc407eebc0"
I0714 18:26:02.089805       1 node.go:275] "NodeUnstageVolume: called" args={"volume_id":"vol-044fc34fc407eebc0","staging_target_path":"/local/csi/staging/elasticsearch_master[0]/rw-file-system-single-node-writer"}
I0714 18:26:02.090020       1 node.go:315] "NodeUnstageVolume: unmounting" target="/local/csi/staging/elasticsearch_master[0]/rw-file-system-single-node-writer"
I0714 18:26:02.090028       1 mount_linux.go:362] Unmounting /local/csi/staging/elasticsearch_master[0]/rw-file-system-single-node-writer
I0714 18:26:02.098219       1 node.go:320] "NodeUnStageVolume: successfully unstaged volume" volumeID="vol-044fc34fc407eebc0" target="/local/csi/staging/elasticsearch_master[0]/rw-file-system-single-node-writer"
I0714 18:26:02.098237       1 node.go:290] "NodeUnStageVolume: volume operation finished" volumeID="vol-044fc34fc407eebc0"
I0714 18:26:02.098247       1 inflight.go:74] "Node Service: volume operation finished" key="vol-044fc34fc407eebc0"
I0714 18:26:08.244203       1 node.go:538] "NodeGetCapabilities: called" args={}

@josh-m-sharpe
Author

So stepping back to what I'm attempting: I'm trying to replace the nomad hosts by standing up new ones (in the same set of availability zones) and then rotating allocations onto them. When I start the new nodes, there are two ebs-node allocations in the same availability zone (one on each host). I'm starting to wonder (without evidence) if that's a problem. They are both responsible for the same EBS volumes. Is the controller maybe getting confused as to which ebs-node allocation is supposed to be doing what?

This is my ebs-node job - is there something missing?

job "plugin-aws-ebs-nodes"
  datacenters = ["dc1"]

  # you can run node plugins as service jobs as well, but this ensures
  # that all nodes in the DC have a copy.
  type = "system"

  group "nodes" {

    constraint {
      attribute = "${node.class}"
      value     = "es"
    }

    task "plugin" {
      driver = "docker"

      config {
        image = "public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.19.0"
        # image = "amazon/aws-ebs-csi-driver:v0.10.1"
        auth {
          server_address = "http://www.dockerhub.com"
        }

        args = [
          "node",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

      csi_plugin {
        id        = "aws-ebs0"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

@lgfa29
Contributor

lgfa29 commented Jul 14, 2023

So, we see ControllerGetCapabilities log lines. Is this the heartbeat you were referring to?

Yup! That's the one. Do you see them just once though? They should be called periodically by all Nomad clients:

const managerFingerprintInterval = 30 * time.Second

Nothing to report on from the controller

That's the weird part, I would expect there to be RPC calls to detach the volume 🤔

Is the plugin reported as healthy in nomad plugin status? Maybe wait a few minutes to allow for some fingerprint cycles.

The specific error that is happening is coming from here:

for clientID, controller := range plugin.Controllers {
	if !controller.IsController() {
		// we don't have separate types for CSIInfo depending on
		// whether it's a controller or node. this error shouldn't
		// make it to production but is to aid developers during
		// development
		continue
	}
	node, err := getNodeForRpc(snap, clientID)
	if err == nil && node != nil && node.Ready() {
		clientIDs = append(clientIDs, clientID)
	}
}
if len(clientIDs) == 0 {
	return nil, fmt.Errorf("failed to find clients running controller plugin %q", pluginID)
}

So either:

  1. There are no healthy controllers for the plugin, in which case plugin.Controllers is empty and the loop never happens.
  2. getNodeForRpc is failing, and I noticed we end up ignoring the error, which is not great, but looking at that function's details I think this is unlikely to happen (snap.NodeByID is called in a bunch of places).
  3. The node running the controller plugin is not "ready", meaning its status is not ready, or it's draining, or ineligible.

    nomad/nomad/structs/structs.go (lines 2204 to 2207 in 1e7726c):

    // Ready returns true if the node is ready for running allocations
    func (n *Node) Ready() bool {
    	return n.Status == NodeStatusReady && n.DrainStrategy == nil && n.SchedulingEligibility == NodeSchedulingEligible
    }

So before you issue the job restart command could you make sure that:

  1. Your plugin is reporting that the controller is healthy.
  2. You see periodic heartbeat requests for all Nomad clients in the controller log.
  3. All your nodes are ready, eligible, and not draining.

I'm starting to wonder (without evidence) if that's a problem. They are both responsible for the same EBS volumes. Is the controller maybe getting confused as to which ebs-node allocation is supposed to be doing what?

I'm not too familiar with the internals of the EBS plugin, so I can't comment much about it. Your deployment seems OK to me. There are some newer image versions available, so it may be worth updating the job to one of those?

And what does your controller plugin job look like?
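For reference, a controller deployment for this driver usually looks something like the sketch below; the job name, image version, and resource sizing here are assumptions based on the common pattern, not the reporter's actual spec:

job "plugin-aws-ebs-controller" {
  datacenters = ["dc1"]

  # the controller runs as a regular service job; more than one
  # instance can be run for redundancy
  group "controller" {

    task "plugin" {
      driver = "docker"

      config {
        image = "public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.19.0"

        args = [
          "controller",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "aws-ebs0"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}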

@josh-m-sharpe
Author

Before I go do all of that, I take issue with this:

  1. All your nodes are ready, eligible, and not draining.

I am explicitly setting the node where the volume is currently attached (and where an allocation is actually using it) to ineligible before executing restart -reschedule. This is intentional. The goal here is to force the allocations to move to a new host/node.

To be crystal clear, there is absolutely a different node, marked as eligible, with available resources, in the same availability zone, that should absolutely take the volume and allocation.

If you've gleaned from the codez that ALL nodes must be eligible, then maybe we have our problem?

@lgfa29
Contributor

lgfa29 commented Jul 15, 2023

Oooh that's a good point. Would you be able to leave the node running the CSI controller plugin as eligible and see if that fixes the problem? I don't think eligibility needs to be taken into account here, so maybe that's what we need to fix.

To be crystal clear, there is absolutely a different node, marked as eligible, with available resources, in the same availability zone, that should absolutely take the volume and allocation.

Yes, but the new nodes don't have the CSI controller; that's the key component that is missing.

The client that is running the controller is now ineligible and I think Nomad is incorrectly ignoring it.

So if leaving the controller eligible fixes the problem we may indeed have found our problem.

@josh-m-sharpe
Author

josh-m-sharpe commented Jul 15, 2023

Okay, well, the controller has to be moved in this situation, so if I restart -reschedule that first, let's see what happens:

This looks like it worked. Confirmed the controller alloc moved to a new host.

$ nomad job restart -reschedule plugin-aws-ebs-controller
==> 2023-07-15T13:47:06Z: Restarting 1 allocation
    2023-07-15T13:47:06Z: Rescheduling allocation "8c1dbac3" for group "controller"
    2023-07-15T13:47:06Z: Allocation "8c1dbac3" replaced by "4dafacab", waiting for "4dafacab" to start running
    2023-07-15T13:47:08Z: Allocation "4dafacab" is "running"
==> 2023-07-15T13:47:08Z: Job restart finished
Job restarted successfully!

Okay, let's reschedule elasticsearch:
This produced an immediate error (warning?), which I pressed 'y' to, and then it successfully moved 5 of the 6 allocations. Why not the 6th?

$ nomad job restart -reschedule elasticsearch
==> 2023-07-15T13:47:18Z: Restarting 6 allocations
    2023-07-15T13:47:18Z: Rescheduling allocation "29433fe7" for group "data"
==> 2023-07-15T13:47:18Z: 1 error occurred while restarting job:
                               * Error restarting allocation "29433fe7": Evaluation "58174358" has placement failures for group "data":
                                   * Class "og": 3 nodes excluded by filter
                                   * Class "es": 3 nodes excluded by filter
                                   * Constraint "${node.class} = es": 3 nodes excluded by filter
                                   * Constraint "CSI volume elasticsearch_data[1] is unschedulable or is read-only": 1 nodes excluded by filter
                                   * Constraint "did not meet topology requirement": 2 nodes excluded by filter
==> 2023-07-15T13:47:18Z: The errors above are likely to happen again.
                               Ignore them anyway and proceed with the next batch? [y/N] y
    2023-07-15T13:47:48Z: Rescheduling allocation "62a40d8e" for group "data"
    2023-07-15T13:47:48Z: Allocation "62a40d8e" replaced by "e2b528f6", waiting for "e2b528f6" to start running
    2023-07-15T13:48:23Z: Allocation "e2b528f6" is "running"
    2023-07-15T13:48:23Z: Rescheduling allocation "b9b8d2e6" for group "data"
    2023-07-15T13:48:23Z: Allocation "b9b8d2e6" replaced by "bae291ce", waiting for "bae291ce" to start running
    2023-07-15T13:48:56Z: Allocation "bae291ce" is "running"
    2023-07-15T13:48:56Z: Rescheduling allocation "dcee0c0f" for group "master"
    2023-07-15T13:48:56Z: Allocation "dcee0c0f" replaced by "dc2c11f4", waiting for "dc2c11f4" to start running
    2023-07-15T13:49:14Z: Allocation "dc2c11f4" is "running"
    2023-07-15T13:49:14Z: Rescheduling allocation "e496576c" for group "master"
    2023-07-15T13:49:14Z: Allocation "e496576c" replaced by "259c4785", waiting for "259c4785" to start running
    2023-07-15T13:49:31Z: Allocation "259c4785" is "running"
    2023-07-15T13:49:31Z: Rescheduling allocation "f0954d73" for group "master"
    2023-07-15T13:49:31Z: Allocation "f0954d73" replaced by "2332b110", waiting for "2332b110" to start running
    2023-07-15T13:50:04Z: Allocation "2332b110" is "running"
==> 2023-07-15T13:50:04Z: Job restart finished with errors

1 error occurred while restarting job:
* Error restarting allocation "29433fe7": Evaluation "58174358" has placement failures for group "data":
    * Class "og": 3 nodes excluded by filter
    * Class "es": 3 nodes excluded by filter
    * Constraint "${node.class} = es": 3 nodes excluded by filter
    * Constraint "CSI volume elasticsearch_data[1] is unschedulable or is read-only": 1 nodes excluded by filter
    * Constraint "did not meet topology requirement": 2 nodes excluded by filter

So the 6th, non-restarted alloc was "moved" but is pending on the new host, so let's just try restarting everything again:

$ nomad job restart elasticsearch
==> 2023-07-15T13:51:05Z: Restarting 6 allocations
    2023-07-15T13:51:05Z: Restarting running tasks in allocation "2332b110" for group "master"
    2023-07-15T13:51:05Z: Restarting running tasks in allocation "259c4785" for group "master"
    2023-07-15T13:51:05Z: Restarting running tasks in allocation "dc2c11f4" for group "master"
    2023-07-15T13:51:06Z: Restarting running tasks in allocation "bae291ce" for group "data"
    2023-07-15T13:51:07Z: Restarting running tasks in allocation "e2b528f6" for group "data"
    2023-07-15T13:51:08Z: Restarting running tasks in allocation "982ff05e" for group "data"
==> 2023-07-15T13:51:08Z: Job restart finished

Job restarted successfully!

This is not true. It did not restart successfully. One of the allocations is still pending. These things have health checks. The last alloc is NOT healthy. How is that a success?

According to /var/log/messages on the NEW host where the allocation is pending:

[ERROR] client.rpc: error performing RPC to server: error="rpc error: controller publish: CSI client error (retryable): controller attach volume: CSI.ControllerAttachVolume: volume \"vol-0761082735e8a6b3c\" is already published on another node and does not have MULTI_NODE volume capability: rpc error: code = FailedPrecondition desc = i-003f193053ec9d2cb" rpc=CSIVolume.Claim server=10.20.11.253:4647
[ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: controller publish: CSI client error (retryable): controller attach volume: CSI.ControllerAttachVolume: volume \"vol-0761082735e8a6b3c\" is already published on another node and does not have MULTI_NODE volume capability: rpc error: code = FailedPrecondition desc = i-003f193053ec9d2cb" rpc=CSIVolume.Claim server=10.20.11.253:4647
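For context, that FailedPrecondition is the EBS plugin refusing to attach a volume that is still published on another node; with single-node-writer access only one node can hold the claim at a time. The group-level volume block in the job presumably looks something like this sketch (names inferred from the volume status output, not the actual job spec):

volume "elasticsearch_data" {
  type            = "csi"
  source          = "elasticsearch_data"
  # per_alloc assumed, given the elasticsearch_data[0], [1], ... volume names
  per_alloc       = true
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}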

The OLD node where the volume was released looks good:

node.go:447] "NodeUnpublishVolume: called" args={"volume_id":"vol-0b3e474666422c1e0","target_path":"/local/csi/per-alloc/f0954d73-5a04-3f58-3c42-548e36ab7b03/elasticsearch_master[1]/rw-file-system-single-node-writer"}
node.go:465] "NodeUnpublishVolume: unmounting" target="/local/csi/per-alloc/f0954d73-5a04-3f58-3c42-548e36ab7b03/elasticsearch_master[1]/rw-file-system-single-node-writer"
mount_linux.go:362] Unmounting /local/csi/per-alloc/f0954d73-5a04-3f58-3c42-548e36ab7b03/elasticsearch_master[1]/rw-file-system-single-node-writer
node.go:461] "NodeUnPublishVolume: volume operation finished" volumeId="vol-0b3e474666422c1e0"
inflight.go:74] "Node Service: volume operation finished" key="vol-0b3e474666422c1e0"
node.go:275] "NodeUnstageVolume: called" args={"volume_id":"vol-0b3e474666422c1e0","staging_target_path":"/local/csi/staging/elasticsearch_master[1]/rw-file-system-single-node-writer"}
node.go:315] "NodeUnstageVolume: unmounting" target="/local/csi/staging/elasticsearch_master[1]/rw-file-system-single-node-writer"
mount_linux.go:362] Unmounting /local/csi/staging/elasticsearch_master[1]/rw-file-system-single-node-writer
node.go:320] "NodeUnStageVolume: successfully unstaged volume" volumeID="vol-0b3e474666422c1e0" target="/local/csi/staging/elasticsearch_master[1]/rw-file-system-single-node-writer"
node.go:290] "NodeUnStageVolume: volume operation finished" volumeID="vol-0b3e474666422c1e0"
inflight.go:74] "Node Service: volume operation finished" key="vol-0b3e474666422c1e0"

The NEW Node where the volume should have been mounted appears to show no evidence of mounting the volume for this allocation.

@josh-m-sharpe
Author

Interestingly, when I shut down the old node, this "bad" allocation fixed itself. Here are logs from the controller leading up to when I shut down the old node:

I0715 14:08:28.004618       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0715 14:08:54.114756       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-0761082735e8a6b3c","node_id":"i-0630af9cba3545b13","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0715 14:08:54.536755       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdab" volume="vol-0761082735e8a6b3c"
E0715 14:08:54.536785       1 driver.go:122] "GRPC error" err="rpc error: code = FailedPrecondition desc = i-003f193053ec9d2cb"
I0715 14:08:58.006631       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0715 14:09:28.008308       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0715 14:09:54.546847       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-0761082735e8a6b3c","node_id":"i-0630af9cba3545b13","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0715 14:09:54.950279       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdab" volume="vol-0761082735e8a6b3c"
E0715 14:09:54.950311       1 driver.go:122] "GRPC error" err="rpc error: code = FailedPrecondition desc = i-003f193053ec9d2cb"
I0715 14:09:58.009935       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0715 14:10:28.011205       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0715 14:10:54.960867       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-0761082735e8a6b3c","node_id":"i-0630af9cba3545b13","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0715 14:10:55.465637       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdab" volume="vol-0761082735e8a6b3c"
E0715 14:10:55.465669       1 driver.go:122] "GRPC error" err="rpc error: code = FailedPrecondition desc = i-003f193053ec9d2cb"
I0715 14:10:58.013074       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0715 14:11:28.014614       1 controller.go:446] "ControllerGetCapabilities: called" args={}
I0715 14:11:55.477677       1 controller.go:355] "ControllerPublishVolume: called" args={"volume_id":"vol-0761082735e8a6b3c","node_id":"i-0630af9cba3545b13","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}}
I0715 14:11:56.050496       1 cloud.go:541] "[Debug] AttachVolume" volumeID="vol-0761082735e8a6b3c" nodeID="i-0630af9cba3545b13" resp=<
	{
	  AttachTime: 2023-07-15 14:11:56.034 +0000 UTC,
	  Device: "/dev/xvdab",
	  InstanceId: "i-0630af9cba3545b13",
	  State: "attaching",
	  VolumeId: "vol-0761082735e8a6b3c"
	}
 >
I0715 14:11:56.109457       1 cloud.go:713] "Waiting for volume state" volumeID="vol-0761082735e8a6b3c" actual="attaching" desired="attached"
I0715 14:11:57.165213       1 manager.go:191] "[Debug] Releasing in-process" attachment entry="/dev/xvdab" volume="vol-0761082735e8a6b3c"
I0715 14:11:57.165240       1 controller.go:383] "[Debug] ControllerPublishVolume: attached to node" volumeID="vol-0761082735e8a6b3c" nodeID="i-0630af9cba3545b13" devicePath="/dev/xvdab"
I0715 14:11:58.015920       1 controller.go:446] "ControllerGetCapabilities: called" args={}

And unfortunately, I lost any logs on the OLD host's ebs-node when this happened. That said, per my previous comment, the volume was successfully detached.

@josh-m-sharpe
Author

Ugh, so setting all that up and running it a 2nd and 3rd time worked, both times. Good sign. Let me script this whole thing up and run it 15 more times to see what happens. There are still some fishy things happening here.

$ nomad job restart -reschedule plugin-aws-ebs-controller
==> 2023-07-15T15:28:55Z: Restarting 1 allocation
    2023-07-15T15:28:55Z: Rescheduling allocation "3933bb88" for group "controller"
    2023-07-15T15:28:55Z: Allocation "3933bb88" replaced by "c4b9a3b1", waiting for "c4b9a3b1" to start running
    2023-07-15T15:28:57Z: Allocation "c4b9a3b1" is "running"
==> 2023-07-15T15:28:57Z: Job restart finished

Job restarted successfully!
$ nomad job restart -reschedule elasticsearch
==> 2023-07-15T15:29:34Z: Restarting 6 allocations
    2023-07-15T15:29:34Z: Rescheduling allocation "64ce6726" for group "data"
    2023-07-15T15:29:34Z: Allocation "64ce6726" replaced by "4ae3533c", waiting for "4ae3533c" to start running
    2023-07-15T15:30:30Z: Allocation "4ae3533c" is "running"
    2023-07-15T15:30:30Z: Rescheduling allocation "b6344b77" for group "data"
    2023-07-15T15:30:30Z: Allocation "b6344b77" replaced by "5f7770cc", waiting for "5f7770cc" to start running
    2023-07-15T15:31:03Z: Allocation "5f7770cc" is "running"
    2023-07-15T15:31:03Z: Rescheduling allocation "16b8c0a6" for group "data"
    2023-07-15T15:31:03Z: Allocation "16b8c0a6" replaced by "18b2b6fd", waiting for "18b2b6fd" to start running
    2023-07-15T15:31:36Z: Allocation "18b2b6fd" is "running"
    2023-07-15T15:31:36Z: Rescheduling allocation "632e3e3c" for group "master"
    2023-07-15T15:31:36Z: Allocation "632e3e3c" replaced by "c3b510bf", waiting for "c3b510bf" to start running
    2023-07-15T15:31:54Z: Allocation "c3b510bf" is "running"
    2023-07-15T15:31:54Z: Rescheduling allocation "19499290" for group "master"
    2023-07-15T15:31:55Z: Allocation "19499290" replaced by "9105d242", waiting for "9105d242" to start running
    2023-07-15T15:32:12Z: Allocation "9105d242" is "running"
    2023-07-15T15:32:12Z: Rescheduling allocation "22cb2a8e" for group "master"
    2023-07-15T15:32:12Z: Allocation "22cb2a8e" replaced by "6ca6c289", waiting for "6ca6c289" to start running
    2023-07-15T15:32:31Z: Allocation "6ca6c289" is "running"
==> 2023-07-15T15:32:31Z: Job restart finished

Job restarted successfully!

@josh-m-sharpe
Author

josh-m-sharpe commented Jul 16, 2023

What does this even mean:

Rescheduling elasticsearch
==> 2023-07-16T14:36:09Z: Restarting 6 allocations
    2023-07-16T14:36:09Z: Rescheduling allocation "6ca6c289" for group "master"
==> 2023-07-16T14:36:09Z: 1 error occurred while restarting job:
                               * Error restarting allocation "6ca6c289": Evaluation "b7c9f968" has placement failures for group "master":
                                   * Class "es": 3 nodes excluded by filter
                                   * Class "og": 3 nodes excluded by filter
                                   * Constraint "${node.class} = es": 3 nodes excluded by filter
                                   * Constraint "CSI plugin aws-ebs0 is missing from client 9a8e624e-4256-8a90-2a60-bc455b000a19": 1 nodes excluded by filter
                                   * Constraint "CSI plugin aws-ebs0 is missing from client 6c016440-1e3c-679b-bcd5-66091cbdd017": 1 nodes excluded by filter
                                   * Constraint "CSI plugin aws-ebs0 is missing from client 125821fa-7962-344f-8435-5089f21212a8": 1 nodes excluded by filter
==> 2023-07-16T14:36:09Z: The errors above are likely to happen again.
                               Ignore them anyway and proceed with the next batch? [y/N]

When that happens, why is my cluster in this state:

[screenshot: Nomad "Cluster Topology" page]

So... there's an error, nomad can't find a place for the new allocation, but it still pulled down the allocation it was rescheduling?

It just hangs here.

==> 2023-07-16T14:36:09Z: The errors above are likely to happen again.
                               Ignore them anyway and proceed with the next batch? [y/N]

This is SUPER ambiguous. It feels like there's no good option:
Y -> ignore errors. Proceed and pray.
N -> Just aborts, but doesn't revert and put the allocation back in place

Rerunning restart -reschedule (a second time) thoroughly screws things up:

[screenshot: Nomad "Cluster Topology" page]

@josh-m-sharpe
Author

Even stopping and purging the job leaves the entire cluster borked:

[screenshot: Nomad "Cluster Topology" page]

@josh-m-sharpe
Author

do we think #17996 impacts this issue?

@lgfa29
Contributor

lgfa29 commented Jul 20, 2023

Hi @josh-m-sharpe,

I had a reply draft here but then I got roped into some 1.6.0 release and post-release work.

Yes, we believe #17996 will also help your situation here. In #15415 we've learned that the EBS plugin is not as reliable as the CSI spec expects, so #17996 makes some improvements and also increases visibility when things don't go as expected (#17996 (comment)).

For this specific issue, looking at the placement issues I see a few different reasons:

  1. Class "es": 3 nodes excluded by filter, Class "og": 3 nodes excluded by filter, and Constraint "${node.class} = es": 3 nodes excluded by filter. From the job you provided I don't see any node class based constraint, were they added later? And is expected that nodes in the es and og classes be excluded?
  2. did not meet topology requirement. These happen when the volume requested by the allocation has an incompatible topology with these clients. In your case I imagine this happens when evaluating nodes from a different AZ, so it seems expected.
  3. Constraint "CSI volume elasticsearch_data[1] is unschedulable or is read-only": 1 nodes excluded by filter. This usually means the volume attachment mode does not allow writes (which I don't think is your case) or the volume has not healthy plugins, which may be happening here.
  4. CSI plugin aws-ebs0 is missing from client. These happen when the client doesn't have the requested node plugin. Are these nodes expected to have them?

So I think we need to try a more methodical approach:

  • Mark existing node as ineligible.
  • Provision new nodes.
  • Reschedule the controller plugin.
  • Verify all clients are able to heartbeat to the controller plugin (this can be verified in the controller plugin alloc logs).
  • Verify new clients have a node plugin instance.
  • Verify EBS plugin is healthy.
  • Reschedule Elasticsearch (I don't know enough about ES to prescribe a specific order, but it may be worth restarting data and master allocations separately).
  • If at any point the reschedule command fails, choose N to stop the process (or set -on-error=fail).
  • Verify plugin is still healthy.
  • Collect logs from Nomad server, Nomad client, CSI controller, and CSI node plugins.

Screenshots and command outputs help paint the overall picture, but in order to actually debug problems we do need log information.

@josh-m-sharpe
Author

Honestly, I did all that including all the validation - and it all worked that first time above. I think I'll wait for v1.6.1 w/ that fix before trying again.

I think there are some tangential issues here that maybe deserve their own issue threads:

  • I don't know a good way to set a service check on the EBS controller or node plugins, and I would think this should be a thing just like any other service. They take time to start and be prepped. I don't have a smoking gun here for evidence, but it's logical?
  • Shouldn't restart -reschedule (and maybe vanilla restart) roll back just like a canceled job deployment? Those last few screenshots were my cluster in an extremely unhealthy state. There was apparently zero recourse for solving it. Even stopping and purging the job didn't remove the pending allocations.

@josh-m-sharpe
Author

I've been testing 1.5.8 most of today. Results are mixed.

It seems the issue may be resolved when using restart -reschedule, in that the allocations are redeployed and the volumes reattached.
So, since that was "fixed", I started hammering elasticsearch to ensure zero-downtime restarts, but it seems that restart doesn't gracefully restart allocations.

So, trying another angle: adding an arbitrary variable to the job file and using nomad job run.
This did not work. None of the allocations actually reallocated. They stayed pending before the deployment failed. Since the "old" nodes were marked ineligible, the rollback process actually started a new set of new allocations on the new nodes (that's a mouthful) - and those allocated correctly. Very weird. I know not having logs here is useless. This seems like it should work - it's just a new job deployment.

3rd approach was using drains, which seems the appropriate solution anyways, and actually seems to play nice.

Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Jul 22, 2023
@lgfa29
Contributor

lgfa29 commented Jul 24, 2023

I'm glad to hear 1.5.8 improved things.

Just to pull a bit on the last threads to check if there's any follow-up work needed.

but it seems that restart doesn't gracefully restart allocations.

Would you mind expanding on what "gracefully" means here, or which behaviour you were expecting? The job restart command is fairly simple in its concept (the implementation was much trickier 😅). From our docs:

Rescheduling the job uses the Stop Allocation API endpoint to stop the allocations and trigger the Nomad scheduler to compute new placements. This may cause the new allocations to be scheduled in different clients from the originals.
[...]
When rescheduling, the current allocations are stopped triggering the Nomad scheduler to create replacement allocations that may be placed in different clients. The command waits until the new allocations have client status ready before proceeding with the remaining batches. Services health checks are not taken into account.

We know this is not perfect, and had initially planned something more robust but we found that approach to be a lot more complex than anticipated and decided to start with this simpler version.

If there's a need to wait before proceeding to the next restart you could use the -batch-wait flag.

So, trying another angle: adding an arbitrary variable to the job file and using nomad job run.
This did not work. None of the allocations actually reallocated.

This is indeed strange, but I wonder if the problem may have been which field was changed. Some changes are in-place updates, where only the tasks may be recreated. If that happened, and the CSI volume got detached, the task may fail to start again in the old host.

3rd approach was using drains, which seems the appropriate solution anyways , and actually seems to play nice.

So this approach was just about draining the old nodes?

@josh-m-sharpe
Author

Worth recalling my goal of finding some way to replace the nomad hosts running elasticsearch. ES has to be zero-downtime for us, so this is all just planning for the day the nomad hosts have to be switched out for whatever reason.

Responses inline.

I'm glad to hear 1.5.8 improved things.

Just to pull a bit on the last threads to check if there's any follow-up work needed.

but it seems that restart doesn't gracefully restart allocations.

Would you mind expanding on what "gracefully" means here, or which behaviour you were expecting? The job restart command is fairly simple in its concept (the implementation was much trickier 😅). From our docs:

"Gracefully" just means zero-downtime to us. I need Nomad to correctly pull services in and out of Consul so that Traefik routes requests to healthy allocations.

Rescheduling the job uses the Stop Allocation API endpoint to stop the allocations and trigger the Nomad scheduler to compute new placements. This may cause the new allocations to be scheduled in different clients from the originals.
[...]
When rescheduling, the current allocations are stopped triggering the Nomad scheduler to create replacement allocations that may be placed in different clients. The command waits until the new allocations have client status ready before proceeding with the remaining batches. Services health checks are not taken into account.

Okay, so my initial comment was right after I set up 1.5.8 with the fixed CSI issue. At that point in this here journey, restart -reschedule finally moved the allocations (and volumes) for the first time. I then immediately dove into battle-testing elasticsearch while attempting to move the allocations. I immediately saw dropped connections to ES during a restart. It was only then that I realized there is a restart block that differs from the update block. Seeing far fewer options in the restart stanza than in update (specifically around max_parallel and checks), it occurred to me that restart may not be the correct solution. If I can't control how many allocations move at once, or the various deadlines, I wasn't sure it was a good idea to proceed. I moved on.

I tried the hacky variable-in-job approach here (more on that below). But nothing learned there.

Next I retried using Drain to move the allocations and that worked much better - and knowing Drain was well-defined to be controlled by the migrate block, this seemed directionally the right thing to do. In doing so I uncovered some additional attributes that I needed specifically for elasticsearch. Specifically, I added a 60s kill_timeout and shutdown_delay. This allowed Drain to successfully migrate allocations without dropping connections.
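A rough sketch of where those knobs live in the job spec; the group and task names, driver, and image are assumptions, and the migrate values are only illustrative:

group "data" {
  # drain/migrate behaviour: move one allocation at a time and wait for
  # Consul checks to pass before moving on to the next one
  migrate {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "30s"
    healthy_deadline = "5m"
  }

  task "elasticsearch" {
    driver = "docker"

    # give ES time to drain connections before the container is killed
    shutdown_delay = "60s"
    kill_timeout   = "60s"

    config {
      # placeholder image, not the reporter's actual one
      image = "docker.elastic.co/elasticsearch/elasticsearch:8.8.2"
    }
  }
}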

I just set things up again (for you) to re-run restart -reschedule. Now that I know I need the shutdown/kill delay args and they are in place, I figured it was worth giving it another shot. Now, it does not drop connections. However, it still seemingly does something odd. It appears (solely from watching Consul) that Nomad is moving on to the next allocation before the previous allocation is fully healthy in Consul. This seems not-good to me, even if connections aren't dropped. Connections, in fact, are probably not dropped because they're routed through Traefik, which is backed by Consul -- so, while the allocations are not healthy in Consul, Traefik doesn't route to them. Not sure why Nomad moves on so fast. Nomad is definitely waiting for something; I'm just not sure for what, if the alloc isn't healthy in Consul.

We know this is not perfect, and had initially planned something more robust but we found that approach to be a lot more complex than anticipated and decided to start with this simpler version.

If there's a need to wait before proceeding to the next restart you could use the -batch-wait flag.

So, trying another angle: adding an arbitrary variable to the job file and using nomad job run.
This did not work. None of the allocations actually reallocated.

This is indeed strange, but I wonder if the problem may have been which field was changed. Some changes are in-place updates, where only the tasks may be recreated. If that happened, and the CSI volume got detached, the task may fail to start again in the old host.

Yea, strange for sure, but this had "hack" written all over it. I'm not super interested in re-inventing this scenario to test.

3rd approach was using drains, which seems the appropriate solution anyways, and actually seems to play nice.

So this approach was just about draining the old nodes?

Yes, truth be told, in retrospect I probably only ever moved on to restart because draining didn't initially work, not because drain was broken, but due to the CSI bug. So I got sidetracked into restart and complained about that. Sorry :)

@josh-m-sharpe
Author

josh-m-sharpe commented Jul 25, 2023

Looking at this a bit more this morning. I think there definitely might be something not quite right with restart -reschedule and its relationship with Consul. As I test restart -reschedule with some other HTTP services that are similarly set up to route through Traefik, it seems that it doesn't wait BETWEEN allocations. So this is all anecdotal, but here's what I'm seeing:

Running restart -reschedule against a job with two running allocations.
The job has one task with one http service. This job drains well (migrate block), and also handles a job definition update (update block) well.
Service is "green" in Consul with "2 instances"
I have hey (similar to Apache ab) running, hammering the service and watching for dropped requests

  1. execute restart -reschedule
  2. 1 allocation immediately deregistered in Consul. Consul is still green, but shows 1 instance
  3. no dropped requests here
  4. There's some waiting here. This task does have a configured 36s shutdown_delay but the gap isn't 36s. It's also a 2GB docker image so it takes a hot minute to download.
  5. new alloc starts spinning up
  6. Seems that when we hit Allocation "733f5788" is "running" the 2nd instance is pulled out of consul.
  7. Consul goes red and shows 0 instances
  8. Dropped requests
  9. Eventually the allocs become healthy and requests are served.

So, my best guess is that restart isn't honoring the health of the service. Just because the alloc is "running" doesn't mean it is healthy. I think?

Notably there's no configuration option in restart for health_check like there is in update and migrate. Should there be?

Maybe my expectation here is wrong?
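For comparison, deployments triggered by nomad job run can gate on Consul health through the update block, which has no job restart equivalent; a minimal sketch with illustrative values:

update {
  max_parallel      = 1
  health_check      = "checks"
  min_healthy_time  = "30s"
  healthy_deadline  = "5m"
  progress_deadline = "10m"
}

nomad job restart, by contrast, only waits on the replacement allocation's client status and, per the docs quoted earlier, does not take service health checks into account.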

Come to think of it, the restart block docs say "The restart block configures a task's behavior on task failure" -- I guess that's not nomad restart - so which block configures how nomad restart operates?

Here's the restart output:

==> 2023-07-25T14:57:59Z: Restarting 2 allocations
    2023-07-25T14:57:59Z: Rescheduling allocation "347464ed" for group "puma"
    2023-07-25T14:58:00Z: Allocation "347464ed" replaced by "733f5788", waiting for "733f5788" to start running
    2023-07-25T14:58:59Z: Still waiting for allocation "347464ed" to be replaced
    2023-07-25T14:59:28Z: Allocation "733f5788" is "running"
    2023-07-25T14:59:28Z: Rescheduling allocation "97f5deb9" for group "puma"
    2023-07-25T14:59:29Z: Allocation "97f5deb9" replaced by "022d79d9", waiting for "022d79d9" to start running
    2023-07-25T15:00:28Z: Still waiting for allocation "97f5deb9" to be replaced
    2023-07-25T15:00:58Z: Allocation "022d79d9" is "running"
==> 2023-07-25T15:00:58Z: Job restart finished

@josh-m-sharpe
Author

Generally, it's peculiar to me that an allocation goes into "pending" status when it's in fact shutting down.

@josh-m-sharpe
Author

@lgfa29 I think I'm done blowing you up for now. Thanks for caring! :)
