CSI plugin expected controllers exceeds actual controller count #12771

Closed
iSchluff opened this issue Apr 25, 2022 · 7 comments · Fixed by #12774
Comments

@iSchluff

Nomad version

Nomad v1.3.0-beta.1 (2eba643)

Operating system and Environment details

Ubuntu 20.04.4 on x64

Issue

Due to operator error I registered CSI nodes as controllers, so the expected controller count on the plugin is now too high.

#  nomad plugin status
Container Storage Interface
ID        Provider             Controllers Healthy/Expected  Nodes Healthy/Expected
ceph-csi  cephfs.csi.ceph.com  1/4                           3/3

Nomad still shows the plugin as healthy and the volumes as schedulable; however, waiting for the plugin to become healthy via Terraform fails:

data "nomad_plugin" "ceph" {
  plugin_id        = "ceph-csi"
  wait_for_healthy = true
}
2022-04-25T12:25:32.941+0200 [TRACE] provider.terraform-provider-nomad_v1.4.16_x4: plugin received interrupt signal, ignoring: count=2 timestamp=2022-04-25T12:25:32.941+0200
2022-04-25T12:25:33.009+0200 [DEBUG] provider.terraform-provider-nomad_v1.4.16_x4: 2022/04/25 12:25:33 [DEBUG] Getting plugin "ceph-csi"...
2022-04-25T12:25:33.065+0200 [DEBUG] provider.terraform-provider-nomad_v1.4.16_x4: 2022/04/25 12:25:33 [DEBUG] plugin ceph-csi not yet healthy: 1/4 controllers healthy  3/3 nodes healthy
2022-04-25T12:25:33.065+0200 [DEBUG] provider.terraform-provider-nomad_v1.4.16_x4: 2022/04/25 12:25:33 [TRACE] Waiting 10s before next try
^C2022-04-25T12:25:34.397+0200 [TRACE] provider.terraform-provider-nomad_v1.4.16_x4: plugin received interrupt signal, ignoring: count=3 timestamp=2022-04-25T12:25:34.397+0200
2022-04-25T12:25:35.967+0200 [TRACE] dag/walk: vertex "module.nomad-registry.nomad_volume.registry (expand)" is waiting for "module.nomad-registry.data.nomad_plugin.ceph (expand)"
2022-04-25T12:25:36.275+0200 [TRACE] dag/walk: vertex "module.nomad-registry (close)" is waiting for "module.nomad-registry.nomad_volume.registry (expand)"
2022-04-25T12:25:36.648+0200 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/hashicorp/nomad\"] (close)" is waiting for "module.nomad-registry.nomad_volume.registry (expand)"
2022-04-25T12:25:37.349+0200 [TRACE] dag/walk: vertex "root" is waiting for "provider[\"registry.terraform.io/hashicorp/nomad\"] (close)"

Is it possible to reduce the expected controller count without recreating the cluster?

Reproduction steps

Start more csi controllers than you need, stop some of them.
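The misconfiguration can be reproduced with a jobspec along these lines (a hypothetical sketch; the job name, image, and most attributes are illustrative, and a real ceph-csi deployment needs more configuration):

```hcl
# Hypothetical repro sketch: the per-node plugin tasks mistakenly declare
# type = "controller", inflating the plugin's expected controller count.
job "ceph-csi-node" {
  datacenters = ["dc1"]
  type        = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "quay.io/cephcsi/cephcsi:v3.6.1" # illustrative image/tag
      }

      csi_plugin {
        id        = "ceph-csi"
        type      = "controller" # should have been "node"
        mount_dir = "/csi"
      }
    }
  }
}
```

Running this, then flipping `type` back to `"node"` and re-running, leaves the plugin's expected controller count stuck above the actual count.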

Expected Result

The expected controller count should probably never exceed the actual count; arguably it should never exceed 1 here?

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@tgross tgross self-assigned this Apr 25, 2022
@tgross
Member

tgross commented Apr 25, 2022

Hi @iSchluff! I want to make sure I can reproduce this scenario accurately:

Due to operator error I registered csi nodes as controllers, therefore my expected controller count on the plugin is now too high.
...
Start more csi controllers than you need, stop some of them.

So you had a job that was registered as a controller, and then changed it to node and re-ran? Did the old controller allocations get replaced entirely by node allocations or did they get in-place updates?

@iSchluff
Author

Yes, exactly. I accidentally ran node tasks as controller tasks. The old allocations got replaced by an in-place job update.
Just for completeness: I was already running the tasks as nodes before, replaced the CSI configuration, and then went back.
So basically an in-place node > controller > node change.

@tgross
Member

tgross commented Apr 25, 2022

Ok, thank you. So there are likely two bugs here: the counts aren't being reset properly, but also we should have replaced the tasks entirely in that case rather than doing an in-place update. I'll see if I can put together a quick repro and report back.

@tgross
Member

tgross commented Apr 25, 2022

I was able to reproduce the non-destructive update pretty easily just by switching the csi_plugin.type. #12774 should fix this and prevent the invariants the counts expect from being violated.
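Conceptually, the fix forces a change to the `csi_plugin` block to be a destructive update. A minimal sketch of that invariant in Go (simplified stand-in types, not the actual Nomad scheduler code):

```go
package main

import "fmt"

// TaskCSIPlugin is a simplified stand-in for Nomad's csi_plugin block
// (illustrative types only, not the real scheduler structs).
type TaskCSIPlugin struct {
	ID   string
	Type string // "node", "controller", or "monolith"
}

// requiresDestructiveUpdate sketches the invariant: if the csi_plugin
// block is added, removed, or its ID/Type changes, the allocation must
// be replaced rather than updated in place, so stale controllers are
// deregistered and the expected counts stay consistent.
func requiresDestructiveUpdate(prev, next *TaskCSIPlugin) bool {
	if prev == nil || next == nil {
		return prev != next // plugin block added or removed entirely
	}
	return prev.ID != next.ID || prev.Type != next.Type
}

func main() {
	before := &TaskCSIPlugin{ID: "ceph-csi", Type: "controller"}
	after := &TaskCSIPlugin{ID: "ceph-csi", Type: "node"}
	fmt.Println(requiresDestructiveUpdate(before, after)) // prints: true
}
```

With this check in place, the node > controller > node flip described above would replace the allocations each time instead of updating them in place.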

@tgross
Member

tgross commented Apr 25, 2022

I've merged that fix and it'll ship in the GA release of Nomad 1.3.0. Thanks for opening this issue @iSchluff!

@iSchluff
Author

thanks for the quick response

@github-actions

github-actions bot commented Oct 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 8, 2022