Draining a node confuses CSI plugin node health #9810

apollo13 · 2021-01-13T21:05:36Z

Nomad version

Nomad v1.0.1 (c9c68aa)

Operating system and Environment details

Debian stable

Issue

When draining a node, the CSI plugin node health display gets confused:

Reproduction steps

Set up a CSI plugin on more than one host; I have one controller and three nodes (the latter as system jobs). A configuration example can be found at https://gitlab.com/rocketduck/csi-plugin-nfs/-/tree/main/nomad (my plugin is a simple CSI plugin that provisions from an NFS share, but any plugin should do).
Drain one node (preferably not the controller)
Observe the above image
Mark the node as eligible again
Observe that the node count shows the expected values again.
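
For anyone who wants to script the steps above, here is a minimal sketch that wraps the Nomad CLI. The helper names `drain_cycle_commands` and `run_cycle` are made up for illustration, and the node and plugin IDs are placeholders you would substitute; it assumes the `nomad` binary is on the PATH and can already reach the cluster.

```python
import subprocess
from typing import List


def drain_cycle_commands(node_id: str, plugin_id: str) -> List[List[str]]:
    """Return the CLI calls for one drain / re-enable cycle, in order.

    Hypothetical helper: it only builds the argument lists, so the
    sequence can be inspected before anything touches the cluster.
    """
    return [
        # Drain the node; -yes skips the interactive confirmation.
        ["nomad", "node", "drain", "-enable", "-yes", node_id],
        # Look at the plugin's healthy/expected counts while drained.
        ["nomad", "plugin", "status", plugin_id],
        # Mark the node as eligible for scheduling again.
        ["nomad", "node", "eligibility", "-enable", node_id],
        # Look at the counts again once the node is back.
        ["nomad", "plugin", "status", plugin_id],
    ]


def run_cycle(node_id: str, plugin_id: str) -> None:
    """Run the cycle, echoing each command before executing it."""
    for cmd in drain_cycle_commands(node_id, plugin_id):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_cycle("<node-id>", "nfs")` against a test cluster should walk through the same sequence as the manual steps; comparing the two `nomad plugin status` outputs shows whether the counts drift while the node is drained.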

tgross · 2021-01-14T13:46:53Z

Hi @apollo13! In 1.0.0 we shipped some fixes for the plugin counts in the UI and API, but it looks like we missed a case. It's interesting that the count fixes itself when the node plugin comes back... that might be a clue as to where the problem is. Thanks for opening this issue.

apollo13 · 2021-02-24T16:51:42Z

Also interesting in this context:

The node's "Client Events" never show that CSI becomes healthy again.

apollo13 · 2021-04-05T12:53:24Z

Hi @tgross, I began looking through my plugin and noticed something interesting in the response when the Nomad UI calls v1/plugin/csi%2Fnfs:

{
	"Allocations": [],
	"ControllerRequired": false,
	"Controllers": {
		"a8efb906-0138-7c55-d156-8de72b5f356c": {
			"AllocID": "d4846cdb-ce40-09de-cd6d-4e88e0a1a816",
			"ControllerInfo": {
				"SupportsAttachDetach": false,
				"SupportsListVolumes": false,
				"SupportsListVolumesAttachedNodes": false,
				"SupportsReadOnlyAttach": false
			},
			"HealthDescription": "healthy",
			"Healthy": true,
			"PluginID": "nfs",
			"RequiresControllerPlugin": true,
			"RequiresTopologies": false,
			"UpdateTime": "2021-04-04T18:26:23.776357453Z"
		}
	},
	"ControllersExpected": 1,
	"ControllersHealthy": 1,
	"CreateIndex": 527856,
	"ID": "nfs",
	"ModifyIndex": 630304,
	"Nodes": {
		"8e100f4b-6d1b-ca4a-e8d6-b27a65ddac93": {
			"AllocID": "a59ff912-1cd2-1951-000b-6e697d939ed4",
			"HealthDescription": "healthy",
			"Healthy": true,
			"NodeInfo": {
				"AccessibleTopology": null,
				"ID": "nomad01",
				"MaxVolumes": 9223372036854776000,
				"RequiresNodeStageVolume": false
			},
			"PluginID": "nfs",
			"RequiresControllerPlugin": true,
			"RequiresTopologies": false,
			"UpdateTime": "2021-04-04T18:28:51.665163423Z"
		},
		"a8efb906-0138-7c55-d156-8de72b5f356c": {
			"AllocID": "e6d594e3-c483-65c3-05e7-7fd161d141d2",
			"HealthDescription": "healthy",
			"Healthy": true,
			"NodeInfo": {
				"AccessibleTopology": null,
				"ID": "nomad02",
				"MaxVolumes": 9223372036854776000,
				"RequiresNodeStageVolume": false
			},
			"PluginID": "nfs",
			"RequiresControllerPlugin": true,
			"RequiresTopologies": false,
			"UpdateTime": "2021-04-04T18:28:51.681156329Z"
		}
	},
	"NodesExpected": 0,
	"NodesHealthy": 2,
	"Provider": "dev.rocketduck.csi.nfs",
	"Version": "0.2.0"
}
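
The odd part of the response above is that NodesExpected is 0 while two node entries are listed as healthy. A rough sketch of how such a response could be checked automatically follows. The function names are invented for this example, the default address is an assumption (Nomad's usual local HTTP address, no ACL token), and the rule "expected should not be lower than healthy" is my assumption about what the counts ought to satisfy, not something taken from the Nomad source.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def plugin_url(plugin_id: str, addr: str = "http://127.0.0.1:4646") -> str:
    """Build the URL the UI request in this thread uses (csi%2F<id>)."""
    # safe="" makes quote() encode the "/" in "csi/<id>" as %2F.
    return f"{addr}/v1/plugin/{quote('csi/' + plugin_id, safe='')}"


def fetch_plugin(plugin_id: str, addr: str = "http://127.0.0.1:4646") -> dict:
    """Fetch and decode the plugin status (assumes no ACL token is needed)."""
    with urlopen(plugin_url(plugin_id, addr)) as resp:
        return json.load(resp)


def count_mismatches(plugin: dict) -> list:
    """Compare the summary counters against the listed instances."""
    problems = []
    for kind in ("Controllers", "Nodes"):
        instances = plugin.get(kind) or {}
        healthy = sum(1 for inst in instances.values() if inst.get("Healthy"))
        reported = plugin.get(f"{kind}Healthy", 0)
        expected = plugin.get(f"{kind}Expected", 0)
        if reported != healthy:
            problems.append(
                f"{kind}Healthy={reported} but {healthy} healthy entries listed"
            )
        # Assumption: a plugin should never expect fewer instances
        # than are currently reporting healthy.
        if expected < healthy:
            problems.append(
                f"{kind}Expected={expected} is below {healthy} healthy entries"
            )
    return problems
```

Run against the response quoted above, `count_mismatches` would flag only the NodesExpected value; the controller counters are consistent with the single listed controller.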

Why is the top-level ControllerRequired false? Both the controller and the node subobjects have RequiresControllerPlugin set to true, which comes from their capabilities: https://gitlab.com/rocketduck/csi-plugin-nfs/-/blob/1c1954605c3bb11c366380ce72fd6d7dbe4e27e7/src/csi_plugin_nfs/identity.py#L15-22
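
To make the mismatch easier to spot in other clusters, here is a small sketch that lists every plugin instance whose RequiresControllerPlugin disagrees with the top-level ControllerRequired. The function name is hypothetical and the field names are taken from the response quoted above; whether the two flags are supposed to agree is exactly the open question, so treat this as a diagnostic aid rather than a statement of intended behavior.

```python
def controller_flag_mismatches(plugin: dict) -> list:
    """Return (kind, key) pairs whose RequiresControllerPlugin differs
    from the plugin's top-level ControllerRequired flag."""
    top_level = bool(plugin.get("ControllerRequired"))
    mismatches = []
    for kind in ("Controllers", "Nodes"):
        # Each map is keyed by an ID string, as in the response above.
        for key, inst in (plugin.get(kind) or {}).items():
            if bool(inst.get("RequiresControllerPlugin")) != top_level:
                mismatches.append((kind, key))
    return mismatches
```

For the response quoted above, this would report the one controller entry and both node entries, since all three set RequiresControllerPlugin to true while the top-level flag is false.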

apollo13 · 2021-07-20T06:44:52Z

I just realized that simply stopping a single node allocation has the same effect. @tgross did you ever get around to reproducing this? I can reliably trigger this.

JohnKiller · 2021-07-20T07:22:24Z

I am experiencing the same problem with the ceph-csi module. Stopping and starting an allocation several times causes the problem to appear.

apollo13 · 2021-11-04T13:00:24Z

@tgross As promised here is a ping ;) Is there any chance that we can work on fixing that? What do you need from me to move this forward?

tgross · 2021-11-04T13:29:09Z

Hi @apollo13! I don't think I need anything else from you, just a little time to get ramped back up and dig into it. Thanks for the ping... I'll assign myself so that I don't lose this.

apollo13 · 2022-01-05T16:54:13Z

Cross linking to #11758

tgross · 2022-02-03T17:08:22Z

Leaving a note that this issue appears to be related to but may have subtly different code paths from #11784.

Note that I don't think this is actually related to #11758 or #10073 (or at least not by itself). Those issues are primarily about count state on the server, whereas here and in #11784 we have evidence of client-side reconnection issues with the CSI plugins. I'll be looking into that as well as the count state as separate patch sets.

apollo13 · 2022-02-03T20:24:22Z

I closely followed your fixes and am not sure yet either (massive kudos for those!). I'll deploy 1.2.5 and see if I can test it manually. In the meantime I will leave this open as a reminder for myself; don't waste time on it unless you have a hunch that this is still unsolved.

tgross · 2022-02-03T20:39:34Z

Oh, sorry if I was unclear. This issue is definitely not fixed yet; I was just saying it may have a cause that's independent of the "plugin counts" issue described in #11758 or #10073.

tgross · 2022-02-08T20:05:09Z

I've opened #12027, which should partially address this issue and also #10073. I think the remaining bits are most likely related to #11784.

tgross · 2022-02-09T16:53:15Z

I've closed this issue via #12027 which will ship in 1.3.0 (plus backports) but I'll look into the plugin alloc restart issues as part of #11784.

tgross added the theme/storage, type/bug, and stage/accepted labels Jan 14, 2021

apollo13 mentioned this issue Feb 23, 2021

Cinder-CSI Plugin reports 0 healthy nodes/controllers after node decommissioning #10073

Closed

apollo13 mentioned this issue Apr 5, 2021

CSI plugin worng health count #10297

Closed

tgross self-assigned this Nov 4, 2021

tgross mentioned this issue Feb 3, 2022

CSI plugin fails to be marked healthy after reboot #11784

Closed

This was referenced Feb 8, 2022

core.sched: failed to GC plugin: plugin_id=<plugin> error="rpc error: Permission denied" #11162

Closed

CSI: use job status not alloc status for plugin updates from summary #12027

Merged

Unable to normally stop and purge system job with csi plugin #11758

Closed

tgross closed this as completed in #12027 Feb 9, 2022

tgross added this to the 1.3.0 milestone Feb 9, 2022

This was referenced Apr 19, 2022

Backport of CSI: use job status not alloc status for plugin updates from summary into release/1.1.x #12627

Merged

Backport of CSI: use job status not alloc status for plugin updates from summary into release/1.2.x #12628

Merged

github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022
