Draining a node confuses CSI plugin node health #9810

Closed
apollo13 opened this issue Jan 13, 2021 · 14 comments · Fixed by #12027
Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/storage, type/bug

Comments

@apollo13
Contributor

Nomad version

Nomad v1.0.1 (c9c68aa)

Operating system and Environment details

Debian stable

Issue

When draining a node, the CSI plugin node health display gets confused:
[image]

Reproduction steps

  • Set up a CSI plugin with more than one host. I have one controller and three nodes (the latter as system jobs). A configuration example can be found here: https://gitlab.com/rocketduck/csi-plugin-nfs/-/tree/main/nomad My plugin is a simple CSI plugin that provisions from an NFS share, but any plugin should do.
  • Drain one node (preferably not the controller).
  • Observe the node health display shown in the image above.
  • Mark the node as eligible again.
  • Observe that the node count shows the expected values again.
@tgross
Member

tgross commented Jan 14, 2021

Hi @apollo13! In 1.0.0 we shipped some fixes for the plugin counts in the UI and API, but it looks like we missed a case. It's interesting that the count fixes itself when the node plugin comes back... that might be a clue as to where the problem is. Thanks for opening this issue.

@apollo13
Contributor Author

Also interesting in this context:
[image]
The node's "Client Events" never show that CSI becomes healthy again.

@apollo13
Contributor Author

apollo13 commented Apr 5, 2021

Hi @tgross, I began looking through my plugin and noticed something interesting when the Nomad UI calls v1/plugin/csi%2Fnfs:

{
	"Allocations": [],
	"ControllerRequired": false,
	"Controllers": {
		"a8efb906-0138-7c55-d156-8de72b5f356c": {
			"AllocID": "d4846cdb-ce40-09de-cd6d-4e88e0a1a816",
			"ControllerInfo": {
				"SupportsAttachDetach": false,
				"SupportsListVolumes": false,
				"SupportsListVolumesAttachedNodes": false,
				"SupportsReadOnlyAttach": false
			},
			"HealthDescription": "healthy",
			"Healthy": true,
			"PluginID": "nfs",
			"RequiresControllerPlugin": true,
			"RequiresTopologies": false,
			"UpdateTime": "2021-04-04T18:26:23.776357453Z"
		}
	},
	"ControllersExpected": 1,
	"ControllersHealthy": 1,
	"CreateIndex": 527856,
	"ID": "nfs",
	"ModifyIndex": 630304,
	"Nodes": {
		"8e100f4b-6d1b-ca4a-e8d6-b27a65ddac93": {
			"AllocID": "a59ff912-1cd2-1951-000b-6e697d939ed4",
			"HealthDescription": "healthy",
			"Healthy": true,
			"NodeInfo": {
				"AccessibleTopology": null,
				"ID": "nomad01",
				"MaxVolumes": 9223372036854776000,
				"RequiresNodeStageVolume": false
			},
			"PluginID": "nfs",
			"RequiresControllerPlugin": true,
			"RequiresTopologies": false,
			"UpdateTime": "2021-04-04T18:28:51.665163423Z"
		},
		"a8efb906-0138-7c55-d156-8de72b5f356c": {
			"AllocID": "e6d594e3-c483-65c3-05e7-7fd161d141d2",
			"HealthDescription": "healthy",
			"Healthy": true,
			"NodeInfo": {
				"AccessibleTopology": null,
				"ID": "nomad02",
				"MaxVolumes": 9223372036854776000,
				"RequiresNodeStageVolume": false
			},
			"PluginID": "nfs",
			"RequiresControllerPlugin": true,
			"RequiresTopologies": false,
			"UpdateTime": "2021-04-04T18:28:51.681156329Z"
		}
	},
	"NodesExpected": 0,
	"NodesHealthy": 2,
	"Provider": "dev.rocketduck.csi.nfs",
	"Version": "0.2.0"
}

Why is the top-level ControllerRequired false? Both the controller and the node sub-objects have RequiresControllerPlugin set to true, which comes from their capabilities: https://gitlab.com/rocketduck/csi-plugin-nfs/-/blob/1c1954605c3bb11c366380ce72fd6d7dbe4e27e7/src/csi_plugin_nfs/identity.py#L15-22
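
For anyone comparing their own output, the inconsistencies in the response above can be flagged mechanically. This is a minimal Python sketch, not anything Nomad ships: the helper names `plugin_path` and `count_mismatches` are hypothetical, `SAMPLE` is a trimmed copy of the response above, and the rules it checks (the expected count should not be lower than the healthy count, and the top-level ControllerRequired should agree with the per-instance RequiresControllerPlugin) are assumptions about what a consistent response would look like.

```python
from urllib.parse import quote


def plugin_path(plugin_id, plugin_type="csi"):
    """Build the path seen in the UI request; the "/" in "csi/nfs" is sent as %2F."""
    return "v1/plugin/" + quote(plugin_type + "/" + plugin_id, safe="")


def count_mismatches(plugin):
    """Return human-readable descriptions of counts that disagree (assumed rules)."""
    problems = []
    requires_controller = False
    for kind in ("Controllers", "Nodes"):
        instances = plugin.get(kind) or {}
        # Count the instances that are individually marked healthy.
        healthy = sum(1 for inst in instances.values() if inst.get("Healthy"))
        reported = plugin.get(kind + "Healthy", 0)
        expected = plugin.get(kind + "Expected", 0)
        if reported != healthy:
            problems.append(
                f"{kind}Healthy is {reported} but {healthy} instances are healthy"
            )
        if expected < reported:
            problems.append(
                f"{kind}Expected ({expected}) is lower than {kind}Healthy ({reported})"
            )
        if any(inst.get("RequiresControllerPlugin") for inst in instances.values()):
            requires_controller = True
    # Assumption: the top-level flag should mirror the per-instance capability.
    if requires_controller and not plugin.get("ControllerRequired"):
        problems.append(
            "ControllerRequired is false but an instance sets RequiresControllerPlugin"
        )
    return problems


# Trimmed copy of the response above; only the fields the checks read are kept.
SAMPLE = {
    "ControllerRequired": False,
    "Controllers": {
        "a8efb906-0138-7c55-d156-8de72b5f356c": {
            "Healthy": True,
            "RequiresControllerPlugin": True,
        },
    },
    "ControllersExpected": 1,
    "ControllersHealthy": 1,
    "Nodes": {
        "8e100f4b-6d1b-ca4a-e8d6-b27a65ddac93": {
            "Healthy": True,
            "RequiresControllerPlugin": True,
        },
        "a8efb906-0138-7c55-d156-8de72b5f356c": {
            "Healthy": True,
            "RequiresControllerPlugin": True,
        },
    },
    "NodesExpected": 0,
    "NodesHealthy": 2,
}

if __name__ == "__main__":
    print(plugin_path("nfs"))
    for problem in count_mismatches(SAMPLE):
        print(problem)
```

Run against the trimmed sample, the sketch reports the two oddities visible in the response: NodesExpected is 0 while NodesHealthy is 2, and ControllerRequired is false although the instances set RequiresControllerPlugin.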

@apollo13
Contributor Author

I just realized that simply stopping a single node allocation has the same effect. @tgross did you ever get around to reproducing this? I can reliably trigger this.

@JohnKiller

I am experiencing the same problem with the ceph-csi module. Stopping and starting an allocation several times causes the problem to appear.

@apollo13
Contributor Author

apollo13 commented Nov 4, 2021

@tgross As promised here is a ping ;) Is there any chance that we can work on fixing that? What do you need from me to move this forward?

@tgross
Member

tgross commented Nov 4, 2021

Hi @apollo13! I don't think I need anything else from you, just a little time to get ramped back up and dig into it. Thanks for the ping... I'll assign myself so that I don't lose this.

@tgross tgross self-assigned this Nov 4, 2021
@apollo13
Contributor Author

apollo13 commented Jan 5, 2022

Cross linking to #11758

@tgross
Member

tgross commented Feb 3, 2022

Leaving a note that this issue appears to be related to #11784, but may have subtly different code paths.

Note that I don't think this is actually related to #11758 or #10073 (or at least not by itself). Those issues are primarily about count state on the server, whereas here and in #11784 we have evidence of client-side reconnection issues with the CSI plugins. I'll be looking into that as well as the count state as separate patch sets.

@apollo13
Contributor Author

apollo13 commented Feb 3, 2022

I closely followed your fixes and am not sure yet either (massive kudos for those!). I'll deploy 1.2.5 and see if I can test it manually. In the meantime I will leave this open as a reminder for myself; don't waste time on it unless you have a hunch that this is still unsolved.

@tgross
Member

tgross commented Feb 3, 2022

Oh sorry if I was unclear. This issue is definitely not fixed yet; I was just saying it may have a cause that's independent from the "plugin counts" issue described in #11758 or #10073.

@tgross
Member

tgross commented Feb 8, 2022

I've opened #12027, which should partially address this issue and also #10073. I think the remaining bits are most likely related to #11784.

@tgross
Member

tgross commented Feb 9, 2022

I've closed this issue via #12027 which will ship in 1.3.0 (plus backports) but I'll look into the plugin alloc restart issues as part of #11784.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022
3 participants