
Nomad confused about how many CSI plugins should be running #9371

Closed
tsarna opened this issue Nov 15, 2020 · 13 comments · Fixed by #9438
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/storage type/bug

tsarna commented Nov 15, 2020


Nomad version

Currently Nomad v0.12.7 ('0.12.7'). I think the problem started with 0.11.3, but I'm not certain.

Operating system and Environment details

Alpine 3.12.1, currently 3 x AWS t3a.small and 1 x t4g.small clients, 3 x t4g.micro servers.

t4g.small client: CSI plugin chengpan/aws-efs-csi-driver:latest, (a test aarch64 build of amazon/aws-efs-csi-driver)
t3a.small clients: amazon/aws-efs-csi-driver:v1.0.0

The servers and amd64 clients started at 0.11.3 and were upgraded at some point.

Issue

Let me say at the start that I don't know whether this issue has any practical impact or is just a cosmetic problem with the displayed count, but I'm reporting it in case the information is useful. I did see one bit of flakiness that I'm not sure is related.

I'm doing some testing with the AWS EFS CSI drivers in a mixed-architecture environment. Somewhere along the way, as I added and removed clients, Nomad became confused about how many plugin instances are supposed to be running. At /ui/csi/plugins it says:

aws-efs | Node Only | Healthy (4/7) | efs.csi.aws.com

Similarly, from the CLI:

$ nomad plugin status
Container Storage Interface
ID       Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs  ebs.csi.aws.com  1/1                           4/4
aws-efs  efs.csi.aws.com  0/0                           4/7

One thing I'm doing that may be unusual is that I'm running two separate system jobs for the plugins, since they use different container images, but both register the same plugin ID, because I want the volume to be accessible to jobs running on clients of either architecture. However, the count was already off (3/6) before I added the arm64 node and plugin.

I've had one occasion where Nomad was unable to schedule a job using an EFS volume (it had exhausted its available writer claims, which is odd since the volume had access_mode = "multi-node-multi-writer"). Force-deregistering and re-registering the volume fixed it. I've had other volumes with no issues at all. I don't know whether this is in some way related to the confused bookkeeping about the plugins.
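
For illustration, a volume registration with that access mode looks roughly like the sketch below; the volume ID, name, EFS filesystem ID, and attachment_mode are placeholders/assumptions rather than values taken from this issue:

# volume.hcl -- hypothetical EFS volume registration (IDs are placeholders)
id          = "efs-shared"
name        = "efs-shared"
type        = "csi"
external_id = "fs-0123456789abcdef0"   # placeholder EFS filesystem ID
plugin_id   = "aws-efs"

# The access mode mentioned above; attachment_mode is assumed here.
access_mode     = "multi-node-multi-writer"
attachment_mode = "file-system"

A file like this is registered with nomad volume register volume.hcl.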

Reproduction steps

It's not clear exactly what sequence of events led to the issue. I think the count was off before I upgraded to 0.12.7, but I'm not 100% sure now. I have added and removed several client nodes, and the count always remains 3 greater than the number of clients.

Job file (if appropriate)

The plugin job files are essentially the same as what I submitted in #9366, except that one references v1.0.0 of the official image and constrains ${attr.cpu.arch} to amd64, and the other references chengpan/aws-efs-csi-driver:latest and constrains ${attr.cpu.arch} to arm64.
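
For context, here is a sketch of one of the two per-architecture plugin jobs, assuming the docker task driver and a node-type csi_plugin stanza; the job name, datacenter, mount_dir, and the omitted driver arguments are illustrative rather than copied from #9366:

job "plugin-aws-efs-amd64" {
  datacenters = ["dc1"]   # placeholder
  type        = "system"

  # Only run on amd64 clients; the arm64 job is identical except for this
  # constraint value and the image it references.
  constraint {
    attribute = "${attr.cpu.arch}"
    value     = "amd64"
  }

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-efs-csi-driver:v1.0.0"
        # driver arguments and privileged settings omitted; see #9366
      }

      # Both jobs register the same plugin ID, so volumes with
      # plugin_id = "aws-efs" can be claimed on either architecture.
      csi_plugin {
        id        = "aws-efs"
        type      = "node"
        mount_dir = "/csi"
      }
    }
  }
}

The arm64 variant differs only in the constraint value and the image it pulls; the shared csi_plugin id is what makes Nomad treat both jobs as instances of the same plugin.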

Nomad Client logs (if appropriate)

I found nothing relevant in the logs, but I'm not exactly sure what I'd be looking for either, nor during exactly what timeframe.

Nomad Server logs (if appropriate)

tsarna (Author) commented Nov 15, 2020

Another oddity I've just noticed: the CLI still shows the EBS controller as 1/1, as above, and the UI at /ui/csi/volumes shows EBS volumes as Healthy (1/1) under the controller column.

However, at /ui/csi/plugins it says:

aws-ebs | Node Only | Healthy (4/4) | ebs.csi.aws.com

(Also it's a little weird that for EFS under the volumes screen it shows controller health as Unhealthy (0/0). There is no controller needed for EFS so having 0 running isn't unhealthy)

tsarna (Author) commented Nov 15, 2020

Forgot to mention, I tried nomad system gc and it didn't help with the counts.

apollo13 (Contributor):

> (Also it's a little weird that for EFS under the volumes screen it shows controller health as Unhealthy (0/0). There is no controller needed for EFS so having 0 running isn't unhealthy)

This is #9252, I think?

Regarding your other issue, could this be #9248?

tgross (Member) commented Nov 16, 2020

Hi @tsarna! We fixed a few bugs around plugin counts in 0.12.2-0.12.4 (also ref #8948), so the challenge here will be figuring out whether this is a problem caused by upgrading from a buggy version or whether we haven't really fixed the plugin count issue.

> One thing I'm doing that may be unusual is that I'm running two separate system jobs for the plugins, since they use different container images, but both register the same plugin ID, because I want the volume to be accessible to jobs running on clients of either architecture. However, the count was already off (3/6) before I added the arm64 node and plugin.

That definitely should work, and as you note, if the count was wrong before you deployed the second system job, that would appear to rule it out. But thank you for that context.

> I have added and removed several client nodes, and the count always remains 3 greater than the number of clients.

This is what makes me think we might have a bug where a plugin that somehow doesn't get deregistered properly isn't getting its count decremented as we expect. It's been a little bit since I last looked at it, but my suspicion is we're relying on increment/decrement operations rather than deriving the expected count from the job specification(s). Would you be willing to run a nomad operator debug so that we can see what's in raft for the plugins?

> (Also it's a little weird that for EFS under the volumes screen it shows controller health as Unhealthy (0/0). There is no controller needed for EFS so having 0 running isn't unhealthy)

> This is #9252, I think?

Definitely.

tgross (Member) commented Nov 23, 2020

Another thing that could help, in addition to that raft data, is the output of nomad plugin status aws-ebs. That'll give us a list of the allocations Nomad thinks belong to the plugin, which should give us a better idea of where the count is getting skewed.

tgross (Member) commented Nov 23, 2020

I think I've found where the issue is. While trying to replicate the circumstances in a test, I found that concurrently updating allocations could trigger a bug in the way we've set up the transactions around plugins. A more minimal test case that can be dropped into nomad/state_store_test.go is as follows:

func TestStateStore_Nested(t *testing.T) {
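	// Demonstrates the bug: CSIPluginByID opens its own top-level transaction
	// instead of reading inside the already-open write txn, so each increment
	// below re-reads the committed value and the final NodesExpected ends up
	// as 1 rather than 3; the assertion at the end fails unless the
	// commented-out in-txn lookup is used instead.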
	s := testStateStore(t)

	// new plugin
	index := uint64(0)
	plugin := structs.NewCSIPlugin("foo", index)
	err := s.UpsertCSIPlugin(index, plugin)
	require.NoError(t, err)

	txn := s.db.WriteTxn(index)
	defer txn.Abort()

	inc := func(index uint64) {

		// This doesn't work!
		plugin, _ := s.CSIPluginByID(nil, "foo")
		plugin = plugin.Copy()

		// This does!
		// raw, _ := txn.First("csi_plugins", "id_prefix", "foo")
		// plugin := raw.(*structs.CSIPlugin)

		plugin.NodesExpected += 1
		err := txn.Insert("csi_plugins", plugin)
		require.NoError(t, err)
	}
	index++
	inc(index)
	index++
	inc(index)
	index++
	inc(index)
	txn.Commit()

	plugin, _ = s.CSIPluginByID(nil, "foo")
	require.Equal(t, 3, plugin.NodesExpected)
}

We have a helper method CSIPluginByID that should not be used inside another transaction, because it creates its own top-level transaction. This means that concurrent updates of allocations (such as those created by stopping a job) can interleave the updates. The commented-out bit of code above can replace that method call and makes the test pass. I have fairly strong suspicions this is also behind #8948 and #9248 (and maybe a few other stray bugs, as we have the exact same construct for CSI volumes).

The PR is going to have a lot of small fiddly bits to check, but I'm going to try to land it before the long holiday weekend starts here in the US.

tsarna (Author) commented Nov 23, 2020

Well, I didn't have a chance to look at this for several days. When I got back to it just now, I found that one of my client nodes had died, and the count for both the EBS and EFS plugins on the remaining nodes was correct at 3/3.

tgross (Member) commented Nov 23, 2020

Ok, that also tells me that an update to the plugin seems to correct the counts, so I think we need to make sure we do a reconciliation of those during the periodic system GC as well. (If for no other reason than to clean up pre-1.0 plugins that will have hit the bug I've described above.)

apollo13 (Contributor):

@tgross As for #9248: I was trying this with no other cluster activity running. I honestly do not think that there was another concurrent update. The interesting thing is that it should be easily reproducible: #9248 (comment) -- did you see different results there?

tgross (Member) commented Nov 24, 2020

From your comment:

> Yes, simply running nomad job run (with the job from #9259 (comment)) and then stopping it will show the allocations as "running" as shown in the comment above.

There were multiple allocations running and then you stopped the job, right? That will make at least one update per node (when each Nomad client updates the server that it's stopping the allocation), and there's a very good chance those land concurrently. I haven't been able to reproduce without having multiple node plugin instances running.

apollo13 (Contributor):

Ah, sorry, the way I read your comment I assumed you meant concurrent updates to other jobs. Yes, I have a node-only plugin running as a system job across all three clients in the cluster. Sorry for the confusion.

tgross added the stage/accepted label Nov 24, 2020
tgross added this to the 1.0 milestone Nov 24, 2020
tgross (Member) commented Nov 25, 2020

Closed by #9438, which will ship in Nomad 1.0.

github-actions (bot):

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 28, 2022