
Nomad confused about how many CSI plugins should be running #9371

Closed
tsarna opened this issue Nov 15, 2020 · 13 comments · Fixed by #9438
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/storage type/bug

tsarna commented Nov 15, 2020


Nomad version

Currently Nomad v0.12.7 ('0.12.7'). I think the problem started with 0.11.3, but I'm not certain.

Operating system and Environment details

Alpine 3.12.1, currently 3 x AWS t3a.small and 1 x t4g.small clients, 3 x t4g.micro servers.

t4g.small client: CSI plugin chengpan/aws-efs-csi-driver:latest, (a test aarch64 build of amazon/aws-efs-csi-driver)
t3a.small clients: amazon/aws-efs-csi-driver:v1.0.0

The servers and amd64 clients started at 0.11.3 and were upgraded at some point.

Issue

Let me say at the start that I don't know whether this issue has any practical impact or is just a cosmetic problem with the displayed count, but I'm reporting it in case the information is useful. I did see one bit of flakiness that I'm not sure is related.

I'm doing some testing with the AWS EFS CSI drivers in a mixed-architecture environment. Somewhere along the way, as I added and removed clients, Nomad became confused about how many plugin instances are supposed to be running. At /ui/csi/plugins it says:

aws-efs | Node Only | Healthy (4/7) | efs.csi.aws.com

Similarly, from the CLI:

$ nomad plugin status
Container Storage Interface
ID       Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs  ebs.csi.aws.com  1/1                           4/4
aws-efs  efs.csi.aws.com  0/0                           4/7

One thing I'm doing that may be unusual is that I'm running two separate system jobs for the plugins, since they use different container images, but both register the same plugin ID, because I want the volume to be accessible to jobs running on clients of either architecture. However, the count was already off (3/6) before I added the arm64 node and plugin.

I've had one occasion where Nomad was unable to schedule a job using an EFS volume (it had exhausted its available writer claims, which is odd since the volume had access_mode = "multi-node-multi-writer"). Force-deregistering and re-registering the volume fixed it. I've had other volumes with no issues at all. I don't know whether this is in some way related to the confused bookkeeping about the plugins.
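
For illustration, a volume registration with that access mode looks roughly like the sketch below; the volume ID, name, EFS filesystem ID, and attachment_mode are placeholders/assumptions rather than values taken from this issue:

# volume.hcl -- hypothetical EFS volume registration (IDs are placeholders)
id          = "efs-shared"
name        = "efs-shared"
type        = "csi"
external_id = "fs-0123456789abcdef0"   # placeholder EFS filesystem ID
plugin_id   = "aws-efs"

# The access mode mentioned above; attachment_mode is assumed here.
access_mode     = "multi-node-multi-writer"
attachment_mode = "file-system"

A file like this is registered with nomad volume register volume.hcl.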

Reproduction steps

It's not clear exactly what sequence of events led to the issue. I think the count was off before I upgraded to 0.12.7, but I'm not 100% sure now. I have added and removed several client nodes, and the count always remains 3 greater than the number of clients.

Job file (if appropriate)

The plugin job files are essentially the same as what I submitted in #9366, except that one references v1.0.0 of the official image and constrains ${attr.cpu.arch} to amd64, and the other references chengpan/aws-efs-csi-driver:latest and constrains ${attr.cpu.arch} to arm64.
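
For context, here is a sketch of one of the two per-architecture plugin jobs, assuming the docker task driver and a node-type csi_plugin stanza; the job name, datacenter, mount_dir, and the omitted driver arguments are illustrative rather than copied from #9366:

job "plugin-aws-efs-amd64" {
  datacenters = ["dc1"]   # placeholder
  type        = "system"

  # Only run on amd64 clients; the arm64 job is identical except for this
  # constraint value and the image it references.
  constraint {
    attribute = "${attr.cpu.arch}"
    value     = "amd64"
  }

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "amazon/aws-efs-csi-driver:v1.0.0"
        # driver arguments and privileged settings omitted; see #9366
      }

      # Both jobs register the same plugin ID, so volumes with
      # plugin_id = "aws-efs" can be claimed on either architecture.
      csi_plugin {
        id        = "aws-efs"
        type      = "node"
        mount_dir = "/csi"
      }
    }
  }
}

The arm64 variant differs only in the constraint value and the image it pulls; the shared csi_plugin id is what makes Nomad treat both jobs as instances of the same plugin.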

Nomad Client logs (if appropriate)

I found nothing relevant in the logs, but I'm not exactly sure what I'd be looking for either, nor during exactly what timeframe.

Nomad Server logs (if appropriate)

tsarna (Author) commented Nov 15, 2020

Another oddity I've just noticed: the CLI still shows the EBS controller as 1/1, as above, and the UI at /ui/csi/volumes shows EBS volumes as Healthy (1/1) under the controller column.

However, at /ui/csi/plugins it says:

aws-ebs | Node Only | Healthy (4/4) | ebs.csi.aws.com

(Also it's a little weird that for EFS under the volumes screen it shows controller health as Unhealthy (0/0). There is no controller needed for EFS so having 0 running isn't unhealthy)

tsarna (Author) commented Nov 15, 2020

Forgot to mention, I tried nomad system gc and it didn't help with the counts.

apollo13 (Contributor):

> (Also it's a little weird that for EFS under the volumes screen it shows controller health as Unhealthy (0/0). There is no controller needed for EFS so having 0 running isn't unhealthy)

This is #9252, I think?

Regarding your other issue, could this be #9248?

tgross (Member) commented Nov 16, 2020

Hi @tsarna! We fixed a few bugs around plugin counts in 0.12.2-0.12.4 (also ref #8948), so the challenge here will be figuring out whether this is a problem caused by upgrading from a buggy version or whether we haven't really fixed the plugin count issue.

> One thing I'm doing that may be unusual is that I'm running two separate system jobs for the plugins, since they use different container images, but both register the same plugin ID, because I want the volume to be accessible to jobs running on clients of either architecture. However, the count was already off (3/6) before I added the arm64 node and plugin.

That definitely should work, and as you note, if the count was wrong before you deployed the second system job, that would appear to rule it out. But thank you for that context.

> I have added and removed several client nodes, and the count always remains 3 greater than the number of clients.

This is what makes me think we might have a bug where a plugin that somehow doesn't get deregistered properly isn't getting its count decremented as we expect. It's been a little bit since I last looked at it, but my suspicion is we're relying on increment/decrement operations rather than deriving the expected count from the job specification(s). Would you be willing to run a nomad operator debug so that we can see what's in raft for the plugins?

> (Also it's a little weird that for EFS under the volumes screen it shows controller health as Unhealthy (0/0). There is no controller needed for EFS so having 0 running isn't unhealthy)

> This is #9252, I think?

Definitely.

tgross (Member) commented Nov 23, 2020

Another thing that could help, in addition to that raft data, is the output of nomad plugin status aws-ebs. That'll give us a list of the allocations Nomad thinks belong to the plugin, which should give us a better idea of where the count is getting skewed.

tgross (Member) commented Nov 23, 2020

I think I've found where the issue is. While trying to replicate the circumstances in a test, I found that concurrently updating allocations could trigger a bug in the way we've set up the transactions around plugins. A more minimal test case that can be dropped into nomad/state_store_test.go is as follows:

func TestStateStore_Nested(t *testing.T) {
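	// Demonstrates the bug: CSIPluginByID opens its own top-level transaction
	// instead of reading inside the already-open write txn, so each increment
	// below re-reads the committed value and the final NodesExpected ends up
	// as 1 rather than 3; the assertion at the end fails unless the
	// commented-out in-txn lookup is used instead.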
	s := testStateStore(t)

	// new plugin
	index := uint64(0)
	plugin := structs.NewCSIPlugin("foo", index)
	err := s.UpsertCSIPlugin(index, plugin)
	require.NoError(t, err)

	txn := s.db.WriteTxn(index)
	defer txn.Abort()

	inc := func(index uint64) {

		// This doesn't work!
		plugin, _ := s.CSIPluginByID(nil, "foo")
		plugin = plugin.Copy()

		// This does!
		// raw, _ := txn.First("csi_plugins", "id_prefix", "foo")
		// plugin := raw.(*structs.CSIPlugin)

		plugin.NodesExpected += 1
		err := txn.Insert("csi_plugins", plugin)
		require.NoError(t, err)
	}
	index++
	inc(index)
	index++
	inc(index)
	index++
	inc(index)
	txn.Commit()

	plugin, _ = s.CSIPluginByID(nil, "foo")
	require.Equal(t, 3, plugin.NodesExpected)
}

We have a helper method CSIPluginByID that should not be used inside another transaction, because it creates its own top-level transaction. This means that concurrent updates of allocations (such as those created by stopping a job) can interleave the updates. The commented-out bit of code above can replace that method call and makes the test pass. I have fairly strong suspicions this is also behind #8948 and #9248 (and maybe a few other stray bugs, as we have the exact same construct for CSI volumes).

The PR is going to have a lot of small fiddly bits to check, but I'm going to try to land it before the long holiday weekend starts here in the US.

tsarna (Author) commented Nov 23, 2020

Well, I didn't have a chance to look at this for several days. When I got back to it just now, I found that one of my client nodes had died, and the count for both the EBS and EFS plugins on the remaining nodes was correct at 3/3.

tgross (Member) commented Nov 23, 2020

Ok, that also tells me that an update to the plugin seems to correct the counts, so I think we need to make sure we do a reconciliation of those during the periodic system GC as well. (If for no other reason than to clean up pre-1.0 plugins that will have hit the bug I've described above.)

apollo13 (Contributor):

@tgross As for #9248: I was trying this with no other cluster activity running. I honestly do not think that there was another concurrent update. The interesting thing is that it should be easily reproducible: #9248 (comment) -- did you see different results there?

tgross (Member) commented Nov 24, 2020

From your comment:

> Yes, simply running nomad job run (with the job from #9259 (comment)) and then stopping it will show the allocations as "running" as shown in the comment above.

There were multiple allocations running and then you stopped the job, right? That will make at least one update per node (when each Nomad client updates the server that it's stopping the allocation), and there's a very good chance those land concurrently. I haven't been able to reproduce without having multiple node plugin instances running.

apollo13 (Contributor):

Ah, sorry, the way I read your comment I assumed you meant concurrent updates to other jobs. Yes, I have a node-only plugin running as a system job across all three clients in the cluster. Sorry for the confusion.

tgross added the stage/accepted label Nov 24, 2020
tgross added this to the 1.0 milestone Nov 24, 2020
tgross (Member) commented Nov 25, 2020

Closed by #9438, which will ship in Nomad 1.0.

github-actions (bot):

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 28, 2022