Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: allow for concurrent plugin allocations #12078

Merged
merged 6 commits into from
Feb 23, 2022
Merged

Conversation

tgross
Copy link
Member

@tgross tgross commented Feb 16, 2022

Fixes #8628 #11784

The dynamic plugin registry assumes that plugins are singletons, which matches the behavior of other Nomad plugins. But because dynamic plugins like CSI are implemented by allocations, we need to handle the possibility of multiple allocations for a given plugin type + ID, as well as behaviors around interleaved allocation starts and stops.

This is a fairly huge PR but none of it ships safely without the rest, so I've cut it up into hopefully-reviewable chunks.

TODO:

  • Failing tests for concurrency bugs in plugin registration. Done in 0a46a31
  • Update the plugin registry to support a linked list of plugins such that the most recent live alloc ID is what gets dispensed. Done in 1b8eb1e
    • On registry updates, swap out instance managers so that we're always running a single instance manager for the "active" plugin.
    • On plugin restore, do not reorder the linked list.
    • Update the plugin registry to sync to a new client state table with the linked list data structure.
  • Update the plugin supervisor to change the location of the csi.sock to somewhere owned by the allocation. Done in 1f095de
  • Update the client state store to safely handle updating from the previous data structure to the new one. Done in 0cfe94e
  • Changelog entry and upgrade warning docs. Done in 58cccb0

The dynamic plugin registry assumes that plugins are singletons, which
matches the behavior of other Nomad plugins. But because dynamic
plugins like CSI are implemented by allocations, we need to handle the
possibility of multiple allocations for a given plugin type + ID, as
well as behaviors around interleaved allocation starts and stops.
Comment on lines +246 to +247
require.NotNil(t, result)
require.NoError(t, err)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgfa29 I think I'm getting a false positive on the semgrep rule here: https://github.com/hashicorp/nomad/runs/5222371451?check_suite_focus=true Arguably these assertions aren't all that useful, so I wouldn't be broken-hearted if I just had to remove them. But any suggestions for a fix?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh interesting, I haven't thought of this pattern before. Feel free to leave the code as-is and ignore the check for now. I will try to adjust the rules to match this.

The dynamic plugin registry assumes that plugins are singletons, which
matches the behavior of other Nomad plugins. But because dynamic
plugins like CSI are implemented by allocations, we need to handle the
possibility of multiple allocations for a given plugin type + ID, as
well as behaviors around interleaved allocation starts and stops.

Update the data structure for the dynamic registry so that more recent
allocations take over as the instance manager singleton, but we still
preserve the previous running allocations so that restores work
without racing.
Multiple allocations can run on a client for the same plugin, even if
only during updates. Provide each plugin task a unique path for the
control socket so that the tasks don't interfere with each other.
Comment on lines +157 to +158
// TODO(tgross): add this TaskPath to the CSIPluginConfig as well
TaskPath: "/local/csi",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: this PR is already huge and adding to this field to the jobspec is both a minor change and a lot of code to review. I'm going to do it in a follow-on PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a silly question, but is there a use case where this would even need to be configurable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a very good use case. 😁

Suppose you had a plugin that required a configuration file that you created via template block, we could conceivably collide with the destination for that template. Ideally the user would just use a different destination of course. But obviously this is low-priority enough that it's not going to be a blocker at all for CSI GA.

Comment on lines +170 to +172
// TODO(tgross): https://github.com/hashicorp/nomad/issues/11786
// If we're already registered, we should be able to update the
// definition in the update hook
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: this is a "blocking GA" issue so I'll have a follow-up PR on that as well.

@tgross tgross marked this pull request as ready for review February 18, 2022 15:56
require.NoError(t, db.Close())
})
}

t.Run("testdata/state-1.2.6.db.gz", func(t *testing.T) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: this file got created from the following test code on a 1.2.6 branch:

TestBoltStateDB_WriteTestData
func TestBoltStateDB_WriteTestData(t *testing.T) {
	os.Mkdir("newdata", 0775)
	dbI, err := NewBoltStateDB(testlog.HCLogger(t), "newdata")
	require.NoError(t, err)

	db := dbI.(*BoltStateDB)
	defer db.Close()

	ps := &dynamicplugins.RegistryState{
		Plugins: map[string]map[string]*dynamicplugins.PluginInfo{
			"csi-controller": {
				"plugin-1": {
					Name:    "plugin-1",
					Type:    string(structs.CSIPluginTypeMonolith),
					Version: "v0.1",
					ConnectionInfo: &dynamicplugins.PluginConnectionInfo{
						SocketPath: "/var/nomad/client/csi/plugin-1/csi.sock",
					},
					AllocID: "489cd5cf-3cfc-6792-3465-822371765b32",
					Options: map[string]string{
						"Provider":            "demo",
						"MountPoint":          "/var/nomad/client/csi/plugin-1",
						"ContainerMountPoint": "/csi",
					},
				},
			},
			"csi-node": {
				"plugin-1": {
					Name:    "plugin-1",
					Type:    string(structs.CSIPluginTypeMonolith),
					Version: "v0.1",
					ConnectionInfo: &dynamicplugins.PluginConnectionInfo{
						SocketPath: "/var/nomad/client/csi/plugin-1/csi.sock",
					},
					AllocID: "489cd5cf-3cfc-6792-3465-822371765b32",
					Options: map[string]string{
						"Provider":            "demo",
						"mountpoint":          "/var/nomad/client/csi/plugin-1",
						"ContainerMountPoint": "/csi",
					},
				},
				"plugin-2": {
					Name:    "plugin-2",
					Type:    string(structs.CSIPluginTypeNode),
					Version: "v0.2",
					ConnectionInfo: &dynamicplugins.PluginConnectionInfo{
						SocketPath: "/var/nomad/client/csi/plugin-2/csi.sock",
					},
					AllocID: "68221094-25a6-eeb6-8490-c71a03d46b0f",
					Options: map[string]string{
						"Provider":            "other-demo",
						"MountPoint":          "/var/nomad/client/csi/plugin-2",
						"ContainerMountPoint": "/csi",
					},
				},
			},
		},
	}
	db.PutDynamicPluginRegistryState(ps)

}

@tgross tgross requested a review from jrasell February 23, 2022 15:20
Copy link
Contributor

@lgfa29 lgfa29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments (mostly questions), but LGTM.

Thanks for breaking down the work into commits 🙂

client/allocrunner/taskrunner/plugin_supervisor_hook.go Outdated Show resolved Hide resolved
Comment on lines +157 to +158
// TODO(tgross): add this TaskPath to the CSIPluginConfig as well
TaskPath: "/local/csi",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a silly question, but is there a use case where this would even need to be configurable?

Comment on lines -320 to +366
"ContainerMountPoint": h.task.CSIPluginConfig.MountDir,
"ContainerMountPoint": "/local/csi",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this value change depending on the file system isolation capability of the task driver?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A task driver without filesystem isolation never creates the mount, so it's a no-op. (Note this is an exotic case because the spec requires that plugins are shipped as OCI containers)

website/content/docs/upgrade/upgrade-specific.mdx Outdated Show resolved Hide resolved
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CSI plugin fails to be marked healthy after reboot CSI: plugins with same type and ID collide on client
2 participants