Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: plugins with same type and ID collide on client #8628

Closed
tgross opened this issue Aug 10, 2020 · 7 comments · Fixed by #12078
Closed

CSI: plugins with same type and ID collide on client #8628

tgross opened this issue Aug 10, 2020 · 7 comments · Fixed by #12078
Assignees
Milestone

Comments

@tgross
Copy link
Member

tgross commented Aug 10, 2020

To communicate with CSI plugins, we mount a directory for the plugin at ${nomad_data_dir}/client/csi/${plugin_type}/${plugin_id}/. The plugin is responsible for opening a socket at that path + csi.sock (typically). If a second plugin with the same type and ID lands on the same Nomad client node, then the second allocation can't open the socket.

@tgross
Copy link
Member Author

tgross commented Aug 10, 2020

A workaround for operators until we have this fixed is to either:

  • use a large enough cluster with appropriate constraints so that controllers don't end up on the same node
  • use the ALLOC_INDEX to interpolate the mount_dir field of the jobspec, to avoid collisions. Actually this won't work because the collision is in the Nomad client's directory, which is the same regardless of where that gets mounted into the container.

@tgross
Copy link
Member Author

tgross commented Aug 11, 2020

It looks like this could easily be fixed by adding an alloc index to where the path is defined in the plugin_supervisor_hook.go#L75. At first glance that worries me about backwards compatibility issues through upgraded clients, but when we register a plugin at plugin_supervisor_hook.go#L278 we store the connection info then, and that gets stored in the client-side state store.

We've missed the boat on 0.12.2 for this but it should be straightforward to test this approach.

@tgross tgross self-assigned this Aug 11, 2020
@tgross
Copy link
Member Author

tgross commented Aug 13, 2020

Getting the socket path out of the registry when we instantiate the plugin supervisor hook reveals a related problem: plugins collide in the dynamic plugin registry.

If more than one plugin of the same type and plugin ID lands on the same Nomad client, they both register with the same key in the registry. (see client/dynamicplugins/registry.go#L181-L187). The event broadcaster also assumes that there's only one possible plugin that can be listening on the client.

We could add a feature to the scheduler to make sure that we never schedule the plugin on the same client, but that probably makes plugin upgrades on small clusters difficult. I think we need to rework the client-side registry to store a list of plugins as well. @langmartin do you have thoughts here?

@tgross tgross changed the title CSI: plugins collide over access to csi.sock CSI: plugins with same type and ID collide on client Aug 13, 2020
@tgross
Copy link
Member Author

tgross commented Oct 8, 2020

There's a few complications to a quick fix here:

  • We need to change the dynamic plugin registry's struct that gets stored in the state store, but to avoid a lot of backwards compatibility churn we probably only want to do that once.
  • The general pluginmanger interface also assumes we have one plugin instance per type per name, so we need to change it there too.
  • Although the CSI spec says that plugins should be handling concurrent requests, it's silent on concurrent requests on the same host, and until we've developed some clarity around the expected behavior vs real-world plugins we risk making things worse.
  • We've discovered that some controller plugin developers are using sidecars that perform leader election, which isn't in the CSI spec but is going to complicate decision making around how to do this well. (Ex. should Nomad use an ordered list of controllers so that it always sends RPCs to the same controller?)

So in the meantime I've opened #9052 as a notice to operators.

@tgross tgross removed their assignment Jan 29, 2021
@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
@tgross tgross self-assigned this Feb 3, 2022
@tgross
Copy link
Member Author

tgross commented Feb 16, 2022

Work in progress in #12078

@tgross tgross added this to the 1.3.0 milestone Feb 17, 2022
@tgross
Copy link
Member Author

tgross commented Feb 23, 2022

Should be fixed by #12078, which will ship in Nomad 1.3.0

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant