Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: disambiguate or merge create and register specs #11899

Closed
NomAnor opened this issue Jan 21, 2022 · 6 comments
Closed

CSI: disambiguate or merge create and register specs #11899

NomAnor opened this issue Jan 21, 2022 · 6 comments

Comments

@NomAnor
Copy link
Contributor

NomAnor commented Jan 21, 2022

Nomad version

Nomad v1.2.3

Issue

If I try to register an already existing volume the ValidateVolumeCapabilities call is using the nomad volume id instead of the volume_id.

As far as I understand it, the ValidateVolumeCapabilities RPC should be used by the CO to verify that the volume has the expected capabilities after it is created. To do so the CO sends the volume_id. This is the internal (to the CSI plugin) id returned from the CreateVolume call. Nomad instead calls it with the volume id that is internal only to nomad (specified in the volume file).

The different ids are a bit confusing. There are three "ids" involved when using CSI volumes in nomad:

  • Volume name: Just a name, the CSI plugin can use it when creating the Volume, but it doesn't need to. Set by the user in the volume definition. But the plugin must use it to gurantee idempotency.
  • Volume id: The internal id in nomad to distinguish the volumes. Set by the user in the volume definition.
  • volume_id: Internal id used by the plugin to distinguish the volumes. Set by the plugin on volume creation. Opaque string to nomad. The user can't control this.

In my case volume_id is the uuid of the lvm volumes and my plugin only uses the name to provide indempotency (currently by using it as the lv name, but I might change that to tags).

Even when registering an already existing volume, nomad should call CreateVolume because that seems to be the only way to get a volume_id for a volume name from the CSI plugin. The call is indempotent based on the name so no new volume would be created.

The lack of a similar issues seems to suggest that other users use the storage provider id as the nomad volume id. If that is intended the documentation should clearly state so. But this is incompatible with volume creation because the storage provider id can't be known before (it is required in the volume definition).

Volume definition

id = "registry-data"
name = "registry_data"
type = "csi"
plugin_id = "lvm"
capacity_max = "256M"
capacity_min = "256M"

capability {
  access_mode = "single-node-reader-only"
  attachment_mode = "file-system"
}

capability {
  access_mode = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type = "ext4"
  mount_flags = ["noatime"]
}

Plugin logs

This is the request when registering the already existing volume:

DEBUG:lvm_csi.controller:Handling ValidateVolumeCapabilities
volume_id: "registry-data"
volume_capabilities {
  mount {
  }
  access_mode {
    mode: SINGLE_NODE_READER_ONLY
  }
}
volume_capabilities {
  mount {
  }
  access_mode {
    mode: SINGLE_NODE_WRITER
  }
}

This is the request when creating the already existing volume with the same file:

DEBUG:lvm_csi.controller:Handling CreateVolume
name: "registry_data"
capacity_range {
  required_bytes: 256000000
  limit_bytes: 256000000
}
volume_capabilities {
  mount {
    fs_type: "ext4"
    mount_flags: "noatime"
  }
  access_mode {
    mode: SINGLE_NODE_READER_ONLY
  }
}
volume_capabilities {
  mount {
    fs_type: "ext4"
    mount_flags: "noatime"
  }
  access_mode {
    mode: SINGLE_NODE_WRITER
  }
}
accessibility_requirements {
}

The cli then show the volume_id

Created external volume hcc34w-Ayoi-iotp-eLVV-EqZI-Imep-CBbZ3f with ID registry-data

You can then see all three ids in the ui:
volume

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jan 24, 2022
@tgross tgross self-assigned this Jan 24, 2022
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jan 24, 2022
@tgross
Copy link
Member

tgross commented Jan 24, 2022

Hi @NomAnor!

For clarity, let's use the following nomenclature, taken from Nomad's struct definition (ref csi.go#L242-L249):

  • The "Nomad ID" (the ID field in the struct) is the namespace-unique ID used by Nomad's state store.
  • The "Display Name" (the Name field in the struct) is the display name.
  • The "External ID" is the ID determined by the storage provider (the LVM UUID in your case).

The CSI RPC we're talking about is the ValidateVolumeCapabilities RPC. This has a field volume_id which corresponds to the External ID, as the CSI plugin of course has no idea about the Nomad ID or Display Name.

When the Nomad servers construct the client RPC request for validation (ref csi_endpoint.go#L254), it uses the RemoteID() method which you'll note does return the Nomad ID if the External ID is unset. If we called this while creating a volume, we'd run into a problem. However, the CreateVolume workflow doesn't call that validation at all, and instead checks that the plugin supports the required capabilities.

Even when registering an already existing volume, nomad should call CreateVolume because that seems to be the only way to get a volume_id for a volume name from the CSI plugin. The call is indempotent based on the name so no new volume would be created.

The expectation is that the user provides the external_id field in the spec provided in the nomad volume register command. Note the difference between the unused fields for create vs register.

The reason for the split in behavior is unfortunately historical. Originally Nomad didn't support a CreateVolume workflow at all, and only supported registering existing volumes. Now that we've implemented CreateVolume I have to admit that nomad volume register feels like an unfortunate leftover, and we should probably deprecate it. Having it call the CreateVolume RPC and ensuring that the call is idempotent seems like a reasonable first step along that deprecation path. I can't see that landing before 1.4.0 though.

The lack of a similar issues seems to suggest that other users use the storage provider id as the nomad volume id. If that is intended the documentation should clearly state so.

No doubt there's some issue here in the documentation, but it's not that folks are using the External ID as Nomad ID. The AWS EBS and AWS EFS plugins that we test in our nightly E2E test have different External ID and Nomad ID (ex. AWS's ID vol-abcdef123 vs test-volume0). The ./demo/csi folder has a handful of examples that use UUIDs or similar for their External IDs.

But this is incompatible with volume creation because the storage provider id can't be known before (it is required in the volume definition).

It's not required in the spec that nomad volume create expects.

When we use the nomad volume create command, there is a set of unused fields which includes external_id. The external_id field is used by the user in the nomad volume register workflow because in that case the volume already exists from the perspective of the storage provider and we just need to make Nomad aware of it.

That being said, it's very unfortunate that the nomad volume create and nomad volume register specs aren't "round-trippable" with the Read Volume API. This is something I'd definitely like to fix and I think deprecating the separate volume register workflow feels like the first step to do so? As a plugin author how does that sound to you?

@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Jan 24, 2022
@NomAnor
Copy link
Contributor Author

NomAnor commented Jan 24, 2022

I complete missed the external_id field. With your explanation the behaviour makes total sense.

I seems that nomad volume create can replace nomad volume register if a plugin supports volume creation. In that case only the volume name must match with the storage provider. When no capabilities or sizes are set in the volume definition (all optional) the result of the VolumeCreate call will contain all the information from the existing volume for nomad to register it internally. And if capabilities or sizes don't match you get an error.

If a plugin does not support volume creation nomad must call ValidateVolumeCapabilities to check if the volume is correctly configured and then it can register it internally. To make it easier for the user nomad can also use ListVolumes (if the plugin supports it) to retrieve the volume context so the user only has to supply external_id.

Having only a single command to create a volume in nomad, wether a new one or an existing one, seems for me to be a good solution. The user then has to either supply name or external_id (+ volume_context) and everything else would be the same.

Maybe if you want to deprecate the volume register command, you can update the documentation to recommend volume create as a more future-proof command if the plugin supports it? And then you can merge the register workflow into the create command and deprecate it.

Reporting a warning or an error when external_id is not set when calling volume register would make it more obvious when someone didn't read the documentation carefully enough. As I just demonstrated, sorry.

@tgross
Copy link
Member

tgross commented Jan 26, 2022

It occurs to me that if we merge register and create, that leaves us with unregister and delete, which are a bit harder to merge. With unregister we don't delete the underlying volume, just remove it from Nomad's state store.

@tgross tgross changed the title ValidateVolumeCapabilities RPC using the wrong volume id CSI: disambiguate or merge create and register specs Jan 31, 2022
@tgross
Copy link
Member

tgross commented Jan 31, 2022

I've retitled this issue to more accurately reflect what I want to do with this one. I'm working on a big stretch of CSI work including a lot of polishing, so it seems like this would be good to resolve within that block of work.

@tgross
Copy link
Member

tgross commented Mar 18, 2022

I'm going to mark this as closed by #12167, which did not merge the create and register spec, but tightened up the behavior to be much less surprising. We'll continue the work to handle safe re-registration in #10324 and other future feature expansions. Thanks again for opening this issue @NomAnor, it really helped clarify some of the work to be done here.

@tgross tgross closed this as completed Mar 18, 2022
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants