
Registering volume gives error "Unknown volume attachment mode" #10626

Closed
henriots opened this issue May 19, 2021 · 28 comments · Fixed by #10703

Comments

@henriots

henriots commented May 19, 2021

Hello!

After upgrading to v1.1.0, volume registration gives an error:
Error registering volume: Unexpected response code: 500 (rpc error: controller validate volume: Unknown volume attachment mode: )

It worked on v1.0.4 with the following set in the volume configuration:

access_mode = "single-node-writer"
attachment_mode = "file-system"

Nomad version

Nomad v1.1.0 (2678c36)

Operating system and Environment details

Issue

Reproduction steps

Expected Result

Actual Result

Job file (if appropriate)

type = "csi"
id = "registry-nomad"
name = "registry-nomad"
external_id = "1234567"

capability {
  attachment_mode = "file-system"
  access_mode = "single-node-writer"
}

plugin_id = "csi.hetzner.cloud"
@optiz0r
Contributor

optiz0r commented May 19, 2021

Seeing the same here. I'm pretty sure it was working in 1.1.0-rc1, since I had previously amended my volume spec to include the newly required capability block, but after upgrading to 1.1.0 and finding a CSI allocation stuck, I deregistered it, and am now unable to re-register the volume with the same volume spec which had previously worked.

@tgross
Member

tgross commented May 20, 2021

Hi @henriots! Can you grab the allocation logs for the plugin? That might help diagnose the problem.

@tgross
Member

tgross commented May 20, 2021

Seeing the same here. I'm pretty sure it was working in 1.1.0-rc1, since I had previously amended my volume spec to include the newly required capability block, but after upgrading to 1.1.0 and finding a CSI allocation stuck, I deregistered it, and am now unable to re-register the volume with the same volume spec which had previously worked.

@optiz0r can you provide the error message you're seeing (plus alloc logs for the plugin, if available)?

@tgross tgross self-assigned this May 20, 2021
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation May 20, 2021
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage May 20, 2021
@optiz0r
Contributor

optiz0r commented May 20, 2021

Volume registration command failing:

$ nomad volume register vol-acme.hcl
Error registering volume: Unexpected response code: 500 (controller validate volume: rpc error: controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: )

CSI controller alloc logs (controller had just been restarted prior to running the above command):

{"service":"democratic-csi","level":"info","message":"initializing csi driver: zfs-generic-nfs"}
{"message":"setting default identity service caps","level":"debug","service":"democratic-csi"}
{"message":"setting default identity volume_expansion caps","level":"debug","service":"democratic-csi"}
{"message":"setting default controller caps","level":"debug","service":"democratic-csi"}
{"message":"setting default node caps","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"starting csi server - name: org.democratic-csi.nfs, version: 1.2.0, driver: zfs-generic-nfs, mode: controller, csi version: 1.2.0, address: 0.0.0.0:9000, socket: unix:///csi-data/csi.sock"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: GetPluginInfo call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: GetPluginInfo response: {\"name\":\"org.democratic-csi.nfs\",\"vendor_version\":\"1.2.0\"}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: GetPluginCapabilities call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: GetPluginCapabilities response: {\"capabilities\":[{\"service\":{\"type\":\"CONTROLLER_SERVICE\"}},{\"volume_expansion\":{\"type\":\"ONLINE\"}}]}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: ControllerGetCapabilities call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: ControllerGetCapabilities response: {\"capabilities\":[{\"rpc\":{\"type\":\"CREATE_DELETE_VOLUME\"}},{\"rpc\":{\"type\":\"LIST_VOLUMES\"}},{\"rpc\":{\"type\":\"GET_CAPACITY\"}},{\"rpc\":{\"type\":\"CREATE_DELETE_SNAPSHOT\"}},{\"rpc\":{\"type\":\"LIST_SNAPSHOTS\"}},{\"rpc\":{\"type\":\"CLONE_VOLUME\"}},{\"rpc\":{\"type\":\"EXPAND_VOLUME\"}}]}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: ControllerGetCapabilities call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: ControllerGetCapabilities response: {\"capabilities\":[{\"rpc\":{\"type\":\"CREATE_DELETE_VOLUME\"}},{\"rpc\":{\"type\":\"LIST_VOLUMES\"}},{\"rpc\":{\"type\":\"GET_CAPACITY\"}},{\"rpc\":{\"type\":\"CREATE_DELETE_SNAPSHOT\"}},{\"rpc\":{\"type\":\"LIST_SNAPSHOTS\"}},{\"rpc\":{\"type\":\"CLONE_VOLUME\"}},{\"rpc\":{\"type\":\"EXPAND_VOLUME\"}}]}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: ControllerGetCapabilities call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: ControllerGetCapabilities response: {\"capabilities\":[{\"rpc\":{\"type\":\"CREATE_DELETE_VOLUME\"}},{\"rpc\":{\"type\":\"LIST_VOLUMES\"}},{\"rpc\":{\"type\":\"GET_CAPACITY\"}},{\"rpc\":{\"type\":\"CREATE_DELETE_SNAPSHOT\"}},{\"rpc\":{\"type\":\"LIST_SNAPSHOTS\"}},{\"rpc\":{\"type\":\"CLONE_VOLUME\"}},{\"rpc\":{\"type\":\"EXPAND_VOLUME\"}}]}"}

No relevant logs in the csi node allocs.

Nomad leader output when the volume register command is run:

2021-05-20T12:42:36.551Z [ERROR] http: request failed: method=PUT path=/v1/volume/csi/traefik-acme error="controller validate volume: rpc error: controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: " code=500
    2021-05-20T12:42:36.551Z [DEBUG] http: request complete: method=PUT path=/v1/volume/csi/traefik-acme duration=15.581266ms

Volume spec:

id = "traefik-acme"
name = "traefik-acme"
type = "csi"
plugin_id = "zfs-nfs"
external_id = ""

capability {
    access_mode = "single-node-writer"
    attachment_mode = "file-system"
}

mount_options {
    fs_type = "nfs"
    mount_flags = ["nolock"]
}

context {
    node_attach_driver = "nfs"
    provisionder_driver = "zfs-generic-nfs"
    server = "nfs-server.example.com"
    share = "/pool/democratic/root/traefik-acme"
}

@henriots
Author

The CSI alloc logs contain only the following, so they're not much help:

level=debug ts=2021-05-20T13:54:15.061513695Z component=grpc-server msg="handling request" req=
level=debug ts=2021-05-20T13:54:15.06163949Z component=grpc-server msg="finished handling request"
level=debug ts=2021-05-20T13:54:15.062113677Z component=grpc-server msg="handling request" req=
level=debug ts=2021-05-20T13:54:15.062190551Z component=grpc-server msg="finished handling request"
level=debug ts=2021-05-20T13:54:15.062231988Z component=grpc-server msg="handling request" req=
level=debug ts=2021-05-20T13:54:15.062312238Z component=grpc-server msg="finished handling request"
level=debug ts=2021-05-20T13:54:15.06230784Z component=grpc-server msg="handling request" req=
level=debug ts=2021-05-20T13:54:15.062354127Z component=grpc-server msg="finished handling request"

Nomad logs
May 20 16:56:10 util nomad[13080]: 2021-05-20T16:56:10.219+0300 [ERROR] http: request failed: method=PUT path=/v1/volume/csi/registry-nomad error="rpc error: controller validate volume: rpc error: controller validate volume: Unknown volume attachment mode: " code=500

@apollo13
Contributor

Same here. I think this happens in Nomad itself, before it even talks to the CSI plugin. It can be reproduced with https://gitlab.com/rocketduck/csi-plugin-nfs/-/tree/main/nomad (the example.volume file probably needs adjusting for 1.1, i.e. the new capability block).

@apollo13
Contributor

I wonder if this is at fault:

AttachmentMode structs.CSIVolumeAttachmentMode
AccessMode structs.CSIVolumeAccessMode

The code still assumes that a single attachment_mode exists, when it should already be reading a list of capabilities, no?
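A minimal sketch of the shape being described here (not Nomad's actual source; field names follow the register API payload posted later in this thread). The point is that the volume struct still carries the legacy single-value mode fields next to the new capability list, and validation reads the former while the 1.1.0 CLI only fills the latter:

// Sketch only, not Nomad's real structs.
type CSIVolumeCapability struct {
	AccessMode     string // e.g. "single-node-writer"
	AttachmentMode string // e.g. "file-system"
}

type CSIVolume struct {
	// Legacy pre-1.1 fields: the 1.1.0 CLI leaves these empty...
	AccessMode     string
	AttachmentMode string
	// ...and fills this list from the capability {} blocks instead,
	// but the register validation still reads the two fields above.
	RequestedCapabilities []*CSIVolumeCapability
}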

@bfqrst

bfqrst commented May 21, 2021

Seeing similar behaviour with the AWS EBS CSI plugin. The CSI plugin logs are clean, but the Nomad client handling the register request shows those 500s...

On subsequent tries (arrow-up and Enter), the error wording varies slightly:
Error registering volume: Unexpected response code: 500 (rpc error: rpc error: controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: )
Error registering volume: Unexpected response code: 500 (rpc error: controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: )
Error registering volume: Unexpected response code: 500 (rpc error: controller validate volume: rpc error: controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: )

Some more context in my case: https://discuss.hashicorp.com/t/unable-to-get-a-csi-volume-registered/18805/3

@bfqrst

bfqrst commented May 21, 2021

Same here. I think this happens in Nomad itself, before it even talks to the CSI plugin. It can be reproduced with https://gitlab.com/rocketduck/csi-plugin-nfs/-/tree/main/nomad (the example.volume file probably needs adjusting for 1.1, i.e. the new capability block).

...which would explain why the plugin logs are clean...

@apollo13
Contributor

Mhm, now that I managed to recreate the volume via nomad volume create it still shows this in the UI:

[screenshot: the Nomad UI showing the volume with an empty Access Mode]

So the access mode is empty and it can't be scheduled, looks like I'll have to downgrade? :/

@bfqrst

bfqrst commented May 21, 2021

Here's another twist in the plot: both the controller and node plugins run on the same Nomad worker.

  1. both are healthy, volume registration throwing 500

Controllers Healthy = 1
Controllers Expected = 1
Nodes Healthy = 1
Nodes Expected = 1

  2. you reboot this particular machine, plugins become unhealthy

Controllers Healthy = 0
Controllers Expected = 1
Nodes Healthy = 0
Nodes Expected = 1

  3. now you try to register your volume --> which works !?!

Container Storage Interface
ID Name Plugin ID Schedulable Access Mode
gitea gitea aws-ebs0 false

Might be unrelated but strange still...

@apollo13
Contributor

apollo13 commented May 21, 2021

@tgross I have prepared an easy reproducer for you. Deploy this job:

job "storage" {
  datacenters = ["dc1"]
  type        = "system"

  group "storage" {
    task "storage" {
      driver = "docker"

      env {
        ROCKETDUCK_CSI_TEST = "true"
      }

      config {
        image = "registry.gitlab.com/rocketduck/csi-plugin-nfs:0.3.0"

        args = [
          "--type=monolith",
          "--node-id=${attr.unique.hostname}",
          "--nfs-server=/mnt",
          "--log-level=DEBUG",
        ]

        network_mode = "host"

        privileged = true
      }

      csi_plugin {
        id        = "nfs"
        type      = "monolith"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

    }
  }
}

This job runs a monolith version of my plugin in test mode; it has no dependency on an external NFS server or anything. Then try registering this volume spec:

id = "dav"
name = "dav"
type = "csi"
external_id = "dav"
plugin_id = "nfs"

capacity_min = "100MB"
capacity_max = "1GB"

capability {
  access_mode     = "multi-node-multi-writer"
  attachment_mode = "file-system"
}
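Assuming the spec above is saved as dav.volume.hcl (file name assumed), registering it reproduces the 500 error shown earlier in the thread:

nomad volume register dav.volume.hcl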

@apollo13
Contributor

This is what is sent to Nomad via the API:

{
    "Volumes": [
        {
            "ID": "dav",
            "Name": "dav",
            "ExternalID": "dav",
            "Namespace": "",
            "Topologies": null,
            "AccessMode": "",
            "AttachmentMode": "",
            "MountOptions": null,
            "Secrets": null,
            "Parameters": null,
            "Context": null,
            "Capacity": 0,
            "RequestedCapacityMin": 100000000,
            "RequestedCapacityMax": 1000000000,
            "RequestedCapabilities": [
                {
                    "AccessMode": "multi-node-multi-writer",
                    "AttachmentMode": "file-system"
                }
            ],
            "CloneID": "",
            "SnapshotID": "",
            "ReadAllocs": null,
            "WriteAllocs": null,
            "Allocations": null,
            "Schedulable": false,
            "PluginID": "nfs",
            "Provider": "",
            "ProviderVersion": "",
            "ControllerRequired": false,
            "ControllersHealthy": 0,
            "ControllersExpected": 0,
            "NodesHealthy": 0,
            "NodesExpected": 0,
            "ResourceExhausted": "0001-01-01T00:00:00Z",
            "CreateIndex": 0,
            "ModifyIndex": 0
        }
    ],
    "Region": "",
    "Namespace": "",
    "SecretID": ""
}
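For completeness, a sketch of replaying that payload directly against the HTTP API. The endpoint path comes from the server logs earlier in the thread (PUT /v1/volume/csi/<id>); the address and file name are assumptions:

# Sketch only: 127.0.0.1:4646 and payload.json are placeholders.
# Add -H "X-Nomad-Token: ..." if ACLs are enabled.
curl -s -X PUT --data @payload.json \
    http://127.0.0.1:4646/v1/volume/csi/dav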

@apollo13
Contributor

Funny story, after patching my nomad binary locally like this:

diff --git a/command/volume_register_csi.go b/command/volume_register_csi.go
index aa68d6351..c4dcd9290 100644
--- a/command/volume_register_csi.go
+++ b/command/volume_register_csi.go
@@ -91,6 +91,8 @@ func csiDecodeVolume(input *ast.File) (*api.CSIVolume, error) {
                        }
 
                        vol.RequestedCapabilities = append(vol.RequestedCapabilities, cap)
+                       vol.AccessMode = cap.AccessMode
+                       vol.AttachmentMode = cap.AttachmentMode
                }
        }

I was able to properly register the volume. Running status yielded:

ID                   = wiki
Name                 = wiki
External ID          = wiki
Plugin ID            = nfs
Provider             = dev.rocketduck.csi.nfs
Version              = 0.2.0
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = multi-node-multi-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

but when deploying a job against it, it still complained: failed to setup alloc: pre-run hook "csi_hook" failed: unknown volume attachment mode:

and now all of a sudden Access Mode & Attachment mode show nothing:

ID                   = wiki
Name                 = wiki
External ID          = wiki
Plugin ID            = nfs
Provider             = dev.rocketduck.csi.nfs
Version              = 0.2.0
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = <none>
Attachment Mode      = <none>
Mount Options        = <none>
Namespace            = default

@apollo13
Contributor

So there are at least two bugs: register still reads the "old" pre-1.1 fields, which the CLI tooling no longer populates, when it should be reading the capabilities. After that, something in Nomad manages to change and lose the access mode & attachment mode again.

@bfqrst
Copy link

bfqrst commented May 21, 2021

So there are at least two bugs: register still reads the "old" pre-1.1 fields, which the CLI tooling no longer populates, when it should be reading the capabilities. After that, something in Nomad manages to change and lose the access mode & attachment mode again.

@apollo13 Can you reboot your node and check if the plugin comes back healthy? Because that happens in my case with the AWS EBS plugin... Might be an (unrelated) third bug.

@apollo13
Contributor

apollo13 commented May 23, 2021

After that, something in Nomad manages to change and lose the access mode & attachment mode again.

Ok, that one is wrong. This was caused by me not adding access_mode/etc to the volume {} stanza in the job.

EDIT: The main problem here (first and foremost) seems to be that nomad volume register cannot handle the new capability {} blocks correctly and still uses the old, now-empty fields.

@khaledabdelaziz

khaledabdelaziz commented May 23, 2021

I'm having the same exact issue after upgrading to v1.1.0.
I'm not even able to downgrade back to v1.0.4 now; I get the following when I try to switch back to v1.0.4.
Nothing has changed apart from the Nomad binary.

    2021-05-22T08:03:26.572Z [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-05-22T08:03:26.572Z [INFO]  nomad.raft: entering candidate state: node="Node at xx.xx.xx.xx:4647 [Candidate]" term=40
    2021-05-22T08:03:26.581Z [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter xx.xx.xx.xx:4647 xx.xx.xx.xx:4647}" error="dial tcp xx.xx.xx.xx:4647: connect: connection refused"
    2021-05-22T08:03:26.581Z [INFO]  nomad.raft: election won: tally=2
    2021-05-22T08:03:26.581Z [INFO]  nomad.raft: entering leader state: leader="Node at xx.xx.xx.xx:4647 [Leader]"
    2021-05-22T08:03:26.581Z [INFO]  nomad.raft: added peer, starting replication: peer=xx.xx.xx.xx:4647
    2021-05-22T08:03:26.581Z [INFO]  nomad.raft: added peer, starting replication: peer=xx.xx.xx.xx:4647
    2021-05-22T08:03:26.581Z [INFO]  nomad: cluster leadership acquired
    2021-05-22T08:03:26.582Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter xx.xx.xx.xx:4647 xx.xx.xx.xx:4647}" error="dial tcp xx.xx.xx.xx:4647: connect: connection refused"
    2021-05-22T08:03:26.582Z [INFO]  nomad.raft: pipelining replication: peer="{Voter xx.xx.xx.xx:4647 xx.xx.xx.xx:4647}"
panic: failed to apply request: []byte{0x2e, 0x84, 0xa6, 0x52, 0x65, 0x67, 0x69, 0x6f, 0x6e, 0xa6, 0x67, 0x6c, 0x6f, 0x62, 0x61, 0x6c, 0xa9, 0x4e, 0x61, 0x6d, 0x65, 0x73, 0x70, 0x61, 0x63, 0x65, 0xa0, 0xa9, 0x41, 0x75, 0x74, 0x68, 0x54, 0x6f, 0x6b, 0x65, 0x6e, 0xda, 0x0, 0x24, 0x34, 0x65, 0x33, 0x66, 0x62, 0x62, 0x32, 0x63, 0x2d, 0x64, 0x32, 0x65, 0x63, 0x2d, 0x65, 0x37, 0x61, 0x39, 0x2d, 0x65, 0x65, 0x37, 0x37, 0x2d, 0x35, 0x61, 0x36, 0x63, 0x65, 0x62, 0x30, 0x64, 0x33, 0x34, 0x62, 0x63, 0xa9, 0x46, 0x6f, 0x72, 0x77, 0x61, 0x72, 0x64, 0x65, 0x64, 0xc3}

goroutine 89 [running]:
github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0xc00024ce70, 0xc000dce7d0, 0x52c4d20, 0xc0224c93a3359aa8)
        github.com/hashicorp/nomad/nomad/fsm.go:315 +0x1b17
github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc0002cc760)
        github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:90 +0x2c2
github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc000554600, 0x40, 0x40)
        github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:113 +0x75
github.com/hashicorp/raft.(*Raft).runFSM(0xc000511b00)
        github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:219 +0x3c4
github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc000511b00, 0xc000730960)
        github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/state.go:146 +0x55
created by github.com/hashicorp/raft.(*raftState).goFunc
        github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/state.go:144 +0x66
WARNING: keyring exists but -encrypt given, using keyring

@apollo13
Contributor

FWIW I have the following workaround: Instead of registering the volume I just ran "nomad volume create" -- for any proper CSI driver this will work because the CreateVolume RPC call must be idempotent (see https://github.com/container-storage-interface/spec/blob/master/spec.md#controller-service-rpc). If that doesn't work, you can get around this by applying my patch from #10626 (comment) to the local nomad cli.
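A sketch of both workarounds. The file names, repo URL, tag, and build command are assumptions rather than steps taken from this thread, so treat it as a rough outline:

# Workaround 1: let the plugin create/adopt the volume (CreateVolume must be idempotent).
nomad volume create ./traefik-acme.volume.hcl

# Workaround 2: rebuild only the local CLI with the patch from the earlier comment;
# the servers do not need to be touched.
git clone https://github.com/hashicorp/nomad.git && cd nomad
git checkout v1.1.0
git apply ../volume-register.patch   # the diff posted above, saved locally
go build -o ./nomad-patched .        # plain Go build of the CLI/agent binary
./nomad-patched volume register ../traefik-acme.volume.hcl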

@tgross
Member

tgross commented May 24, 2021

Thanks for the repro plugin, @apollo13. I'm going to take this and (along with the patches merged this morning) see what I can come up with.

I'm having the same exact issue after upgrading to v1.1.0.
I'm not even able to downgrade back to v1.0.4 now; I get the following when I try to switch back to v1.0.4.
Nothing has changed apart from the Nomad binary.

Hi @khaledabdelaziz, just FYI Nomad cannot be downgraded.

@tgross
Member

tgross commented May 24, 2021

After that, something in Nomad manages to change and lose the access mode & attachment mode again.

Ok, so there's definitely a gap in the documentation around how this is supposed to work (and especially how it changed between Nomad 1.0 and Nomad 1.1.0). The original design for Nomad's CSI implementation for better or worse did not intend to implement the volume creation workflow. So when we decided otherwise, we ran into a contradiction between how the access/attach modes were being used and what the CreateVolume RPCs needed. Specifically, when creating a volume you can pass multiple access/attach modes, which are validated (and even potentially used for creation, depending on the plugin). But in Nomad's data model a volume can only have one access/attach mode, which is the one it's mounted with.

So in Nomad 1.1.0 the access/attach mode is removed from the volume when the volume claim is released ref csi.go#L609-L613. We probably should have a sentinel value that makes this clear. But when you register a volume, you're taking a different code path and we're recording that single value in the volume. But it doesn't mean anything once the volume claim is dropped.
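A minimal sketch of the claim-release behavior described above (an illustration, not the actual csi.go code; the claim-tracking field names follow the API payload earlier in the thread):

// Illustration only: once the last read/write claim is gone, Nomad 1.1.0
// clears the recorded access/attachment mode, since it only describes how
// the volume is currently mounted. The CLI then shows <none>.
type csiVolume struct {
	AccessMode, AttachmentMode string
	ReadAllocs, WriteAllocs    map[string]struct{} // claim tracking, shape assumed
}

func releaseClaims(vol *csiVolume) {
	if len(vol.ReadAllocs) == 0 && len(vol.WriteAllocs) == 0 {
		vol.AccessMode = ""
		vol.AttachmentMode = ""
	}
}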

Using the hostpath demo, we can see a volume created via nomad volume create:

$ nomad volume status 'test-volume[0]'
ID                   = test-volume[0]
Name                 = test-volume[0]
External ID          = 8811998c-bccd-11eb-b54e-0242ac110002
Plugin ID            = hostpath-plugin0
Provider             = csi-hostpath
Version              = v1.2.0-0-g83590990
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 1
Nodes Expected       = 1
Access Mode          = <none>
Attachment Mode      = <none>
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

But if we try to register a volume we still get the "unknown attachment mode" error:

$ nomad volume register ./volume.hcl
Error registering volume: Unexpected response code: 500 (controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: )

So I'm fairly certain that your patch is on the right track, @apollo13; there's just an unfortunately long chain of different RPCs that it needs to get threaded through. I'm getting towards the end of my day here, but I'll pick this back up tomorrow morning. Shouldn't be too terrible for me to fix.

@khaledabdelaziz

khaledabdelaziz commented May 25, 2021

@tgross Thanks for your input.
Do you mean Nomad cannot be downgraded from any later version to an earlier one, or just from v1.1.0?

I was able to work around the downgrade restriction with the following steps:
1- Take a snapshot of the existing v1.1.0 cluster
2- Shut down all 3 server nodes of the v1.1.0 cluster
3- Set up 3 new server nodes with v1.0.4
4- Bootstrap the new servers (they come up as a new cluster)
5- Restore the snapshot into the new nodes

That brought the cluster back with all ACLs and other settings (rough snapshot commands sketched below).
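A rough sketch of the snapshot commands behind steps 1 and 5, assuming default addresses and, where ACLs are enabled, a management token in NOMAD_TOKEN:

# Step 1: on the old v1.1.0 cluster, before shutting it down
nomad operator snapshot save backup.snap

# Step 5: against the freshly bootstrapped v1.0.4 servers
nomad operator snapshot restore backup.snap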

@apollo13
Contributor

So in Nomad 1.1.0 the access/attach mode is removed from the volume when the volume claim is released ref csi.go#L609-L613.

And since it failed scheduling in the csi_hook for me because I missed the access/attachment mode in the volume stanza, that looked like it would "reset"?
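For anyone else hitting this, a sketch of the job-side volume block being referred to; in Nomad 1.1 the access/attachment mode is also declared where the volume is claimed. The group/task names, image, and mount path are illustrative only:

group "wiki" {
  volume "wiki" {
    type            = "csi"
    source          = "wiki"
    access_mode     = "multi-node-multi-writer"
    attachment_mode = "file-system"
  }

  task "wiki" {
    driver = "docker"

    config {
      image = "example/wiki:latest" # placeholder
    }

    volume_mount {
      volume      = "wiki"
      destination = "/data" # placeholder
    }
  }
}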

@tgross
Member

tgross commented May 25, 2021

@khaledabdelaziz said:

Do you mean Nomad cannot be downgraded from any later version to an earlier one, or just from v1.1.0?

I was able to work around the downgrade restriction with the following steps:

To be precise, downgrading is unsupported from any version (or from ENT to OSS). We don't have any guarantee of forward compatibility in the state store, and it's entirely possible for that snapshot restore to fail as a result, leaving the server in a crash loop.

@apollo13 said:

And since it failed scheduling in the csi_hook for me because I missed the access/attachment mode in the volume stanza, that looked like it would "reset"?

Correct!

@tgross
Member

tgross commented Jun 3, 2021

I got pulled off to deal with #10694 for the last week or so, but I'm looking at this one again. Running a Nomad built with #10651 I was able to reproduce the problem fairly easily.

Spun up the hostpath plugin demo in https://github.com/hashicorp/nomad/tree/main/demo/csi/hostpath. This results in the expected volume claims.

Successful nomad volume create:
$ nomad volume status
Container Storage Interface
ID              Name            Plugin ID         Schedulable  Access Mode
test-volume[0]  test-volume[0]  hostpath-plugin0  true         single-node-reader-only
test-volume[1]  test-volume[1]  hostpath-plugin0  true         single-node-reader-only

$ nomad volume status 'test-volume[0]'
ID                   = test-volume[0]
Name                 = test-volume[0]
External ID          = e84bcaec-c49b-11eb-b9f3-0242ac110002
Plugin ID            = hostpath-plugin0
Provider             = csi-hostpath
Version              = v1.2.0-0-g83590990
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 1
Nodes Expected       = 1
Access Mode          = single-node-reader-only
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
ID                                    Node ID                               Task Group  Version  Desired  Status   Created     Modified
04d00b6c-4f38-a8ea-216b-a4ed7762ce83  c1b80d00-5dd2-7ed6-18ed-58c457e0129e  cache       0        run      running  14m16s ago  14m5s ago

But now let's try to register a volume. First create it in the storage provider:

endpoint=/var/nomad/client/csi/monolith/hostpath-plugin0/csi.sock
uuid=$(sudo csc --endpoint "$endpoint" controller \
    create-volume 'test-volume[2]' --cap 1,2,ext4 \
    | grep -o '".*"' | tr -d '"')
New volume spec:
id          = "VOLUME_NAME"
name        = "VOLUME_NAME"
type        = "csi"
plugin_id   = "hostpath-plugin0"
external_id = "VOLUME_UUID"

capacity_min = "1MB"
capacity_max = "1GB"

capability {
  access_mode     = "single-node-reader-only"
  attachment_mode = "file-system"
}

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

secrets {
  somesecret = "xyzzy"
}

mount_options {
  mount_flags = ["ro"]
}

And when we register that new volume, we get the error reported above:

sed -e "s/VOLUME_UUID/$uuid/" \
    -e "s/VOLUME_NAME/test-volume[2]/" \
    ./demo/csi/hostpath/hostpath-reg.hcl  | nomad volume register -
Error registering volume: Unexpected response code: 500 (controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: )

It looks like @apollo13's patch in #10626 (comment) will "fix" the problem but it won't give semantically correct results. The RequestedCapabilities field contains a list of capabilities and that's how we should be passing these parameters in the controllerValidateVolume method, similar to how we're doing it for createVolume. See the ValidateVolumeCapabilities RPC in the CSI spec.

Should be a smallish fix, so I'll work on that next.
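A very rough sketch of that direction (illustrative only, not the code in PR #10703): validate every requested capability pair and hand the whole list to the plugin's ValidateVolumeCapabilities RPC, instead of checking a single top-level mode:

package validate // illustrative package, not Nomad's internal API

import "fmt"

type capability struct{ AccessMode, AttachmentMode string }

type csiVolume struct {
	ID                    string
	RequestedCapabilities []*capability
}

func controllerValidateVolume(vol *csiVolume) error {
	if len(vol.RequestedCapabilities) == 0 {
		return fmt.Errorf("volume %q has no requested capabilities", vol.ID)
	}
	for _, c := range vol.RequestedCapabilities {
		if c.AccessMode == "" || c.AttachmentMode == "" {
			return fmt.Errorf("unknown volume attachment mode: %q", c.AttachmentMode)
		}
		// Each pair (and the full list) would then be sent to the plugin's
		// ValidateVolumeCapabilities RPC, as the create-volume path already does.
	}
	return nil
}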

@apollo13
Contributor

apollo13 commented Jun 3, 2021

Yes, my patch was just a band-aid -- my volumes contained only a single capability and the old code only allows for one. I needed a quick way to get my volumes working again, preferably without patching the server :D Thanks for working on this again!

@tgross
Member

tgross commented Jun 3, 2021

I've opened this PR #10703 and I imagine we'll be able to get that into the upcoming Nomad 1.1.1 patch. Thanks for your patience on this one, folks.

Nomad - Community Issues Triage automation moved this from In Progress to Done Jun 4, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 19, 2022