
Registering volume gives error "Unknown volume attachment mode" #10626

Closed
henriots opened this issue May 19, 2021 · 28 comments · Fixed by #10703

Comments

@henriots

henriots commented May 19, 2021

Hello!

After upgrading to v1.1.0, volume registration gives an error:
Error registering volume: Unexpected response code: 500 (rpc error: controller validate volume: Unknown volume attachment mode: )

It worked on v1.0.4 with the following set in the volume configuration:

access_mode = "single-node-writer"
attachment_mode = "file-system"

Nomad version

Nomad v1.1.0 (2678c36)

Operating system and Environment details

Issue

Reproduction steps

Expected Result

Actual Result

Job file (if appropriate)

type = "csi"
id = "registry-nomad"
name = "registry-nomad"
external_id = "1234567"

capability {
  attachment_mode = "file-system"
  access_mode = "single-node-writer"
}

plugin_id = "csi.hetzner.cloud"
@optiz0r
Contributor

optiz0r commented May 19, 2021

Seeing the same here. I'm pretty sure it was working in 1.1.0-rc1, since I had previously amended my volume spec to include the newly required capability block, but after upgrading to 1.1.0 and finding a CSI allocation stuck, I deregistered it, and am now unable to re-register the volume with the same volume spec which had previously worked.

@tgross
Member

tgross commented May 20, 2021

Hi @henriots! Can you grab the allocation logs for the plugin? That might help diagnose the problem.

@tgross
Member

tgross commented May 20, 2021

Seeing the same here. I'm pretty sure it was working in 1.1.0-rc1, since I had previously amended my volume spec to include the newly required capability block, but after upgrading to 1.1.0 and finding a CSI allocation stuck, I deregistered it, and am now unable to re-register the volume with the same volume spec which had previously worked.

@optiz0r can you provide the error message you're seeing (plus alloc logs for the plugin, if available)?

@tgross tgross self-assigned this May 20, 2021
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation May 20, 2021
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage May 20, 2021
@optiz0r
Contributor

optiz0r commented May 20, 2021

Volume registration command failing:

$ nomad volume register vol-acme.hcl
Error registering volume: Unexpected response code: 500 (controller validate volume: rpc error: controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: )

CSI controller alloc logs (controller had just been restarted prior to running the above command):

{"service":"democratic-csi","level":"info","message":"initializing csi driver: zfs-generic-nfs"}
{"message":"setting default identity service caps","level":"debug","service":"democratic-csi"}
{"message":"setting default identity volume_expansion caps","level":"debug","service":"democratic-csi"}
{"message":"setting default controller caps","level":"debug","service":"democratic-csi"}
{"message":"setting default node caps","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"starting csi server - name: org.democratic-csi.nfs, version: 1.2.0, driver: zfs-generic-nfs, mode: controller, csi version: 1.2.0, address: 0.0.0.0:9000, socket: unix:///csi-data/csi.sock"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: GetPluginInfo call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: GetPluginInfo response: {\"name\":\"org.democratic-csi.nfs\",\"vendor_version\":\"1.2.0\"}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: GetPluginCapabilities call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: GetPluginCapabilities response: {\"capabilities\":[{\"service\":{\"type\":\"CONTROLLER_SERVICE\"}},{\"volume_expansion\":{\"type\":\"ONLINE\"}}]}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: ControllerGetCapabilities call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: ControllerGetCapabilities response: {\"capabilities\":[{\"rpc\":{\"type\":\"CREATE_DELETE_VOLUME\"}},{\"rpc\":{\"type\":\"LIST_VOLUMES\"}},{\"rpc\":{\"type\":\"GET_CAPACITY\"}},{\"rpc\":{\"type\":\"CREATE_DELETE_SNAPSHOT\"}},{\"rpc\":{\"type\":\"LIST_SNAPSHOTS\"}},{\"rpc\":{\"type\":\"CLONE_VOLUME\"}},{\"rpc\":{\"type\":\"EXPAND_VOLUME\"}}]}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: ControllerGetCapabilities call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: ControllerGetCapabilities response: {\"capabilities\":[{\"rpc\":{\"type\":\"CREATE_DELETE_VOLUME\"}},{\"rpc\":{\"type\":\"LIST_VOLUMES\"}},{\"rpc\":{\"type\":\"GET_CAPACITY\"}},{\"rpc\":{\"type\":\"CREATE_DELETE_SNAPSHOT\"}},{\"rpc\":{\"type\":\"LIST_SNAPSHOTS\"}},{\"rpc\":{\"type\":\"CLONE_VOLUME\"}},{\"rpc\":{\"type\":\"EXPAND_VOLUME\"}}]}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: Probe call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"message":"performing ssh sanity check..","level":"debug","service":"democratic-csi"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: Probe response: {\"ready\":{\"value\":true}}"}
{"service":"democratic-csi","level":"info","message":"new request - driver: ControllerZfsGenericDriver method: ControllerGetCapabilities call: {\"_events\":{},\"_eventsCount\":1,\"call\":{},\"cancelled\":false,\"metadata\":{\"_internal_repr\":{\"user-agent\":[\"grpc-go/1.27.1\"]},\"flags\":0},\"request\":{}}"}
{"service":"democratic-csi","level":"info","message":"new response - driver: ControllerZfsGenericDriver method: ControllerGetCapabilities response: {\"capabilities\":[{\"rpc\":{\"type\":\"CREATE_DELETE_VOLUME\"}},{\"rpc\":{\"type\":\"LIST_VOLUMES\"}},{\"rpc\":{\"type\":\"GET_CAPACITY\"}},{\"rpc\":{\"type\":\"CREATE_DELETE_SNAPSHOT\"}},{\"rpc\":{\"type\":\"LIST_SNAPSHOTS\"}},{\"rpc\":{\"type\":\"CLONE_VOLUME\"}},{\"rpc\":{\"type\":\"EXPAND_VOLUME\"}}]}"}

No relevant logs in the csi node allocs.

Nomad leader output when the volume register command is run:

2021-05-20T12:42:36.551Z [ERROR] http: request failed: method=PUT path=/v1/volume/csi/traefik-acme error="controller validate volume: rpc error: controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: " code=500
    2021-05-20T12:42:36.551Z [DEBUG] http: request complete: method=PUT path=/v1/volume/csi/traefik-acme duration=15.581266ms

Volume spec:

id = "traefik-acme"
name = "traefik-acme"
type = "csi"
plugin_id = "zfs-nfs"
external_id = ""

capability {
    access_mode = "single-node-writer"
    attachment_mode = "file-system"
}

mount_options {
    fs_type = "nfs"
    mount_flags = ["nolock"]
}

context {
    node_attach_driver = "nfs"
    provisionder_driver = "zfs-generic-nfs"
    server = "nfs-server.example.com"
    share = "/pool/democratic/root/traefik-acme"
}

@henriots
Author

The CSI alloc logs contain only the following, so they're not much help:

level=debug ts=2021-05-20T13:54:15.061513695Z component=grpc-server msg="handling request" req=
level=debug ts=2021-05-20T13:54:15.06163949Z component=grpc-server msg="finished handling request"
level=debug ts=2021-05-20T13:54:15.062113677Z component=grpc-server msg="handling request" req=
level=debug ts=2021-05-20T13:54:15.062190551Z component=grpc-server msg="finished handling request"
level=debug ts=2021-05-20T13:54:15.062231988Z component=grpc-server msg="handling request" req=
level=debug ts=2021-05-20T13:54:15.062312238Z component=grpc-server msg="finished handling request"
level=debug ts=2021-05-20T13:54:15.06230784Z component=grpc-server msg="handling request" req=
level=debug ts=2021-05-20T13:54:15.062354127Z component=grpc-server msg="finished handling request"

Nomad logs
May 20 16:56:10 util nomad[13080]: 2021-05-20T16:56:10.219+0300 [ERROR] http: request failed: method=PUT path=/v1/volume/csi/registry-nomad error="rpc error: controller validate volume: rpc error: controller validate volume: Unknown volume attachment mode: " code=500

@apollo13
Contributor

Same here. I think this happens in Nomad itself, before it even talks to the CSI plugin. It can be reproduced with https://gitlab.com/rocketduck/csi-plugin-nfs/-/tree/main/nomad (the example.volume file probably needs adjusting for 1.1, i.e. the new capability block).

@apollo13
Contributor

I wonder if this is at fault:

AttachmentMode structs.CSIVolumeAttachmentMode
AccessMode structs.CSIVolumeAccessMode

The code still assumes that a single attachment_mode exists, when it should already be reading a list of capabilities, no?
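A minimal sketch of the shape being described here (not Nomad's actual source; field names follow the register API payload posted later in this thread). The point is that the volume struct still carries the legacy single-value mode fields next to the new capability list, and validation reads the former while the 1.1.0 CLI only fills the latter:

// Sketch only, not Nomad's real structs.
type CSIVolumeCapability struct {
	AccessMode     string // e.g. "single-node-writer"
	AttachmentMode string // e.g. "file-system"
}

type CSIVolume struct {
	// Legacy pre-1.1 fields: the 1.1.0 CLI leaves these empty...
	AccessMode     string
	AttachmentMode string
	// ...and fills this list from the capability {} blocks instead,
	// but the register validation still reads the two fields above.
	RequestedCapabilities []*CSIVolumeCapability
}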

@bfqrst

bfqrst commented May 21, 2021

Seeing similar behaviour with the AWS EBS CSI plugin. The CSI plugin logs are clean, but the Nomad client handling the register request shows those 500s...

On subsequent tries (arrow-up and Enter), the error wording varies slightly:
Error registering volume: Unexpected response code: 500 (rpc error: rpc error: controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: )
Error registering volume: Unexpected response code: 500 (rpc error: controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: )
Error registering volume: Unexpected response code: 500 (rpc error: controller validate volume: rpc error: controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: )

Some more context in my case: https://discuss.hashicorp.com/t/unable-to-get-a-csi-volume-registered/18805/3

@bfqrst

bfqrst commented May 21, 2021

Same here. I think this happens in Nomad itself, before it even talks to the CSI plugin. It can be reproduced with https://gitlab.com/rocketduck/csi-plugin-nfs/-/tree/main/nomad (the example.volume file probably needs adjusting for 1.1, i.e. the new capability block).

...which would explain why the plugin logs are clean...

@apollo13
Contributor

Mhm, now that I managed to recreate the volume via nomad volume create it still shows this in the UI:

[screenshot: the Nomad UI showing the volume with an empty Access Mode]

So the access mode is empty and it can't be scheduled, looks like I'll have to downgrade? :/

@bfqrst

bfqrst commented May 21, 2021

Here's another twist in the plot: both the controller and node plugins run on the same Nomad worker.

  1. both are healthy, volume registration throwing 500

Controllers Healthy = 1
Controllers Expected = 1
Nodes Healthy = 1
Nodes Expected = 1

  2. you reboot this particular machine, plugins become unhealthy

Controllers Healthy = 0
Controllers Expected = 1
Nodes Healthy = 0
Nodes Expected = 1

  3. now you try to register your volume --> which works !?!

Container Storage Interface
ID Name Plugin ID Schedulable Access Mode
gitea gitea aws-ebs0 false

Might be unrelated but strange still...

@apollo13
Contributor

apollo13 commented May 21, 2021

@tgross I have prepared an easy reproducer for you. Deploy this job:

job "storage" {
  datacenters = ["dc1"]
  type        = "system"

  group "storage" {
    task "storage" {
      driver = "docker"

      env {
        ROCKETDUCK_CSI_TEST = "true"
      }

      config {
        image = "registry.gitlab.com/rocketduck/csi-plugin-nfs:0.3.0"

        args = [
          "--type=monolith",
          "--node-id=${attr.unique.hostname}",
          "--nfs-server=/mnt",
          "--log-level=DEBUG",
        ]

        network_mode = "host"

        privileged = true
      }

      csi_plugin {
        id        = "nfs"
        type      = "monolith"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }

    }
  }
}

This job runs a monolith version of my plugin in test mode; it has no dependency on an external NFS server or anything. Then try registering this volume spec:

id = "dav"
name = "dav"
type = "csi"
external_id = "dav"
plugin_id = "nfs"

capacity_min = "100MB"
capacity_max = "1GB"

capability {
  access_mode     = "multi-node-multi-writer"
  attachment_mode = "file-system"
}
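Assuming the spec above is saved as dav.volume.hcl (file name assumed), registering it reproduces the 500 error shown earlier in the thread:

nomad volume register dav.volume.hcl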

@apollo13
Contributor

This is what is sent to Nomad via the API:

{
    "Volumes": [
        {
            "ID": "dav",
            "Name": "dav",
            "ExternalID": "dav",
            "Namespace": "",
            "Topologies": null,
            "AccessMode": "",
            "AttachmentMode": "",
            "MountOptions": null,
            "Secrets": null,
            "Parameters": null,
            "Context": null,
            "Capacity": 0,
            "RequestedCapacityMin": 100000000,
            "RequestedCapacityMax": 1000000000,
            "RequestedCapabilities": [
                {
                    "AccessMode": "multi-node-multi-writer",
                    "AttachmentMode": "file-system"
                }
            ],
            "CloneID": "",
            "SnapshotID": "",
            "ReadAllocs": null,
            "WriteAllocs": null,
            "Allocations": null,
            "Schedulable": false,
            "PluginID": "nfs",
            "Provider": "",
            "ProviderVersion": "",
            "ControllerRequired": false,
            "ControllersHealthy": 0,
            "ControllersExpected": 0,
            "NodesHealthy": 0,
            "NodesExpected": 0,
            "ResourceExhausted": "0001-01-01T00:00:00Z",
            "CreateIndex": 0,
            "ModifyIndex": 0
        }
    ],
    "Region": "",
    "Namespace": "",
    "SecretID": ""
}
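For completeness, a sketch of replaying that payload directly against the HTTP API. The endpoint path comes from the server logs earlier in the thread (PUT /v1/volume/csi/<id>); the address and file name are assumptions:

# Sketch only: 127.0.0.1:4646 and payload.json are placeholders.
# Add -H "X-Nomad-Token: ..." if ACLs are enabled.
curl -s -X PUT --data @payload.json \
    http://127.0.0.1:4646/v1/volume/csi/dav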

@apollo13
Contributor

Funny story, after patching my nomad binary locally like this:

diff --git a/command/volume_register_csi.go b/command/volume_register_csi.go
index aa68d6351..c4dcd9290 100644
--- a/command/volume_register_csi.go
+++ b/command/volume_register_csi.go
@@ -91,6 +91,8 @@ func csiDecodeVolume(input *ast.File) (*api.CSIVolume, error) {
                        }
 
                        vol.RequestedCapabilities = append(vol.RequestedCapabilities, cap)
+                       vol.AccessMode = cap.AccessMode
+                       vol.AttachmentMode = cap.AttachmentMode
                }
        }

I was able to properly register the volume. Running status yielded:

ID                   = wiki
Name                 = wiki
External ID          = wiki
Plugin ID            = nfs
Provider             = dev.rocketduck.csi.nfs
Version              = 0.2.0
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = multi-node-multi-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

but when deploying a job against it, it still complained: failed to setup alloc: pre-run hook "csi_hook" failed: unknown volume attachment mode:

and now all of a sudden Access Mode & Attachment mode show nothing:

ID                   = wiki
Name                 = wiki
External ID          = wiki
Plugin ID            = nfs
Provider             = dev.rocketduck.csi.nfs
Version              = 0.2.0
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = <none>
Attachment Mode      = <none>
Mount Options        = <none>
Namespace            = default

@apollo13
Contributor

So there are at least two bugs: register still reads the "old" pre-1.1 fields, which the CLI tooling no longer populates, when it should be reading the capabilities. After that, something in Nomad manages to change and lose the access mode & attachment mode again.

@bfqrst
Copy link

bfqrst commented May 21, 2021

So there are at least two bugs: register still reads the "old" pre-1.1 fields, which the CLI tooling no longer populates, when it should be reading the capabilities. After that, something in Nomad manages to change and lose the access mode & attachment mode again.

@apollo13 Can you reboot your node and check if the plugin comes back healthy? Because that happens in my case with the AWS EBS plugin... Might be an (unrelated) third bug.

@apollo13
Contributor

apollo13 commented May 23, 2021

After that, something in Nomad manages to change and lose the access mode & attachment mode again.

Ok, that one is wrong. This was caused by me not adding access_mode/etc to the volume {} stanza in the job.

EDIT: The main problem here (first and foremost) seems to be that nomad volume register cannot handle the new capability {} blocks correctly and still uses the old, now-empty fields.

@khaledabdelaziz

khaledabdelaziz commented May 23, 2021

I'm having the same exact issue after upgrading to v1.1.0.
I'm not even able to downgrade back to v1.0.4 now; I get the following when I try to switch back to v1.0.4.
Nothing has changed apart from the Nomad binary.

    2021-05-22T08:03:26.572Z [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-05-22T08:03:26.572Z [INFO]  nomad.raft: entering candidate state: node="Node at xx.xx.xx.xx:4647 [Candidate]" term=40
    2021-05-22T08:03:26.581Z [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter xx.xx.xx.xx:4647 xx.xx.xx.xx:4647}" error="dial tcp xx.xx.xx.xx:4647: connect: connection refused"
    2021-05-22T08:03:26.581Z [INFO]  nomad.raft: election won: tally=2
    2021-05-22T08:03:26.581Z [INFO]  nomad.raft: entering leader state: leader="Node at xx.xx.xx.xx:4647 [Leader]"
    2021-05-22T08:03:26.581Z [INFO]  nomad.raft: added peer, starting replication: peer=xx.xx.xx.xx:4647
    2021-05-22T08:03:26.581Z [INFO]  nomad.raft: added peer, starting replication: peer=xx.xx.xx.xx:4647
    2021-05-22T08:03:26.581Z [INFO]  nomad: cluster leadership acquired
    2021-05-22T08:03:26.582Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter xx.xx.xx.xx:4647 xx.xx.xx.xx:4647}" error="dial tcp xx.xx.xx.xx:4647: connect: connection refused"
    2021-05-22T08:03:26.582Z [INFO]  nomad.raft: pipelining replication: peer="{Voter xx.xx.xx.xx:4647 xx.xx.xx.xx:4647}"
panic: failed to apply request: []byte{0x2e, 0x84, 0xa6, 0x52, 0x65, 0x67, 0x69, 0x6f, 0x6e, 0xa6, 0x67, 0x6c, 0x6f, 0x62, 0x61, 0x6c, 0xa9, 0x4e, 0x61, 0x6d, 0x65, 0x73, 0x70, 0x61, 0x63, 0x65, 0xa0, 0xa9, 0x41, 0x75, 0x74, 0x68, 0x54, 0x6f, 0x6b, 0x65, 0x6e, 0xda, 0x0, 0x24, 0x34, 0x65, 0x33, 0x66, 0x62, 0x62, 0x32, 0x63, 0x2d, 0x64, 0x32, 0x65, 0x63, 0x2d, 0x65, 0x37, 0x61, 0x39, 0x2d, 0x65, 0x65, 0x37, 0x37, 0x2d, 0x35, 0x61, 0x36, 0x63, 0x65, 0x62, 0x30, 0x64, 0x33, 0x34, 0x62, 0x63, 0xa9, 0x46, 0x6f, 0x72, 0x77, 0x61, 0x72, 0x64, 0x65, 0x64, 0xc3}

goroutine 89 [running]:
github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0xc00024ce70, 0xc000dce7d0, 0x52c4d20, 0xc0224c93a3359aa8)
        github.com/hashicorp/nomad/nomad/fsm.go:315 +0x1b17
github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc0002cc760)
        github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:90 +0x2c2
github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc000554600, 0x40, 0x40)
        github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:113 +0x75
github.com/hashicorp/raft.(*Raft).runFSM(0xc000511b00)
        github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/fsm.go:219 +0x3c4
github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc000511b00, 0xc000730960)
        github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/state.go:146 +0x55
created by github.com/hashicorp/raft.(*raftState).goFunc
        github.com/hashicorp/raft@v1.1.3-0.20200211192230-365023de17e6/state.go:144 +0x66
WARNING: keyring exists but -encrypt given, using keyring

@apollo13
Contributor

FWIW I have the following workaround: Instead of registering the volume I just ran "nomad volume create" -- for any proper CSI driver this will work because the CreateVolume RPC call must be idempotent (see https://github.com/container-storage-interface/spec/blob/master/spec.md#controller-service-rpc). If that doesn't work, you can get around this by applying my patch from #10626 (comment) to the local nomad cli.
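A sketch of both workarounds. The file names, repo URL, tag, and build command are assumptions rather than steps taken from this thread, so treat it as a rough outline:

# Workaround 1: let the plugin create/adopt the volume (CreateVolume must be idempotent).
nomad volume create ./traefik-acme.volume.hcl

# Workaround 2: rebuild only the local CLI with the patch from the earlier comment;
# the servers do not need to be touched.
git clone https://github.com/hashicorp/nomad.git && cd nomad
git checkout v1.1.0
git apply ../volume-register.patch   # the diff posted above, saved locally
go build -o ./nomad-patched .        # plain Go build of the CLI/agent binary
./nomad-patched volume register ../traefik-acme.volume.hcl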

@tgross
Member

tgross commented May 24, 2021

Thanks for the repro plugin, @apollo13. I'm going to take this and (along with the patches merged this morning) see what I can come up with.

I'm having the same exact issue after upgrading to v1.1.0.
I'm not even able to downgrade back to v1.0.4 now; I get the following when I try to switch back to v1.0.4.
Nothing has changed apart from the Nomad binary.

Hi @khaledabdelaziz, just FYI Nomad cannot be downgraded.

@tgross
Member

tgross commented May 24, 2021

After that, something in Nomad manages to change and lose the access mode & attachment mode again.

Ok, so there's definitely a gap in the documentation around how this is supposed to work (and especially how it changed between Nomad 1.0 and Nomad 1.1.0). The original design for Nomad's CSI implementation for better or worse did not intend to implement the volume creation workflow. So when we decided otherwise, we ran into a contradiction between how the access/attach modes were being used and what the CreateVolume RPCs needed. Specifically, when creating a volume you can pass multiple access/attach modes, which are validated (and even potentially used for creation, depending on the plugin). But in Nomad's data model a volume can only have one access/attach mode, which is the one it's mounted with.

So in Nomad 1.1.0 the access/attach mode is removed from the volume when the volume claim is released ref csi.go#L609-L613. We probably should have a sentinel value that makes this clear. But when you register a volume, you're taking a different code path and we're recording that single value in the volume. But it doesn't mean anything once the volume claim is dropped.
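A minimal sketch of the claim-release behavior described above (an illustration, not the actual csi.go code; the claim-tracking field names follow the API payload earlier in the thread):

// Illustration only: once the last read/write claim is gone, Nomad 1.1.0
// clears the recorded access/attachment mode, since it only describes how
// the volume is currently mounted. The CLI then shows <none>.
type csiVolume struct {
	AccessMode, AttachmentMode string
	ReadAllocs, WriteAllocs    map[string]struct{} // claim tracking, shape assumed
}

func releaseClaims(vol *csiVolume) {
	if len(vol.ReadAllocs) == 0 && len(vol.WriteAllocs) == 0 {
		vol.AccessMode = ""
		vol.AttachmentMode = ""
	}
}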

Using the hostpath demo, we can see a volume created via nomad volume create:

$ nomad volume status 'test-volume[0]'
ID                   = test-volume[0]
Name                 = test-volume[0]
External ID          = 8811998c-bccd-11eb-b54e-0242ac110002
Plugin ID            = hostpath-plugin0
Provider             = csi-hostpath
Version              = v1.2.0-0-g83590990
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 1
Nodes Expected       = 1
Access Mode          = <none>
Attachment Mode      = <none>
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

But if we try to register a volume we still get the "unknown attachment mode" error:

$ nomad volume register ./volume.hcl
Error registering volume: Unexpected response code: 500 (controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: )

So I'm fairly certain that your patch is on the right track, @apollo13; there's just an unfortunately long chain of different RPCs that it needs to get threaded through. I'm getting towards the end of my day here, but I'll pick this back up tomorrow morning. Shouldn't be too terrible for me to fix.

@khaledabdelaziz

khaledabdelaziz commented May 25, 2021

@tgross Thanks for your input.
Do you mean Nomad cannot be downgraded from any later version to an earlier one, or just from v1.1.0?

I was able to work around the downgrade restriction with the following steps:
1- Take a snapshot of the existing v1.1.0 cluster
2- Shut down all 3 server nodes of the v1.1.0 cluster
3- Set up 3 new server nodes with v1.0.4
4- Bootstrap the new servers (they come up as a new cluster)
5- Restore the snapshot into the new nodes

That brought the cluster back with all ACLs and other settings (rough snapshot commands sketched below).
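A rough sketch of the snapshot commands behind steps 1 and 5, assuming default addresses and, where ACLs are enabled, a management token in NOMAD_TOKEN:

# Step 1: on the old v1.1.0 cluster, before shutting it down
nomad operator snapshot save backup.snap

# Step 5: against the freshly bootstrapped v1.0.4 servers
nomad operator snapshot restore backup.snap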

@apollo13
Contributor

So in Nomad 1.1.0 the access/attach mode is removed from the volume when the volume claim is released ref csi.go#L609-L613.

And since it failed scheduling in the csi_hook for me because I missed the access/attachment mode in the volume stanza, that looked like it would "reset"?
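For anyone else hitting this, a sketch of the job-side volume block being referred to; in Nomad 1.1 the access/attachment mode is also declared where the volume is claimed. The group/task names, image, and mount path are illustrative only:

group "wiki" {
  volume "wiki" {
    type            = "csi"
    source          = "wiki"
    access_mode     = "multi-node-multi-writer"
    attachment_mode = "file-system"
  }

  task "wiki" {
    driver = "docker"

    config {
      image = "example/wiki:latest" # placeholder
    }

    volume_mount {
      volume      = "wiki"
      destination = "/data" # placeholder
    }
  }
}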

@tgross
Member

tgross commented May 25, 2021

@khaledabdelaziz said:

Do you mean Nomad cannot be downgraded from any later version to an earlier one, or just from v1.1.0?

I was able to work around the downgrade restriction with the following steps:

To be precise, downgrading is unsupported from any version (or from ENT to OSS). We don't have any guarantee of forward compatibility in the state store, and it's entirely possible for that snapshot restore to fail as a result, leaving the server in a crash loop.

@apollo13 said:

And since it failed scheduling in the csi_hook for me because I missed the access/attachment mode in the volume stanza, that looked like it would "reset"?

Correct!

@tgross
Member

tgross commented Jun 3, 2021

I got pulled off to deal with #10694 for the last week or so, but I'm looking at this one again. Running a Nomad built with #10651 I was able to reproduce the problem fairly easily.

Spun up the hostpath plugin demo in https://github.com/hashicorp/nomad/tree/main/demo/csi/hostpath. This results in the expected volume claims.

Successful nomad volume create:
$ nomad volume status
Container Storage Interface
ID              Name            Plugin ID         Schedulable  Access Mode
test-volume[0]  test-volume[0]  hostpath-plugin0  true         single-node-reader-only
test-volume[1]  test-volume[1]  hostpath-plugin0  true         single-node-reader-only

$ nomad volume status 'test-volume[0]'
ID                   = test-volume[0]
Name                 = test-volume[0]
External ID          = e84bcaec-c49b-11eb-b9f3-0242ac110002
Plugin ID            = hostpath-plugin0
Provider             = csi-hostpath
Version              = v1.2.0-0-g83590990
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 1
Nodes Expected       = 1
Access Mode          = single-node-reader-only
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
ID                                    Node ID                               Task Group  Version  Desired  Status   Created     Modified
04d00b6c-4f38-a8ea-216b-a4ed7762ce83  c1b80d00-5dd2-7ed6-18ed-58c457e0129e  cache       0        run      running  14m16s ago  14m5s ago

But now let's try to register a volume. First create it in the storage provider:

endpoint=/var/nomad/client/csi/monolith/hostpath-plugin0/csi.sock
uuid=$(sudo csc --endpoint "$endpoint" controller \
    create-volume 'test-volume[2]' --cap 1,2,ext4 \
    | grep -o '".*"' | tr -d '"')
New volume spec:
id          = "VOLUME_NAME"
name        = "VOLUME_NAME"
type        = "csi"
plugin_id   = "hostpath-plugin0"
external_id = "VOLUME_UUID"

capacity_min = "1MB"
capacity_max = "1GB"

capability {
  access_mode     = "single-node-reader-only"
  attachment_mode = "file-system"
}

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

secrets {
  somesecret = "xyzzy"
}

mount_options {
  mount_flags = ["ro"]
}

And when we register that new volume, we get the error reported above:

sed -e "s/VOLUME_UUID/$uuid/" \
    -e "s/VOLUME_NAME/test-volume[2]/" \
    ./demo/csi/hostpath/hostpath-reg.hcl  | nomad volume register -
Error registering volume: Unexpected response code: 500 (controller validate volume: CSI.ControllerValidateVolume: unknown volume attachment mode: )

It looks like @apollo13's patch in #10626 (comment) will "fix" the problem but it won't give semantically correct results. The RequestedCapabilities field contains a list of capabilities and that's how we should be passing these parameters in the controllerValidateVolume method, similar to how we're doing it for createVolume. See the ValidateVolumeCapabilities RPC in the CSI spec.

Should be a smallish fix, so I'll work on that next.
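A very rough sketch of that direction (illustrative only, not the code in PR #10703): validate every requested capability pair and hand the whole list to the plugin's ValidateVolumeCapabilities RPC, instead of checking a single top-level mode:

package validate // illustrative package, not Nomad's internal API

import "fmt"

type capability struct{ AccessMode, AttachmentMode string }

type csiVolume struct {
	ID                    string
	RequestedCapabilities []*capability
}

func controllerValidateVolume(vol *csiVolume) error {
	if len(vol.RequestedCapabilities) == 0 {
		return fmt.Errorf("volume %q has no requested capabilities", vol.ID)
	}
	for _, c := range vol.RequestedCapabilities {
		if c.AccessMode == "" || c.AttachmentMode == "" {
			return fmt.Errorf("unknown volume attachment mode: %q", c.AttachmentMode)
		}
		// Each pair (and the full list) would then be sent to the plugin's
		// ValidateVolumeCapabilities RPC, as the create-volume path already does.
	}
	return nil
}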

@apollo13
Contributor

apollo13 commented Jun 3, 2021

Yes, my patch was just a band-aid -- my volumes contained only a single capability and the old code only allows for one. I needed a quick way to get my volumes working again, preferably without patching the server :D Thanks for working on this again!

@tgross
Member

tgross commented Jun 3, 2021

I've opened this PR #10703 and I imagine we'll be able to get that into the upcoming Nomad 1.1.1 patch. Thanks for your patience on this one, folks.

Nomad - Community Issues Triage automation moved this from In Progress to Done Jun 4, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 19, 2022