
Unable to register the GCE persistent disk CSI Driver #7734

Closed
auto-store opened this issue Apr 16, 2020 · 11 comments

@auto-store commented Apr 16, 2020

Nomad version

0.11.0

Issue

When running the GCE persistent disk CSI driver as a Nomad job, I am unable to register the CSI plugin with Nomad. Both the controller and node jobs are in a healthy state.


[tomh@hashi-server-1 ~]$ nomad status
ID          Type     Priority  Status   Submit Date
controller  service  50        running  2020-04-16T10:22:29Z
nodes       system   50        running  2020-04-16T10:22:36Z

Job file

Controller Job spec:

job "controller" {
  datacenters = ["london"]

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image = "gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v0.7.0-gke.0"

        args = [
          "controller",
          "--endpoint=unix://tmp/csi.sock",
          "--logtostderr"
        ]
      }

      csi_plugin {
        id        = "gcepd"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Node job spec:

job "nodes" {
  datacenters = ["london"]

  type = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v0.7.0-gke.0"

        args = [
          "node",
          "--endpoint=unix://tmp/csi.sock",
          "--logtostderr"
        ]

        privileged = true
      }

      csi_plugin {
        id        = "gcepd"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

[tomh@hashi-server-1 ~]$ nomad plugin status
Container Storage Interface
No CSI plugins
@tgross (Member) commented Apr 16, 2020

Hi @auto-store! The thing that immediately jumps out at me is the arguments you're passing to the plugin. If I look at the k8s jobspec it looks like it wants --csi-address and not --endpoint. Each plugin can define its own CLI arguments.

If that doesn't help, can you provide Nomad logs for the clients where these plugin jobs are running? The allocation logs (from nomad alloc logs -stderr :alloc_id) would be helpful too.

Also, keep in mind that right now it takes a good ~30s or so for a plugin to be registered. This is tracked in #7296

@auto-store (Author) commented:

Thanks @tgross.

So I can see the container needs service account credentials.

[tomh@hashi-server-1 ~]$ nomad alloc logs -stderr dc3f3452
W0416 20:18:24.754601       1 gce.go:127] GOOGLE_APPLICATION_CREDENTIALS env var not set

So now the job file looks like this:

job "controller" {
  datacenters = ["london"]

  group "controller" {
    task "plugin" {
      driver = "docker"
      template {
        data = <<EOH
{{ key "service_account" }}
EOH

        destination = "secrets/creds.json"
      }

      env {
        "GOOGLE_APPLICATION_CREDENTIALS" = "/secrets/creds.json"
      }

      config {
        image = "gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v0.7.0-gke.0"

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--v=6",
          "--logtostderr",
          "--run-node-service=false"
        ]
      }

      csi_plugin {
        id        = "gcepd"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}


Note that the flag is --endpoint; I found this by looking inside the container and checking the flags that can be passed to the plugin executable.

Both the controller and node jobs are running and healthy, but nomad plugin status still isn't showing the plugin as registered.

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
controller  1        1       1        0          2020-04-17T15:14:06Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
404f5c07  62d948bd  controller  1        run      running  2h23m ago  2h23m ago
[tomh@hashi-node-1 ~]$ nomad alloc logs -stderr 404
I0417 15:03:56.959778       1 main.go:67] Driver vendor version v0.7.0-gke.0
I0417 15:03:56.959858       1 gce.go:80] Using GCE provider config <nil>
I0417 15:03:56.961170       1 gce.go:125] GOOGLE_APPLICATION_CREDENTIALS env var set /secrets/creds.json
I0417 15:03:56.961191       1 gce.go:129] Using DefaultTokenSource &oauth2.reuseTokenSource{new:jwt.jwtSource{ctx:(*context.cancelCtx)(0xc000234340), conf:(*jwt.Config)(0xc000132640)}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
I0417 15:03:57.347123       1 gce.go:193] Using GCP zone from the Metadata server: "europe-west2-c"
I0417 15:03:57.348417       1 gce.go:208] Using GCP project ID from the Metadata server: "tharris-demo-env"
I0417 15:03:57.348448       1 gce-pd-driver.go:89] Enabling volume access mode: SINGLE_NODE_WRITER
I0417 15:03:57.348454       1 gce-pd-driver.go:89] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0417 15:03:57.348458       1 gce-pd-driver.go:99] Enabling controller service capability: CREATE_DELETE_VOLUME
I0417 15:03:57.348463       1 gce-pd-driver.go:99] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0417 15:03:57.348467       1 gce-pd-driver.go:99] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0417 15:03:57.348471       1 gce-pd-driver.go:99] Enabling controller service capability: LIST_SNAPSHOTS
I0417 15:03:57.348475       1 gce-pd-driver.go:99] Enabling controller service capability: PUBLISH_READONLY
I0417 15:03:57.348479       1 gce-pd-driver.go:99] Enabling controller service capability: EXPAND_VOLUME
I0417 15:03:57.348483       1 gce-pd-driver.go:99] Enabling controller service capability: LIST_VOLUMES
I0417 15:03:57.348487       1 gce-pd-driver.go:99] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0417 15:03:57.348491       1 gce-pd-driver.go:109] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0417 15:03:57.348496       1 gce-pd-driver.go:109] Enabling node service capability: EXPAND_VOLUME
I0417 15:03:57.348499       1 gce-pd-driver.go:109] Enabling node service capability: GET_VOLUME_STATS
I0417 15:03:57.348508       1 gce-pd-driver.go:156] Driver: pd.csi.storage.gke.io
I0417 15:03:57.348600       1 server.go:106] Start listening with scheme unix, addr /csi.sock
I0417 15:03:57.349121       1 server.go:125] Listening for connections on address: &net.UnixAddr{Name:"/csi.sock", Net:"unix"}

The same is true for the node job. It looks like it is set up and listening?

Here is some log output from the node that the job is running on:

[tomh@hashi-node-1 ~]$ sudo journalctl -xe -u nomad
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020-04-17T15:04:02.340Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e44cbf5
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020-04-17T15:04:02.340Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e44cbf5
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020/04/17 15:04:02.353071 [INFO] (runner) creating new runner (dry: false, once: false)
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020/04/17 15:04:02.353193 [INFO] (runner) creating watcher
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020/04/17 15:04:02.353278 [INFO] (runner) starting
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020/04/17 15:04:02.472593 [INFO] (runner) rendered "(dynamic)" => "/etc/nomad.d/alloc/6e44cbf5-d6ad-6db8-7eac-6e88a3ae2
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020-04-17T15:04:02.499Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=689bfa3f7acc61f
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020-04-17T15:04:02.887Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=689bfa3f7acc61f
Apr 17 15:04:27 hashi-node-1 nomad[2252]: 2020-04-17T15:04:27.558Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:04:27 hashi-node-1 nomad[2252]: 2020-04-17T15:04:27.558Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:04:27 hashi-node-1 nomad[2252]: 2020-04-17T15:04:27.820Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpo
Apr 17 15:04:31 hashi-node-1 nomad[2252]: 2020-04-17T15:04:31.522Z [INFO]  client: task exec session starting: exec_id=30508b69-ae92-987a-7952-7d3508d4524a alloc_
Apr 17 15:05:10 hashi-node-1 nomad[2252]: 2020-04-17T15:05:10.692Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:05:10 hashi-node-1 nomad[2252]: 2020-04-17T15:05:10.692Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:05:24 hashi-node-1 nomad[2252]: 2020/04/17 15:05:24.775804 http: response.WriteHeader on hijacked connection from github.com/hashicorp/nomad/vendor/gith
Apr 17 15:05:24 hashi-node-1 nomad[2252]: 2020/04/17 15:05:24.775926 http: response.Write on hijacked connection from compress/gzip.(*Writer).Write (gzip.go:168)
Apr 17 15:05:24 hashi-node-1 nomad[2252]: 2020-04-17T15:05:24.775Z [ERROR] http: request failed: method=GET path=/v1/client/allocation/404f5c07-994a-439c-440e-2ee
Apr 17 15:05:28 hashi-node-1 nomad[2252]: 2020-04-17T15:05:28.737Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:05:28 hashi-node-1 nomad[2252]: 2020-04-17T15:05:28.737Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:05:37 hashi-node-1 nomad[2252]: 2020-04-17T15:05:37.902Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:05:37 hashi-node-1 nomad[2252]: 2020-04-17T15:05:37.902Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:05:57 hashi-node-1 nomad[2252]: 2020/04/17 15:05:57.751442 http: response.WriteHeader on hijacked connection from github.com/hashicorp/nomad/vendor/gith
Apr 17 15:05:57 hashi-node-1 nomad[2252]: 2020/04/17 15:05:57.751568 http: response.Write on hijacked connection from compress/gzip.(*Writer).Write (gzip.go:168)
Apr 17 15:05:57 hashi-node-1 nomad[2252]: 2020-04-17T15:05:57.751Z [ERROR] http: request failed: method=GET path=/v1/client/allocation/404f5c07-994a-439c-440e-2ee
Apr 17 15:06:17 hashi-node-1 nomad[2252]: 2020-04-17T15:06:17.574Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:06:17 hashi-node-1 nomad[2252]: 2020-04-17T15:06:17.574Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:06:17 hashi-node-1 nomad[2252]: 2020-04-17T15:06:17.743Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpo
Apr 17 15:06:17 hashi-node-1 nomad[2252]: 2020-04-17T15:06:17.989Z [ERROR] http: request failed: method=GET path=/v1/namespace/default error="Nomad Enterprise onl
Apr 17 15:06:27 hashi-node-1 nomad[2252]: 2020-04-17T15:06:27.603Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:06:27 hashi-node-1 nomad[2252]: 2020-04-17T15:06:27.603Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:06:45 hashi-node-1 nomad[2252]: 2020-04-17T15:06:45.591Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpo
Apr 17 15:06:45 hashi-node-1 nomad[2252]: 2020-04-17T15:06:45.680Z [INFO]  client: task exec session starting: exec_id=e96750e5-25bc-16d0-428f-d4038a47a8df alloc_
Apr 17 15:12:11 hashi-node-1 nomad[2252]: 2020-04-17T15:12:11.931Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:12:11 hashi-node-1 nomad[2252]: 2020-04-17T15:12:11.931Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:14:25 hashi-node-1 nomad[2252]: 2020-04-17T15:14:25.333Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpo
Apr 17 17:11:48 hashi-node-1 nomad[2252]: 2020-04-17T17:11:48.963Z [INFO]  client.gc: garbage collecting allocation: alloc_id=a5cee886-dc0d-244b-da34-bcd8e93cbb01

@tgross (Member) commented Apr 17, 2020

The log from the allocation says:

I0417 15:03:57.348600       1 server.go:106] Start listening with scheme unix, addr /csi.sock
I0417 15:03:57.349121       1 server.go:125] Listening for connections on address: &net.UnixAddr{Name:"/csi.sock", Net:"unix"}

So the plugin is listening on unix:///csi.sock. That is, at the socket file csi.sock in the root directory.

The csi_plugin stanza says:

      csi_plugin {
        id        = "gcepd"
        type      = "node"
        mount_dir = "/csi"
      }

The Nomad client logs you've provided are only the end of the log (that's the -e flag), but if you page up through them, you'll probably find a message about trying to connect on unix:///csi/csi.sock and not finding it.

As I mentioned earlier, if I look at the k8s jobspec it looks like it wants --csi-address and not --endpoint. Each plugin can define its own CLI arguments. So I suspect (but can't confirm without the right logs) that you want the args stanza to be:

args = [
  "--csi-address=unix://csi/csi.sock",
  "--v=6",
  "--logtostderr",
  "--run-node-service=false"
]

@angrycub (Contributor) commented:

When we exec'd into the container, the executable expects the flag to be -endpoint. Is it possible that we are conflating GCE plugins?

@tgross (Member) commented Apr 17, 2020

Could be. No link to the plugin was provided here so I was going from the name. Is there one other than the one in the k8s sigs org that I've linked to?

@auto-store (Author) commented:

I changed to the args suggested. This is the allocation log for the controller, which is stuck in pending:

[tomh@hashi-server-1 ~]$ nomad alloc logs -stderr 75c
flag provided but not defined: -csi-address
Usage of /gce-pd-csi-driver:
  -add_dir_header
    	If true, adds the file directory to the header
  -alsologtostderr
    	log to standard error as well as files
  -cloud-config string
    	Path to GCE cloud provider config
  -endpoint string
    	CSI endpoint (default "unix:/tmp/csi.sock")
  -log_backtrace_at value
    	when logging hits line file:N, emit a stack trace
  -log_dir string
    	If non-empty, write log files in this directory
  -log_file string
    	If non-empty, use this log file
  -log_file_max_size uint
    	Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
  -logtostderr
    	log to standard error instead of files (default true)
  -run-controller-service
    	If set to false then the CSI driver does not activate its controller service (default: true) (default true)
  -run-node-service
    	If set to false then the CSI driver does not activate its node service (default: true) (default true)
  -skip_headers
    	If true, avoid header prefixes in the log messages
  -skip_log_headers
    	If true, avoid headers when opening log files
  -stderrthreshold value
    	logs at or above this threshold go to stderr (default 2)
  -v value
    	number for the log level verbosity
  -vmodule value
    	comma-separated list of pattern=N settings for file-filtered logging
(The same "flag provided but not defined: -csi-address" error and usage output is printed a second time in the log.)

I'm using the same link as you, @tgross.

@tgross (Member) commented Apr 17, 2020

Oh yup you're right, the link I have points to the config for the "provisioner" (their non-CSI sidecar). -endpoint looks right: https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/deploy/kubernetes/base/controller.yaml#L57 😊
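
For reference, combining the -endpoint flag with the mount_dir = "/csi" from the csi_plugin stanza, the socket has to end up inside that directory. A sketch of what the corrected controller args could look like, assuming the plugin accepts this form of the unix socket URI:

args = [
  # -endpoint is the plugin's own flag; the unix:///csi/csi.sock URI form is an assumption
  "--endpoint=unix:///csi/csi.sock",
  "--v=6",
  "--logtostderr",
  "--run-node-service=false"
]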

@angrycub (Contributor) commented:

We were using this container: gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v0.7.0-gke.0 (from here).

I popped into the container and ran the plugin by hand to check the flags.

Usage of /gce-pd-csi-driver:
  -add_dir_header
    	If true, adds the file directory to the header
  -alsologtostderr
    	log to standard error as well as files
  -cloud-config string
    	Path to GCE cloud provider config
  -endpoint string
    	CSI endpoint (default "unix:/tmp/csi.sock")
  -log_backtrace_at value
    	when logging hits line file:N, emit a stack trace
  -log_dir string
    	If non-empty, write log files in this directory
  -log_file string
    	If non-empty, use this log file
  -log_file_max_size uint
    	Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
  -logtostderr
    	log to standard error instead of files (default true)
  -run-controller-service
    	If set to false then the CSI driver does not activate its controller service (default: true) (default true)
  -run-node-service
    	If set to false then the CSI driver does not activate its node service (default: true) (default true)
  -skip_headers
    	If true, avoid header prefixes in the log messages
  -skip_log_headers
    	If true, avoid headers when opening log files
  -stderrthreshold value
    	logs at or above this threshold go to stderr (default 2)
  -v value
    	number for the log level verbosity
  -vmodule value
    	comma-separated list of pattern=N settings for file-filtered logging

Going to try and bring up my own copy.

@angrycub (Contributor) commented:

I was able to bring up a test and have a valid configuration. I encountered an issue that I was able to resolve in #7754, and then I was able to mount a volume as expected. Gist with the working config is here.
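
For readers who can't reach the gist, here is a minimal sketch of the kind of controller jobspec that ends up working, assembled from the pieces discussed earlier in this thread (the credentials template, the -endpoint flag, and a socket path under the csi_plugin mount_dir). The exact endpoint URI form is an assumption; the linked gist remains the authoritative version.

job "controller" {
  datacenters = ["london"]

  group "controller" {
    task "plugin" {
      driver = "docker"

      # Render the GCP service account credentials from Consul, as shown above.
      template {
        data = <<EOH
{{ key "service_account" }}
EOH

        destination = "secrets/creds.json"
      }

      env {
        GOOGLE_APPLICATION_CREDENTIALS = "/secrets/creds.json"
      }

      config {
        image = "gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v0.7.0-gke.0"

        # The plugin's own flag is -endpoint (not --csi-address), and the socket
        # must live under the csi_plugin mount_dir so Nomad can reach it.
        # (The unix:///csi/csi.sock URI form is an assumption.)
        args = [
          "--endpoint=unix:///csi/csi.sock",
          "--v=6",
          "--logtostderr",
          "--run-node-service=false"
        ]
      }

      csi_plugin {
        id        = "gcepd"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

The node jobspec would presumably follow the same pattern as the original system job above, with type = "node" in its csi_plugin stanza, privileged = true in the Docker config, and --run-controller-service=false in place of --run-node-service=false.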

tgross removed the CSI label Apr 21, 2020
tgross added this to the 0.11.1 milestone Apr 22, 2020
@tgross (Member) commented Apr 22, 2020

This will ship in the 0.11.1 release.

@github-actions (bot) commented Nov 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 8, 2022