
Unable to register the GCE persistent disk CSI Driver #7734

Closed
auto-store opened this issue Apr 16, 2020 · 11 comments

@auto-store commented Apr 16, 2020

Nomad version

0.11.0

Issue

When running the GCE persistent disk CSI driver as a Nomad job, I am unable to register the CSI plugin with Nomad. Both the controller and node jobs are in a healthy state.


[tomh@hashi-server-1 ~]$ nomad status
ID          Type     Priority  Status   Submit Date
controller  service  50        running  2020-04-16T10:22:29Z
nodes       system   50        running  2020-04-16T10:22:36Z

Job file

Controller Job spec:

job "controller" {
  datacenters = ["london"]

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image = "gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v0.7.0-gke.0"

        args = [
          "controller",
          "--endpoint=unix://tmp/csi.sock",
          "--logtostderr"
        ]
      }

      csi_plugin {
        id        = "gcepd"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Node job spec:

job "nodes" {
  datacenters = ["london"]

  type = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v0.7.0-gke.0"

        args = [
          "node",
          "--endpoint=unix://tmp/csi.sock",
          "--logtostderr"
        ]

        privileged = true
      }

      csi_plugin {
        id        = "gcepd"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

[tomh@hashi-server-1 ~]$ nomad plugin status
Container Storage Interface
No CSI plugins
@tgross (Member) commented Apr 16, 2020

Hi @auto-store! The thing that immediately jumps out at me is the arguments you're passing to the plugin. If I look at the k8s jobspec it looks like it wants --csi-address and not --endpoint. Each plugin can define its own CLI arguments.

If that doesn't help, can you provide Nomad logs for the clients where these plugin jobs are running? The allocation logs (from nomad alloc logs -stderr :alloc_id) would be helpful too.

Also, keep in mind that right now it takes a good ~30s or so for a plugin to be registered. This is tracked in #7296

@auto-store (Author) commented:

Thanks @tgross.

So I can see the container needs service account credentials.

[tomh@hashi-server-1 ~]$ nomad alloc logs -stderr dc3f3452
W0416 20:18:24.754601       1 gce.go:127] GOOGLE_APPLICATION_CREDENTIALS env var not set

So now the job file looks like this:

job "controller" {
  datacenters = ["london"]

  group "controller" {
    task "plugin" {
      driver = "docker"
      template {
        data = <<EOH
{{ key "service_account" }}
EOH

        destination = "secrets/creds.json"
      }

      env {
        "GOOGLE_APPLICATION_CREDENTIALS" = "/secrets/creds.json"
      }

      config {
        image = "gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v0.7.0-gke.0"

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--v=6",
          "--logtostderr",
          "--run-node-service=false"
        ]
      }

      csi_plugin {
        id        = "gcepd"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}


Note that the flag is --endpoint; I found this by looking inside the container and checking the flags that can be passed to the plugin executable.

Both the controller and node jobs are running and healthy, but nomad plugin status still isn't showing the plugin as registered.

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
controller  1        1       1        0          2020-04-17T15:14:06Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
404f5c07  62d948bd  controller  1        run      running  2h23m ago  2h23m ago
[tomh@hashi-node-1 ~]$ nomad alloc logs -stderr 404
I0417 15:03:56.959778       1 main.go:67] Driver vendor version v0.7.0-gke.0
I0417 15:03:56.959858       1 gce.go:80] Using GCE provider config <nil>
I0417 15:03:56.961170       1 gce.go:125] GOOGLE_APPLICATION_CREDENTIALS env var set /secrets/creds.json
I0417 15:03:56.961191       1 gce.go:129] Using DefaultTokenSource &oauth2.reuseTokenSource{new:jwt.jwtSource{ctx:(*context.cancelCtx)(0xc000234340), conf:(*jwt.Config)(0xc000132640)}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
I0417 15:03:57.347123       1 gce.go:193] Using GCP zone from the Metadata server: "europe-west2-c"
I0417 15:03:57.348417       1 gce.go:208] Using GCP project ID from the Metadata server: "tharris-demo-env"
I0417 15:03:57.348448       1 gce-pd-driver.go:89] Enabling volume access mode: SINGLE_NODE_WRITER
I0417 15:03:57.348454       1 gce-pd-driver.go:89] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0417 15:03:57.348458       1 gce-pd-driver.go:99] Enabling controller service capability: CREATE_DELETE_VOLUME
I0417 15:03:57.348463       1 gce-pd-driver.go:99] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0417 15:03:57.348467       1 gce-pd-driver.go:99] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0417 15:03:57.348471       1 gce-pd-driver.go:99] Enabling controller service capability: LIST_SNAPSHOTS
I0417 15:03:57.348475       1 gce-pd-driver.go:99] Enabling controller service capability: PUBLISH_READONLY
I0417 15:03:57.348479       1 gce-pd-driver.go:99] Enabling controller service capability: EXPAND_VOLUME
I0417 15:03:57.348483       1 gce-pd-driver.go:99] Enabling controller service capability: LIST_VOLUMES
I0417 15:03:57.348487       1 gce-pd-driver.go:99] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0417 15:03:57.348491       1 gce-pd-driver.go:109] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0417 15:03:57.348496       1 gce-pd-driver.go:109] Enabling node service capability: EXPAND_VOLUME
I0417 15:03:57.348499       1 gce-pd-driver.go:109] Enabling node service capability: GET_VOLUME_STATS
I0417 15:03:57.348508       1 gce-pd-driver.go:156] Driver: pd.csi.storage.gke.io
I0417 15:03:57.348600       1 server.go:106] Start listening with scheme unix, addr /csi.sock
I0417 15:03:57.349121       1 server.go:125] Listening for connections on address: &net.UnixAddr{Name:"/csi.sock", Net:"unix"}

The same is true for the node job. It looks like it is set up and listening?

Here is some log output from the node that the job is running on:

[tomh@hashi-node-1 ~]$ sudo journalctl -xe -u nomad
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020-04-17T15:04:02.340Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e44cbf5
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020-04-17T15:04:02.340Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=6e44cbf5
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020/04/17 15:04:02.353071 [INFO] (runner) creating new runner (dry: false, once: false)
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020/04/17 15:04:02.353193 [INFO] (runner) creating watcher
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020/04/17 15:04:02.353278 [INFO] (runner) starting
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020/04/17 15:04:02.472593 [INFO] (runner) rendered "(dynamic)" => "/etc/nomad.d/alloc/6e44cbf5-d6ad-6db8-7eac-6e88a3ae2
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020-04-17T15:04:02.499Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=689bfa3f7acc61f
Apr 17 15:04:02 hashi-node-1 nomad[2252]: 2020-04-17T15:04:02.887Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=689bfa3f7acc61f
Apr 17 15:04:27 hashi-node-1 nomad[2252]: 2020-04-17T15:04:27.558Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:04:27 hashi-node-1 nomad[2252]: 2020-04-17T15:04:27.558Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:04:27 hashi-node-1 nomad[2252]: 2020-04-17T15:04:27.820Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpo
Apr 17 15:04:31 hashi-node-1 nomad[2252]: 2020-04-17T15:04:31.522Z [INFO]  client: task exec session starting: exec_id=30508b69-ae92-987a-7952-7d3508d4524a alloc_
Apr 17 15:05:10 hashi-node-1 nomad[2252]: 2020-04-17T15:05:10.692Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:05:10 hashi-node-1 nomad[2252]: 2020-04-17T15:05:10.692Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:05:24 hashi-node-1 nomad[2252]: 2020/04/17 15:05:24.775804 http: response.WriteHeader on hijacked connection from github.com/hashicorp/nomad/vendor/gith
Apr 17 15:05:24 hashi-node-1 nomad[2252]: 2020/04/17 15:05:24.775926 http: response.Write on hijacked connection from compress/gzip.(*Writer).Write (gzip.go:168)
Apr 17 15:05:24 hashi-node-1 nomad[2252]: 2020-04-17T15:05:24.775Z [ERROR] http: request failed: method=GET path=/v1/client/allocation/404f5c07-994a-439c-440e-2ee
Apr 17 15:05:28 hashi-node-1 nomad[2252]: 2020-04-17T15:05:28.737Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:05:28 hashi-node-1 nomad[2252]: 2020-04-17T15:05:28.737Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:05:37 hashi-node-1 nomad[2252]: 2020-04-17T15:05:37.902Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:05:37 hashi-node-1 nomad[2252]: 2020-04-17T15:05:37.902Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:05:57 hashi-node-1 nomad[2252]: 2020/04/17 15:05:57.751442 http: response.WriteHeader on hijacked connection from github.com/hashicorp/nomad/vendor/gith
Apr 17 15:05:57 hashi-node-1 nomad[2252]: 2020/04/17 15:05:57.751568 http: response.Write on hijacked connection from compress/gzip.(*Writer).Write (gzip.go:168)
Apr 17 15:05:57 hashi-node-1 nomad[2252]: 2020-04-17T15:05:57.751Z [ERROR] http: request failed: method=GET path=/v1/client/allocation/404f5c07-994a-439c-440e-2ee
Apr 17 15:06:17 hashi-node-1 nomad[2252]: 2020-04-17T15:06:17.574Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:06:17 hashi-node-1 nomad[2252]: 2020-04-17T15:06:17.574Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:06:17 hashi-node-1 nomad[2252]: 2020-04-17T15:06:17.743Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpo
Apr 17 15:06:17 hashi-node-1 nomad[2252]: 2020-04-17T15:06:17.989Z [ERROR] http: request failed: method=GET path=/v1/namespace/default error="Nomad Enterprise onl
Apr 17 15:06:27 hashi-node-1 nomad[2252]: 2020-04-17T15:06:27.603Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:06:27 hashi-node-1 nomad[2252]: 2020-04-17T15:06:27.603Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:06:45 hashi-node-1 nomad[2252]: 2020-04-17T15:06:45.591Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpo
Apr 17 15:06:45 hashi-node-1 nomad[2252]: 2020-04-17T15:06:45.680Z [INFO]  client: task exec session starting: exec_id=e96750e5-25bc-16d0-428f-d4038a47a8df alloc_
Apr 17 15:12:11 hashi-node-1 nomad[2252]: 2020-04-17T15:12:11.931Z [ERROR] client.rpc: error performing RPC to server: error="RPC Error:: 400,ACL support disabled
Apr 17 15:12:11 hashi-node-1 nomad[2252]: 2020-04-17T15:12:11.931Z [ERROR] http: request failed: method=GET path=/v1/acl/token/self error="RPC Error:: 400,ACL sup
Apr 17 15:14:25 hashi-node-1 nomad[2252]: 2020-04-17T15:14:25.333Z [ERROR] http: request failed: method=GET path=/v1/namespaces error="Nomad Enterprise only endpo
Apr 17 17:11:48 hashi-node-1 nomad[2252]: 2020-04-17T17:11:48.963Z [INFO]  client.gc: garbage collecting allocation: alloc_id=a5cee886-dc0d-244b-da34-bcd8e93cbb01

@tgross (Member) commented Apr 17, 2020

The log from the allocation says:

I0417 15:03:57.348600       1 server.go:106] Start listening with scheme unix, addr /csi.sock
I0417 15:03:57.349121       1 server.go:125] Listening for connections on address: &net.UnixAddr{Name:"/csi.sock", Net:"unix"}

So the plugin is listening on unix:///csi.sock. That is, at the socket file csi.sock in the root directory.

The csi_plugin stanza says:

      csi_plugin {
        id        = "gcepd"
        type      = "node"
        mount_dir = "/csi"
      }

The Nomad client logs you've provided are only the end of the log (that's the -e flag), but if you page up through them, you'll probably find a message about trying to connect on unix:///csi/csi.sock and not finding it.

As I mentioned earlier, if I look at the k8s jobspec it looks like it wants --csi-address and not --endpoint. Each plugin can define its own CLI arguments. So I suspect (but can't confirm without the right logs) that you want the args stanza to be:

args = [
  "--csi-address=unix://csi/csi.sock",
  "--v=6",
  "--logtostderr",
  "--run-node-service=false"
]

@angrycub (Contributor) commented:

When we exec'd into the container, the executable expects the flag to be -endpoint. Is it possible that we are conflating GCE plugins?

@tgross (Member) commented Apr 17, 2020

Could be. No link to the plugin was provided here so I was going from the name. Is there one other than the one in the k8s sigs org that I've linked to?

@auto-store (Author) commented:

I changed to the args suggested. This is the allocation log for the controller, which is stuck in pending:

[tomh@hashi-server-1 ~]$ nomad alloc logs -stderr 75c
flag provided but not defined: -csi-address
Usage of /gce-pd-csi-driver:
  -add_dir_header
    	If true, adds the file directory to the header
  -alsologtostderr
    	log to standard error as well as files
  -cloud-config string
    	Path to GCE cloud provider config
  -endpoint string
    	CSI endpoint (default "unix:/tmp/csi.sock")
  -log_backtrace_at value
    	when logging hits line file:N, emit a stack trace
  -log_dir string
    	If non-empty, write log files in this directory
  -log_file string
    	If non-empty, use this log file
  -log_file_max_size uint
    	Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
  -logtostderr
    	log to standard error instead of files (default true)
  -run-controller-service
    	If set to false then the CSI driver does not activate its controller service (default: true) (default true)
  -run-node-service
    	If set to false then the CSI driver does not activate its node service (default: true) (default true)
  -skip_headers
    	If true, avoid header prefixes in the log messages
  -skip_log_headers
    	If true, avoid headers when opening log files
  -stderrthreshold value
    	logs at or above this threshold go to stderr (default 2)
  -v value
    	number for the log level verbosity
  -vmodule value
    	comma-separated list of pattern=N settings for file-filtered logging
(The same "flag provided but not defined: -csi-address" error and usage output is printed a second time in the log.)

I'm using the same link as you, @tgross.

@tgross (Member) commented Apr 17, 2020

Oh yup you're right, the link I have points to the config for the "provisioner" (their non-CSI sidecar). -endpoint looks right: https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/deploy/kubernetes/base/controller.yaml#L57 😊
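
For reference, combining the -endpoint flag with the mount_dir = "/csi" from the csi_plugin stanza, the socket has to end up inside that directory. A sketch of what the corrected controller args could look like, assuming the plugin accepts this form of the unix socket URI:

args = [
  # -endpoint is the plugin's own flag; the unix:///csi/csi.sock URI form is an assumption
  "--endpoint=unix:///csi/csi.sock",
  "--v=6",
  "--logtostderr",
  "--run-node-service=false"
]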

@angrycub (Contributor) commented:

We were using this container: gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v0.7.0-gke.0 (from here).

I popped into the container and ran the plugin by hand to check the flags.

Usage of /gce-pd-csi-driver:
  -add_dir_header
    	If true, adds the file directory to the header
  -alsologtostderr
    	log to standard error as well as files
  -cloud-config string
    	Path to GCE cloud provider config
  -endpoint string
    	CSI endpoint (default "unix:/tmp/csi.sock")
  -log_backtrace_at value
    	when logging hits line file:N, emit a stack trace
  -log_dir string
    	If non-empty, write log files in this directory
  -log_file string
    	If non-empty, use this log file
  -log_file_max_size uint
    	Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
  -logtostderr
    	log to standard error instead of files (default true)
  -run-controller-service
    	If set to false then the CSI driver does not activate its controller service (default: true) (default true)
  -run-node-service
    	If set to false then the CSI driver does not activate its node service (default: true) (default true)
  -skip_headers
    	If true, avoid header prefixes in the log messages
  -skip_log_headers
    	If true, avoid headers when opening log files
  -stderrthreshold value
    	logs at or above this threshold go to stderr (default 2)
  -v value
    	number for the log level verbosity
  -vmodule value
    	comma-separated list of pattern=N settings for file-filtered logging

Going to try and bring up my own copy.

@angrycub (Contributor) commented:

I was able to bring up a test and have a valid configuration. I encountered an issue that I was able to resolve in #7754, and then I was able to mount a volume as expected. Gist with the working config is here.
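
For readers who can't reach the gist, here is a minimal sketch of the kind of controller jobspec that ends up working, assembled from the pieces discussed earlier in this thread (the credentials template, the -endpoint flag, and a socket path under the csi_plugin mount_dir). The exact endpoint URI form is an assumption; the linked gist remains the authoritative version.

job "controller" {
  datacenters = ["london"]

  group "controller" {
    task "plugin" {
      driver = "docker"

      # Render the GCP service account credentials from Consul, as shown above.
      template {
        data = <<EOH
{{ key "service_account" }}
EOH

        destination = "secrets/creds.json"
      }

      env {
        GOOGLE_APPLICATION_CREDENTIALS = "/secrets/creds.json"
      }

      config {
        image = "gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v0.7.0-gke.0"

        # The plugin's own flag is -endpoint (not --csi-address), and the socket
        # must live under the csi_plugin mount_dir so Nomad can reach it.
        # (The unix:///csi/csi.sock URI form is an assumption.)
        args = [
          "--endpoint=unix:///csi/csi.sock",
          "--v=6",
          "--logtostderr",
          "--run-node-service=false"
        ]
      }

      csi_plugin {
        id        = "gcepd"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

The node jobspec would presumably follow the same pattern as the original system job above, with type = "node" in its csi_plugin stanza, privileged = true in the Docker config, and --run-controller-service=false in place of --run-node-service=false.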

tgross removed the CSI label Apr 21, 2020
tgross added this to the 0.11.1 milestone Apr 22, 2020
@tgross (Member) commented Apr 22, 2020

This will ship in the 0.11.1 release.

@github-actions (bot) commented Nov 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 8, 2022