
core.sched: failed to GC plugin: plugin_id=<plugin> error="rpc error: Permission denied" #11162

Closed
urog opened this issue Sep 9, 2021 · 12 comments

Comments

@urog

urog commented Sep 9, 2021

Nomad version

Output from nomad version
Nomad v1.1.4 (acd3d7889328ad1df2895eb714e2cbe3dd9c6d82)

Operating system and Environment details

Linux 5.4.0-1051-gcp #55~18.04.1-Ubuntu SMP Sun Aug 1 20:38:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Issue

I am seeing these errors in my server logs over and over:

Sep  9 05:34:23 dev-consul-nomad-servers-qwzl nomad[1995]:     2021-09-09T05:34:23.567Z [ERROR] core.sched: failed to GC plugin: plugin_id=gcepd error="rpc error: Permission denied"
Sep  9 05:34:23 dev-consul-nomad-servers-qwzl nomad[1995]:     2021-09-09T05:34:23.567Z [ERROR] worker: error invoking scheduler: error="failed to process evaluation: rpc error: Permission denied"

The plugin the error refers to is the CSI plugin pd.csi.storage.gke.io, and I've tried versions 0.7.0 through 1.2.0 - all yield the same result. Scheduling jobs with a CSI volume mount works just fine. The issue is that when the job is stopped/purged, the volume cannot be mounted by any other job because Nomad thinks it's still allocated to the previous job.
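For context, the consuming job claims the volume with a standard CSI volume block, roughly like this (a trimmed sketch - the group layout, image and mount path are placeholders, not my exact jobspec):

group "redis" {
  # Claim the registered CSI volume for this group (sketch; the volume
  # itself is registered separately against the "gcepd" plugin).
  volume "redis-dev" {
    type   = "csi"
    source = "redis-dev"
  }

  task "redis" {
    driver = "docker"

    config {
      image = "redis:6"   # placeholder image
    }

    # Mount the claimed volume into the task.
    volume_mount {
      volume      = "redis-dev"
      destination = "/data"   # placeholder mount path
    }
  }
}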

I've been experiencing this since around Nomad v0.12.* and was following a related issue, hoping those fixes would resolve it. Nothing has changed.

Nomad has ACLs enabled, and the anonymous policy is disabled.
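For reference, the server agents have roughly this in their config (a minimal sketch - token bootstrap and policies are managed separately):

acl {
  # ACLs are enforced; with no anonymous policy written, unauthenticated
  # requests are denied by default.
  enabled = true
}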

@lgfa29
Contributor

lgfa29 commented Sep 14, 2021

Hi @urog, thanks for the report.

Do you see any error in the plugin logs? Either the node or the controller?

Thank you.

@urog
Author

urog commented Sep 15, 2021

Thanks @lgfa29

Some more logs. There are no logs on either the CSI controller or the nodes that correspond to the errors on the Nomad servers.

Server

2021-09-15T00:38:58.504Z [DEBUG] core.sched: eval GC scanning before cutoff index: index=4645340 eval_gc_threshold=1h0m0s
2021-09-15T00:38:58.510Z [DEBUG] core.sched: eval GC found eligibile objects: evals=6 allocs=0
2021-09-15T00:38:58.513Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=redis-dev error="2 errors occurred:
	* Permission denied
	* Permission denied

"

CSI Controller
Here are some boot logs. The last few lines are repeated every 30 seconds.

I0915 00:54:01.351887       1 main.go:68] Driver vendor version v1.2.0-gke.0-0-gbd7b8c6-dirty
I0915 00:54:01.351936       1 gce.go:84] Using GCE provider config <nil>
I0915 00:54:01.352024       1 gce.go:135] GOOGLE_APPLICATION_CREDENTIALS env var set /secrets/creds.json
I0915 00:54:01.352034       1 gce.go:139] Using DefaultTokenSource &oauth2.reuseTokenSource{new:jwt.jwtSource{ctx:(*context.cancelCtx)(0xc000350800), conf:(*jwt.Config)(0xc000168aa0)}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
I0915 00:54:01.683868       1 gce.go:216] Using GCP zone from the Metadata server: "australia-southeast1-b"
I0915 00:54:01.684563       1 gce.go:231] Using GCP project ID from the Metadata server: "<snip>"
I0915 00:54:01.684584       1 gce-pd-driver.go:90] Enabling volume access mode: SINGLE_NODE_WRITER
I0915 00:54:01.684590       1 gce-pd-driver.go:90] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0915 00:54:01.684597       1 gce-pd-driver.go:90] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0915 00:54:01.684602       1 gce-pd-driver.go:100] Enabling controller service capability: CREATE_DELETE_VOLUME
I0915 00:54:01.684607       1 gce-pd-driver.go:100] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0915 00:54:01.684611       1 gce-pd-driver.go:100] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0915 00:54:01.684616       1 gce-pd-driver.go:100] Enabling controller service capability: LIST_SNAPSHOTS
I0915 00:54:01.684619       1 gce-pd-driver.go:100] Enabling controller service capability: PUBLISH_READONLY
I0915 00:54:01.684624       1 gce-pd-driver.go:100] Enabling controller service capability: EXPAND_VOLUME
I0915 00:54:01.684628       1 gce-pd-driver.go:100] Enabling controller service capability: LIST_VOLUMES
I0915 00:54:01.684632       1 gce-pd-driver.go:100] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0915 00:54:01.684636       1 gce-pd-driver.go:110] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0915 00:54:01.684651       1 gce-pd-driver.go:110] Enabling node service capability: EXPAND_VOLUME
I0915 00:54:01.684655       1 gce-pd-driver.go:110] Enabling node service capability: GET_VOLUME_STATS
I0915 00:54:01.684660       1 gce-pd-driver.go:157] Driver: pd.csi.storage.gke.io
I0915 00:54:01.684736       1 server.go:106] Start listening with scheme unix, addr /csi/csi.sock
I0915 00:54:01.684917       1 server.go:125] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0915 00:54:07.407516       1 utils.go:55] /csi.v1.Identity/Probe called with request: 
I0915 00:54:07.407554       1 utils.go:60] /csi.v1.Identity/Probe returned with response: 
I0915 00:54:07.415457       1 utils.go:55] /csi.v1.Identity/GetPluginInfo called with request: 
I0915 00:54:07.415478       1 utils.go:60] /csi.v1.Identity/GetPluginInfo returned with response: name:"pd.csi.storage.gke.io" vendor_version:"v1.2.0-gke.0-0-gbd7b8c6-dirty" 
I0915 00:54:07.416343       1 utils.go:55] /csi.v1.Identity/GetPluginCapabilities called with request: 
I0915 00:54:07.416363       1 utils.go:60] /csi.v1.Identity/GetPluginCapabilities returned with response: capabilities:<service:<type:CONTROLLER_SERVICE > > capabilities:<service:<type:VOLUME_ACCESSIBILITY_CONSTRAINTS > > capabilities:<volume_expansion:<type:ONLINE > > capabilities:<volume_expansion:<type:OFFLINE > > 
I0915 00:54:07.416848       1 utils.go:55] /csi.v1.Identity/Probe called with request: 
I0915 00:54:07.416910       1 utils.go:60] /csi.v1.Identity/Probe returned with response: 
I0915 00:54:07.417141       1 utils.go:55] /csi.v1.Controller/ControllerGetCapabilities called with request: 
I0915 00:54:07.417247       1 utils.go:55] /csi.v1.Identity/Probe called with request: 
I0915 00:54:07.417635       1 utils.go:60] /csi.v1.Identity/Probe returned with response: 
I0915 00:54:07.417163       1 utils.go:60] /csi.v1.Controller/ControllerGetCapabilities returned with response: capabilities:<rpc:<type:CREATE_DELETE_VOLUME > > capabilities:<rpc:<type:PUBLISH_UNPUBLISH_VOLUME > > capabilities:<rpc:<type:CREATE_DELETE_SNAPSHOT > > capabilities:<rpc:<type:LIST_SNAPSHOTS > > capabilities:<rpc:<type:PUBLISH_READONLY > > capabilities:<rpc:<type:EXPAND_VOLUME > > capabilities:<rpc:<type:LIST_VOLUMES > > capabilities:<rpc:<type:LIST_VOLUMES_PUBLISHED_NODES > > 

CSI Client

These are the only logs that come through; the last few repeat every 30 seconds.

I0915 00:53:48.974939       1 main.go:68] Driver vendor version v1.2.0-gke.0-0-gbd7b8c6-dirty
I0915 00:53:48.975045       1 mount_linux.go:163] Detected OS without systemd
I0915 00:53:49.000182       1 gce-pd-driver.go:90] Enabling volume access mode: SINGLE_NODE_WRITER
I0915 00:53:49.000198       1 gce-pd-driver.go:90] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0915 00:53:49.000201       1 gce-pd-driver.go:90] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0915 00:53:49.000205       1 gce-pd-driver.go:100] Enabling controller service capability: CREATE_DELETE_VOLUME
I0915 00:53:49.000209       1 gce-pd-driver.go:100] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0915 00:53:49.000212       1 gce-pd-driver.go:100] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0915 00:53:49.000214       1 gce-pd-driver.go:100] Enabling controller service capability: LIST_SNAPSHOTS
I0915 00:53:49.000217       1 gce-pd-driver.go:100] Enabling controller service capability: PUBLISH_READONLY
I0915 00:53:49.000219       1 gce-pd-driver.go:100] Enabling controller service capability: EXPAND_VOLUME
I0915 00:53:49.000222       1 gce-pd-driver.go:100] Enabling controller service capability: LIST_VOLUMES
I0915 00:53:49.000224       1 gce-pd-driver.go:100] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0915 00:53:49.000230       1 gce-pd-driver.go:110] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0915 00:53:49.000237       1 gce-pd-driver.go:110] Enabling node service capability: EXPAND_VOLUME
I0915 00:53:49.000240       1 gce-pd-driver.go:110] Enabling node service capability: GET_VOLUME_STATS
I0915 00:53:49.000245       1 gce-pd-driver.go:157] Driver: pd.csi.storage.gke.io
I0915 00:53:49.000361       1 server.go:106] Start listening with scheme unix, addr /csi/csi.sock
I0915 00:53:49.000551       1 server.go:125] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0915 00:53:49.010802       1 utils.go:55] /csi.v1.Identity/Probe called with request: 
I0915 00:53:49.010830       1 utils.go:60] /csi.v1.Identity/Probe returned with response: 
I0915 00:53:49.012545       1 utils.go:55] /csi.v1.Identity/GetPluginInfo called with request: 
I0915 00:53:49.012562       1 utils.go:60] /csi.v1.Identity/GetPluginInfo returned with response: name:"pd.csi.storage.gke.io" vendor_version:"v1.2.0-gke.0-0-gbd7b8c6-dirty" 
I0915 00:53:49.013352       1 utils.go:55] /csi.v1.Identity/GetPluginCapabilities called with request: 
I0915 00:53:49.013385       1 utils.go:60] /csi.v1.Identity/GetPluginCapabilities returned with response: capabilities:<service:<type:CONTROLLER_SERVICE > > capabilities:<service:<type:VOLUME_ACCESSIBILITY_CONSTRAINTS > > capabilities:<volume_expansion:<type:ONLINE > > capabilities:<volume_expansion:<type:OFFLINE > > 
I0915 00:53:49.014787       1 utils.go:55] /csi.v1.Node/NodeGetInfo called with request: 
I0915 00:53:49.014847       1 utils.go:60] /csi.v1.Node/NodeGetInfo returned with response: node_id:"projects/<snip>/zones/australia-southeast1-a/instances/dev-consul-nomad-clients-1zk0" max_volumes_per_node:127 accessible_topology:<segments:<key:"topology.gke.io/zone" value:"australia-southeast1-a" > > 
I0915 00:53:49.015592       1 utils.go:55] /csi.v1.Identity/Probe called with request: 
I0915 00:53:49.015666       1 utils.go:60] /csi.v1.Identity/Probe returned with response: 
I0915 00:53:49.015606       1 utils.go:55] /csi.v1.Identity/Probe called with request: 
I0915 00:53:49.015878       1 utils.go:60] /csi.v1.Identity/Probe returned with response: 
I0915 00:53:49.016138       1 utils.go:55] /csi.v1.Node/NodeGetCapabilities called with request: 
I0915 00:53:49.016158       1 utils.go:60] /csi.v1.Node/NodeGetCapabilities returned with response: capabilities:<rpc:<type:STAGE_UNSTAGE_VOLUME > > capabilities:<rpc:<type:EXPAND_VOLUME > > capabilities:<rpc:<type:GET_VOLUME_STATS > > 

@lgfa29
Contributor

lgfa29 commented Sep 16, 2021

Thanks for the logs. Unfortunately there's not much there, and Permission denied is such a generic error that it's hard to pinpoint the issue.

Would you mind increasing the plugin verbosity to see if that provides more clues?

From the plugin source code it seems like you can go all the way to -v=6.

Thanks!

@urog
Author

urog commented Sep 20, 2021

I've run both the controller and nodes with full verbosity. I don't actually see any corresponding events on the CSI controller or node logs when Nomad fails to release the volume.

Here's how I've deployed the CSI controller and nodes:

Controller

job "csi-cge-controller" {
  datacenters = [
    "australia-southeast1-a",
    "australia-southeast1-b",
    "australia-southeast1-c",
  ]

  priority = 100

  group "controller" {
    count = 2
    task "plugin" {

      driver = "docker"

      template {
        data = <<EOH
{{ key "platform/csi/gce/service_account" }}
EOH
        destination = "secrets/creds.json"
      }

      env {
        "GOOGLE_APPLICATION_CREDENTIALS" = "/secrets/creds.json"
      }

      config {
        image = "gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v1.2.0-gke.0"
        args = [
          "-endpoint=unix:///csi/csi.sock",
          "-v=6",
          "-logtostderr",
          "-run-node-service=false"
        ]
        labels {
          app_name = "csi-cge-controller"
          environment = "dev"
        }
      }

      csi_plugin {
        id        = "gcepd"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 128
      }
    }
  }
}

Nodes

job "csi-gce-nodes" {
  datacenters = [
    "australia-southeast1-a",
    "australia-southeast1-b",
    "australia-southeast1-c",
  ]

  type = "system"
  priority = 100

  # only one plugin of a given type and ID should be deployed on
  # any given client node
  constraint {
    operator = "distinct_hosts"
    value = true
  }

  group "nodes" {
    task "plugin" {
      driver = "docker"
      template {
        data = <<EOH
{{ key "platform/csi/gce/service_account" }}
EOH
        destination = "secrets/creds.json"
      }

      env {
        "GOOGLE_APPLICATION_CREDENTIALS" = "/secrets/creds.json"
      }

      config {
        image = "gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v1.2.0-gke.0"
        args = [
          "-endpoint=unix:///csi/csi.sock",
          "-v=6",
          "-logtostderr",
          "-run-controller-service=false"
        ]
        privileged = true
        labels {
          app_name = "csi-cge-nodes"
          environment = "dev"
        }
      }

      csi_plugin {
        id        = "gcepd"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 175
      }
    }
  }
}
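
And for completeness, the volume itself is registered against the same plugin ID with nomad volume register; the spec for redis-dev looks roughly like this (a sketch - the external disk path and capability values are placeholders from memory, not the exact spec):

# Volume registration spec (sketch) - ties the "redis-dev" volume to the
# "gcepd" plugin deployed by the jobs above.
id          = "redis-dev"
name        = "redis-dev"
type        = "csi"
plugin_id   = "gcepd"
external_id = "projects/<project>/zones/australia-southeast1-a/disks/redis-dev"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}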

@urog
Author

urog commented Sep 20, 2021

And these are the logs from the Nomad servers, repeating over and over:

Sep 20 06:03:58 dev-consul-nomad-servers-7gbk nomad[2195]:     2021-09-20T06:03:58.513Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=redis-dev error="2 errors occurred:
Sep 20 06:03:58 dev-consul-nomad-servers-7gbk nomad[2195]:         * Permission denied
Sep 20 06:03:58 dev-consul-nomad-servers-7gbk nomad[2195]: "
Sep 20 06:08:58 dev-consul-nomad-servers-7gbk nomad[2195]:     2021-09-20T06:08:58.510Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=redis-dev error="2 errors occurred:
Sep 20 06:08:58 dev-consul-nomad-servers-7gbk nomad[2195]:         * Permission denied
Sep 20 06:08:58 dev-consul-nomad-servers-7gbk nomad[2195]: "
Sep 20 06:13:58 dev-consul-nomad-servers-7gbk nomad[2195]:     2021-09-20T06:13:58.512Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=redis-dev error="2 errors occurred:
Sep 20 06:13:58 dev-consul-nomad-servers-7gbk nomad[2195]:         * Permission denied
Sep 20 06:13:58 dev-consul-nomad-servers-7gbk nomad[2195]: "

@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Nov 9, 2021
@tgross
Member

tgross commented Jan 31, 2022

Just doing some issue cleanup and saw this issue. I want to note that this error:

2021-09-15T00:38:58.513Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=redis-dev error="2 errors occurred:
	* Permission denied
	* Permission denied

"

Should be fixed in #11891, which will ship in the upcoming Nomad 1.2.5. That's unrelated to the original problem in this issue, which is:

Sep 9 05:34:23 dev-consul-nomad-servers-qwzl nomad[1995]: 2021-09-09T05:34:23.567Z [ERROR] core.sched: failed to GC plugin: plugin_id=gcepd error="rpc error: Permission denied"
Sep 9 05:34:23 dev-consul-nomad-servers-qwzl nomad[1995]: 2021-09-09T05:34:23.567Z [ERROR] worker: error invoking scheduler: error="failed to process evaluation: rpc error: Permission denied"

That error looks like evals have somehow been created with the wrong leader ACL token.

@urog
Author

urog commented Feb 3, 2022

Just tested on Nomad 1.2.5 and it appears to be working. I will test some node draining / job migrations and report back.

@tgross
Member

tgross commented Feb 3, 2022

@urog for plugins, not just volumes?

@tgross tgross added stage/waiting-reply stage/needs-verification Issue needs verifying it still exists and removed stage/needs-investigation labels Feb 3, 2022
@tgross tgross self-assigned this Feb 3, 2022
@urog
Author

urog commented Feb 7, 2022

I have tested:

  • Rolling new Nomad servers into an existing server cluster, and removing old servers
  • Abruptly terminating clients running Nomad jobs with a volume
  • Gracefully draining clients
  • Force-killing both CSI controllers and agents

All resulted in successful volume mounting / claiming. One thing to note: a couple of times, when a job or node was killed and Nomad was trying to place the job on another node, the following message appeared in the logs:

failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: could not claim volume redis-dev: rpc error: rpc error: controller publish: attach volume: controller attach volume: failed to find clients running controller plugin "gcepd"

This happened even though there were nodes running in the same availability zone as the volume, and the CSI agent was also running on those nodes.

@tgross
Member

tgross commented Feb 8, 2022

All resulted in successful volume mounting / claiming.

That's great! But I asked about plugin GC, which was the only open topic in this issue.

It looks like we don't have any more data on plugin GC here, so I'm going to close this issue out so that we don't get side-tracked. There are some open issues around plugin counts and health that I'm still working through, like #11758, #9810, #10073, and #11784. If folks have more data to add about plugins, those issues are the best place to add it. Thanks!

@tgross tgross closed this as completed Feb 8, 2022
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Feb 8, 2022
@urog
Author

urog commented Feb 8, 2022

Sorry - I was a bit carried away by it all working! The original errors that I posted are all gone. They have been replaced with the following new logs:

2022-02-08T20:56:19.380Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval \"7920f269-0494-c4fa-3a41-ec411edb3cb2\" JobID: \"csi-plugin-gc\" Namespace: \"-\">"
2022-02-08T20:56:40.391Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval \"fa8cd038-d119-678b-8f61-f9ab39101038\" JobID: \"csi-plugin-gc\" Namespace: \"-\">"
2022-02-08T20:57:01.402Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval \"b4ce5512-d6ed-c757-dff8-e5e791920ef5\" JobID: \"csi-plugin-gc\" Namespace: \"-\">"
2022-02-08T20:57:22.410Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval \"7aeb9042-d7fa-4a93-0f6b-4a19beadcb8b\" JobID: \"csi-plugin-gc\" Namespace: \"-\">"

These errors are still present:

2022-02-08T21:00:10.499Z [ERROR] core.sched: failed to GC plugin: plugin_id=gcepd error="Permission denied"
2022-02-08T21:00:10.499Z [ERROR] worker: error invoking scheduler: worker_id=7f77d61b-5b78-1ac6-4bb5-d714ac5111a2 error="failed to process evaluation: Permission denied"

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022