
core.sched: failed to GC plugin: plugin_id=<plugin> error="rpc error: Permission denied" #11162

Closed
urog opened this issue Sep 9, 2021 · 12 comments

Comments

@urog

urog commented Sep 9, 2021

Nomad version

Output from nomad version
Nomad v1.1.4 (acd3d7889328ad1df2895eb714e2cbe3dd9c6d82)

Operating system and Environment details

Linux 5.4.0-1051-gcp #55~18.04.1-Ubuntu SMP Sun Aug 1 20:38:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Issue

I am seeing these errors in my server logs over and over:

Sep  9 05:34:23 dev-consul-nomad-servers-qwzl nomad[1995]:     2021-09-09T05:34:23.567Z [ERROR] core.sched: failed to GC plugin: plugin_id=gcepd error="rpc error: Permission denied"
Sep  9 05:34:23 dev-consul-nomad-servers-qwzl nomad[1995]:     2021-09-09T05:34:23.567Z [ERROR] worker: error invoking scheduler: error="failed to process evaluation: rpc error: Permission denied"

The plugin the error refers to is the CSI plugin pd.csi.storage.gke.io, and I've tried versions 0.7.0 through 1.2.0 - all yield the same result. Scheduling jobs with a CSI volume mount works just fine. The issue is that when the job is stopped/purged, the volume cannot be mounted by any other job because Nomad thinks it's still allocated to the previous job.
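For context, the consuming job claims the volume with a standard CSI volume block, roughly like this (a trimmed sketch - the group layout, image and mount path are placeholders, not my exact jobspec):

group "redis" {
  # Claim the registered CSI volume for this group (sketch; the volume
  # itself is registered separately against the "gcepd" plugin).
  volume "redis-dev" {
    type   = "csi"
    source = "redis-dev"
  }

  task "redis" {
    driver = "docker"

    config {
      image = "redis:6"   # placeholder image
    }

    # Mount the claimed volume into the task.
    volume_mount {
      volume      = "redis-dev"
      destination = "/data"   # placeholder mount path
    }
  }
}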

I've been experiencing this since around Nomad v0.12.* and was following a related issue, hoping those fixes would resolve it. Nothing has changed.

Nomad has ACLs enabled, and the anonymous policy is disabled.
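For reference, the server agents have roughly this in their config (a minimal sketch - token bootstrap and policies are managed separately):

acl {
  # ACLs are enforced; with no anonymous policy written, unauthenticated
  # requests are denied by default.
  enabled = true
}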

@lgfa29
Contributor

lgfa29 commented Sep 14, 2021

Hi @urog, thanks for the report.

Do you see any error in the plugin logs? Either the node or the controller?

Thank you.

@urog
Author

urog commented Sep 15, 2021

Thanks @lgfa29

Some more logs. There are no logs on either the CSI controller or the nodes that correspond to the errors on the Nomad servers.

Server

2021-09-15T00:38:58.504Z [DEBUG] core.sched: eval GC scanning before cutoff index: index=4645340 eval_gc_threshold=1h0m0s
2021-09-15T00:38:58.510Z [DEBUG] core.sched: eval GC found eligibile objects: evals=6 allocs=0
2021-09-15T00:38:58.513Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=redis-dev error="2 errors occurred:
	* Permission denied
	* Permission denied

"

CSI Controller
Here are some boot logs. The last few lines are repeated every 30 seconds.

I0915 00:54:01.351887       1 main.go:68] Driver vendor version v1.2.0-gke.0-0-gbd7b8c6-dirty
I0915 00:54:01.351936       1 gce.go:84] Using GCE provider config <nil>
I0915 00:54:01.352024       1 gce.go:135] GOOGLE_APPLICATION_CREDENTIALS env var set /secrets/creds.json
I0915 00:54:01.352034       1 gce.go:139] Using DefaultTokenSource &oauth2.reuseTokenSource{new:jwt.jwtSource{ctx:(*context.cancelCtx)(0xc000350800), conf:(*jwt.Config)(0xc000168aa0)}, mu:sync.Mutex{state:0, sema:0x0}, t:(*oauth2.Token)(nil)}
I0915 00:54:01.683868       1 gce.go:216] Using GCP zone from the Metadata server: "australia-southeast1-b"
I0915 00:54:01.684563       1 gce.go:231] Using GCP project ID from the Metadata server: "<snip>"
I0915 00:54:01.684584       1 gce-pd-driver.go:90] Enabling volume access mode: SINGLE_NODE_WRITER
I0915 00:54:01.684590       1 gce-pd-driver.go:90] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0915 00:54:01.684597       1 gce-pd-driver.go:90] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0915 00:54:01.684602       1 gce-pd-driver.go:100] Enabling controller service capability: CREATE_DELETE_VOLUME
I0915 00:54:01.684607       1 gce-pd-driver.go:100] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0915 00:54:01.684611       1 gce-pd-driver.go:100] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0915 00:54:01.684616       1 gce-pd-driver.go:100] Enabling controller service capability: LIST_SNAPSHOTS
I0915 00:54:01.684619       1 gce-pd-driver.go:100] Enabling controller service capability: PUBLISH_READONLY
I0915 00:54:01.684624       1 gce-pd-driver.go:100] Enabling controller service capability: EXPAND_VOLUME
I0915 00:54:01.684628       1 gce-pd-driver.go:100] Enabling controller service capability: LIST_VOLUMES
I0915 00:54:01.684632       1 gce-pd-driver.go:100] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0915 00:54:01.684636       1 gce-pd-driver.go:110] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0915 00:54:01.684651       1 gce-pd-driver.go:110] Enabling node service capability: EXPAND_VOLUME
I0915 00:54:01.684655       1 gce-pd-driver.go:110] Enabling node service capability: GET_VOLUME_STATS
I0915 00:54:01.684660       1 gce-pd-driver.go:157] Driver: pd.csi.storage.gke.io
I0915 00:54:01.684736       1 server.go:106] Start listening with scheme unix, addr /csi/csi.sock
I0915 00:54:01.684917       1 server.go:125] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0915 00:54:07.407516       1 utils.go:55] /csi.v1.Identity/Probe called with request: 
I0915 00:54:07.407554       1 utils.go:60] /csi.v1.Identity/Probe returned with response: 
I0915 00:54:07.415457       1 utils.go:55] /csi.v1.Identity/GetPluginInfo called with request: 
I0915 00:54:07.415478       1 utils.go:60] /csi.v1.Identity/GetPluginInfo returned with response: name:"pd.csi.storage.gke.io" vendor_version:"v1.2.0-gke.0-0-gbd7b8c6-dirty" 
I0915 00:54:07.416343       1 utils.go:55] /csi.v1.Identity/GetPluginCapabilities called with request: 
I0915 00:54:07.416363       1 utils.go:60] /csi.v1.Identity/GetPluginCapabilities returned with response: capabilities:<service:<type:CONTROLLER_SERVICE > > capabilities:<service:<type:VOLUME_ACCESSIBILITY_CONSTRAINTS > > capabilities:<volume_expansion:<type:ONLINE > > capabilities:<volume_expansion:<type:OFFLINE > > 
I0915 00:54:07.416848       1 utils.go:55] /csi.v1.Identity/Probe called with request: 
I0915 00:54:07.416910       1 utils.go:60] /csi.v1.Identity/Probe returned with response: 
I0915 00:54:07.417141       1 utils.go:55] /csi.v1.Controller/ControllerGetCapabilities called with request: 
I0915 00:54:07.417247       1 utils.go:55] /csi.v1.Identity/Probe called with request: 
I0915 00:54:07.417635       1 utils.go:60] /csi.v1.Identity/Probe returned with response: 
I0915 00:54:07.417163       1 utils.go:60] /csi.v1.Controller/ControllerGetCapabilities returned with response: capabilities:<rpc:<type:CREATE_DELETE_VOLUME > > capabilities:<rpc:<type:PUBLISH_UNPUBLISH_VOLUME > > capabilities:<rpc:<type:CREATE_DELETE_SNAPSHOT > > capabilities:<rpc:<type:LIST_SNAPSHOTS > > capabilities:<rpc:<type:PUBLISH_READONLY > > capabilities:<rpc:<type:EXPAND_VOLUME > > capabilities:<rpc:<type:LIST_VOLUMES > > capabilities:<rpc:<type:LIST_VOLUMES_PUBLISHED_NODES > > 

CSI Client

These are the only logs that come through; the last few repeat every 30 seconds.

I0915 00:53:48.974939       1 main.go:68] Driver vendor version v1.2.0-gke.0-0-gbd7b8c6-dirty
I0915 00:53:48.975045       1 mount_linux.go:163] Detected OS without systemd
I0915 00:53:49.000182       1 gce-pd-driver.go:90] Enabling volume access mode: SINGLE_NODE_WRITER
I0915 00:53:49.000198       1 gce-pd-driver.go:90] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0915 00:53:49.000201       1 gce-pd-driver.go:90] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0915 00:53:49.000205       1 gce-pd-driver.go:100] Enabling controller service capability: CREATE_DELETE_VOLUME
I0915 00:53:49.000209       1 gce-pd-driver.go:100] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0915 00:53:49.000212       1 gce-pd-driver.go:100] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0915 00:53:49.000214       1 gce-pd-driver.go:100] Enabling controller service capability: LIST_SNAPSHOTS
I0915 00:53:49.000217       1 gce-pd-driver.go:100] Enabling controller service capability: PUBLISH_READONLY
I0915 00:53:49.000219       1 gce-pd-driver.go:100] Enabling controller service capability: EXPAND_VOLUME
I0915 00:53:49.000222       1 gce-pd-driver.go:100] Enabling controller service capability: LIST_VOLUMES
I0915 00:53:49.000224       1 gce-pd-driver.go:100] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0915 00:53:49.000230       1 gce-pd-driver.go:110] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0915 00:53:49.000237       1 gce-pd-driver.go:110] Enabling node service capability: EXPAND_VOLUME
I0915 00:53:49.000240       1 gce-pd-driver.go:110] Enabling node service capability: GET_VOLUME_STATS
I0915 00:53:49.000245       1 gce-pd-driver.go:157] Driver: pd.csi.storage.gke.io
I0915 00:53:49.000361       1 server.go:106] Start listening with scheme unix, addr /csi/csi.sock
I0915 00:53:49.000551       1 server.go:125] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0915 00:53:49.010802       1 utils.go:55] /csi.v1.Identity/Probe called with request: 
I0915 00:53:49.010830       1 utils.go:60] /csi.v1.Identity/Probe returned with response: 
I0915 00:53:49.012545       1 utils.go:55] /csi.v1.Identity/GetPluginInfo called with request: 
I0915 00:53:49.012562       1 utils.go:60] /csi.v1.Identity/GetPluginInfo returned with response: name:"pd.csi.storage.gke.io" vendor_version:"v1.2.0-gke.0-0-gbd7b8c6-dirty" 
I0915 00:53:49.013352       1 utils.go:55] /csi.v1.Identity/GetPluginCapabilities called with request: 
I0915 00:53:49.013385       1 utils.go:60] /csi.v1.Identity/GetPluginCapabilities returned with response: capabilities:<service:<type:CONTROLLER_SERVICE > > capabilities:<service:<type:VOLUME_ACCESSIBILITY_CONSTRAINTS > > capabilities:<volume_expansion:<type:ONLINE > > capabilities:<volume_expansion:<type:OFFLINE > > 
I0915 00:53:49.014787       1 utils.go:55] /csi.v1.Node/NodeGetInfo called with request: 
I0915 00:53:49.014847       1 utils.go:60] /csi.v1.Node/NodeGetInfo returned with response: node_id:"projects/<snip>/zones/australia-southeast1-a/instances/dev-consul-nomad-clients-1zk0" max_volumes_per_node:127 accessible_topology:<segments:<key:"topology.gke.io/zone" value:"australia-southeast1-a" > > 
I0915 00:53:49.015592       1 utils.go:55] /csi.v1.Identity/Probe called with request: 
I0915 00:53:49.015666       1 utils.go:60] /csi.v1.Identity/Probe returned with response: 
I0915 00:53:49.015606       1 utils.go:55] /csi.v1.Identity/Probe called with request: 
I0915 00:53:49.015878       1 utils.go:60] /csi.v1.Identity/Probe returned with response: 
I0915 00:53:49.016138       1 utils.go:55] /csi.v1.Node/NodeGetCapabilities called with request: 
I0915 00:53:49.016158       1 utils.go:60] /csi.v1.Node/NodeGetCapabilities returned with response: capabilities:<rpc:<type:STAGE_UNSTAGE_VOLUME > > capabilities:<rpc:<type:EXPAND_VOLUME > > capabilities:<rpc:<type:GET_VOLUME_STATS > > 

@lgfa29
Contributor

lgfa29 commented Sep 16, 2021

Thanks for the logs. Unfortunately there's not much there, and Permission denied is such a generic error that it's hard to pinpoint the issue.

Would you mind increasing the plugin verbosity to see if that provides more clues?

From the plugin source code it seems like you can go all the way to -v=6.

Thanks!

@urog
Author

urog commented Sep 20, 2021

I've run both the controller and nodes with full verbosity. I don't actually see any corresponding events on the CSI controller or node logs when Nomad fails to release the volume.

Here's how I've deployed the CSI controller and nodes:

Controller

job "csi-cge-controller" {
  datacenters = [
    "australia-southeast1-a",
    "australia-southeast1-b",
    "australia-southeast1-c",
  ]

  priority = 100

  group "controller" {
    count = 2
    task "plugin" {

      driver = "docker"

      template {
        data = <<EOH
{{ key "platform/csi/gce/service_account" }}
EOH
        destination = "secrets/creds.json"
      }

      env {
        "GOOGLE_APPLICATION_CREDENTIALS" = "/secrets/creds.json"
      }

      config {
        image = "gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v1.2.0-gke.0"
        args = [
          "-endpoint=unix:///csi/csi.sock",
          "-v=6",
          "-logtostderr",
          "-run-node-service=false"
        ]
        labels {
          app_name = "csi-cge-controller"
          environment = "dev"
        }
      }

      csi_plugin {
        id        = "gcepd"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 128
      }
    }
  }
}

Nodes

job "csi-gce-nodes" {
  datacenters = [
    "australia-southeast1-a",
    "australia-southeast1-b",
    "australia-southeast1-c",
  ]

  type = "system"
  priority = 100

  # only one plugin of a given type and ID should be deployed on
  # any given client node
  constraint {
    operator = "distinct_hosts"
    value = true
  }

  group "nodes" {
    task "plugin" {
      driver = "docker"
      template {
        data = <<EOH
{{ key "platform/csi/gce/service_account" }}
EOH
        destination = "secrets/creds.json"
      }

      env {
        "GOOGLE_APPLICATION_CREDENTIALS" = "/secrets/creds.json"
      }

      config {
        image = "gcr.io/gke-release/gcp-compute-persistent-disk-csi-driver:v1.2.0-gke.0"
        args = [
          "-endpoint=unix:///csi/csi.sock",
          "-v=6",
          "-logtostderr",
          "-run-controller-service=false"
        ]
        privileged = true
        labels {
          app_name = "csi-cge-nodes"
          environment = "dev"
        }
      }

      csi_plugin {
        id        = "gcepd"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 175
      }
    }
  }
}
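
And for completeness, the volume itself is registered against the same plugin ID with nomad volume register; the spec for redis-dev looks roughly like this (a sketch - the external disk path and capability values are placeholders from memory, not the exact spec):

# Volume registration spec (sketch) - ties the "redis-dev" volume to the
# "gcepd" plugin deployed by the jobs above.
id          = "redis-dev"
name        = "redis-dev"
type        = "csi"
plugin_id   = "gcepd"
external_id = "projects/<project>/zones/australia-southeast1-a/disks/redis-dev"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}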

@urog
Author

urog commented Sep 20, 2021

And these are the logs from the Nomad servers, repeating over and over:

Sep 20 06:03:58 dev-consul-nomad-servers-7gbk nomad[2195]:     2021-09-20T06:03:58.513Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=redis-dev error="2 errors occurred:
Sep 20 06:03:58 dev-consul-nomad-servers-7gbk nomad[2195]:         * Permission denied
Sep 20 06:03:58 dev-consul-nomad-servers-7gbk nomad[2195]: "
Sep 20 06:08:58 dev-consul-nomad-servers-7gbk nomad[2195]:     2021-09-20T06:08:58.510Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=redis-dev error="2 errors occurred:
Sep 20 06:08:58 dev-consul-nomad-servers-7gbk nomad[2195]:         * Permission denied
Sep 20 06:08:58 dev-consul-nomad-servers-7gbk nomad[2195]: "
Sep 20 06:13:58 dev-consul-nomad-servers-7gbk nomad[2195]:     2021-09-20T06:13:58.512Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=redis-dev error="2 errors occurred:
Sep 20 06:13:58 dev-consul-nomad-servers-7gbk nomad[2195]:         * Permission denied
Sep 20 06:13:58 dev-consul-nomad-servers-7gbk nomad[2195]: "

@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Nov 9, 2021
@tgross
Member

tgross commented Jan 31, 2022

Just doing some issue cleanup and saw this issue. I want to note that this error:

2021-09-15T00:38:58.513Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=redis-dev error="2 errors occurred:
	* Permission denied
	* Permission denied

"

Should be fixed in #11891, which will ship in the upcoming Nomad 1.2.5. That's unrelated to the original problem in this issue, which is:

Sep 9 05:34:23 dev-consul-nomad-servers-qwzl nomad[1995]: 2021-09-09T05:34:23.567Z [ERROR] core.sched: failed to GC plugin: plugin_id=gcepd error="rpc error: Permission denied"
Sep 9 05:34:23 dev-consul-nomad-servers-qwzl nomad[1995]: 2021-09-09T05:34:23.567Z [ERROR] worker: error invoking scheduler: error="failed to process evaluation: rpc error: Permission denied"

That error looks like evals have somehow been created with the wrong leader ACL token.

@urog
Author

urog commented Feb 3, 2022

Just tested on Nomad 1.2.5 and it appears to be working. I will test some node draining / job migrations and report back.

@tgross
Member

tgross commented Feb 3, 2022

@urog for plugins, not just volumes?

@tgross tgross added stage/waiting-reply stage/needs-verification Issue needs verifying it still exists and removed stage/needs-investigation labels Feb 3, 2022
@tgross tgross self-assigned this Feb 3, 2022
@urog
Author

urog commented Feb 7, 2022

I have tested:

  • Rolling new Nomad servers into an existing server cluster, and removing old servers
  • Abruptly terminating clients running Nomad jobs with a volume
  • Gracefully draining clients
  • Force-killing both CSI controllers and agents

All resulted in successful volume mounting / claiming. One thing to note: a couple of times, when a job or node was killed and Nomad was trying to place the job on another node, the following message appeared in the logs:

failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: could not claim volume redis-dev: rpc error: rpc error: controller publish: attach volume: controller attach volume: failed to find clients running controller plugin "gcepd"

This happened even though there were nodes running in the same availability zone as the volume, and the CSI agent was also running on those nodes.

@tgross
Member

tgross commented Feb 8, 2022

All resulted in successful volume mounting / claiming.

That's great! But I asked about plugin GC, which was the only open topic in this issue.

It looks like we don't have any more data on plugin GC here, so I'm going to close this issue out so that we don't get side-tracked. There are some open issues around plugin counts and health that I'm still working through, like #11758, #9810, #10073, and #11784. If folks have more data to add about plugins, those issues are the best place to add it. Thanks!

@tgross tgross closed this as completed Feb 8, 2022
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Feb 8, 2022
@urog
Author

urog commented Feb 8, 2022

Sorry - I was a bit carried away by it all working! The original errors that I posted are all gone. They have been replaced with the following new logs:

2022-02-08T20:56:19.380Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval \"7920f269-0494-c4fa-3a41-ec411edb3cb2\" JobID: \"csi-plugin-gc\" Namespace: \"-\">"
2022-02-08T20:56:40.391Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval \"fa8cd038-d119-678b-8f61-f9ab39101038\" JobID: \"csi-plugin-gc\" Namespace: \"-\">"
2022-02-08T20:57:01.402Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval \"b4ce5512-d6ed-c757-dff8-e5e791920ef5\" JobID: \"csi-plugin-gc\" Namespace: \"-\">"
2022-02-08T20:57:22.410Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval \"7aeb9042-d7fa-4a93-0f6b-4a19beadcb8b\" JobID: \"csi-plugin-gc\" Namespace: \"-\">"

These errors are still present:

2022-02-08T21:00:10.499Z [ERROR] core.sched: failed to GC plugin: plugin_id=gcepd error="Permission denied"
2022-02-08T21:00:10.499Z [ERROR] worker: error invoking scheduler: worker_id=7f77d61b-5b78-1ac6-4bb5-d714ac5111a2 error="failed to process evaluation: Permission denied"

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022