GCE CSI driver not claiming volumes successfully registered in Nomad #7901

Closed
auto-store opened this issue May 8, 2020 · 5 comments

@auto-store

auto-store commented May 8, 2020

### Nomad version
Nomad v0.11.1 (b434570)

### Issue

I created a persistent disk and successfully registered the volume. When running a Nomad job that should claim the volume, I hit the following error:


==> Monitoring evaluation "b8f55763"
    Evaluation triggered by job "nginx"
    Evaluation within deployment: "cbadc171"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "b8f55763" finished with status "complete" but failed to place all allocations:
    Task Group "nginx" (failed to place 1 allocation):
      * Constraint "missing CSI Volume nginx": 2 nodes excluded by filter
    Evaluation "b58c1c1e" waiting for additional capacity to place remainder

I'm using this persistent disk:

[tomh@hashi-server-1 ~]$ gcloud compute disks describe nginx --zone europe-west2-c
creationTimestamp: '2020-05-07T09:34:53.197-07:00'
id: '7565291015210894915'
kind: compute#disk
labelFingerprint: 42WmSpB8rSM=
name: nginx
physicalBlockSizeBytes: '4096'
selfLink: https://www.googleapis.com/compute/v1/projects/tharris-demo-env/zones/europe-west2-c/disks/nginx
sizeGb: '10'
status: READY
type: https://www.googleapis.com/compute/v1/projects/tharris-demo-env/zones/europe-west2-c/diskTypes/pd-standard
zone: https://www.googleapis.com/compute/v1/projects/tharris-demo-env/zones/europe-west2-c

Volume registration file:

type = "csi"
id = "7565291015210894915"
name = "nginx"
external_id = "https://www.googleapis.com/compute/v1/projects/tharris-demo-env/zones/europe-west2-c/disks/nginx"
access_mode = "single-node-writer"
attachment_mode = "file-system"
plugin_id = "gcepd"

The volume is showing as registered:

[tomh@hashi-server-1 ~]$ nomad volume status
Container Storage Interface
ID        Name   Plugin ID  Schedulable  Access Mode
31033119  mysql  gcepd      true         single-node-writer
75652910  nginx  gcepd      true         single-node-writer
[tomh@hashi-server-1 ~]$
[tomh@hashi-server-1 ~]$
[tomh@hashi-server-1 ~]$ nomad volume status 75652910
ID                   = 7565291015210894915
Name                 = nginx
External ID          = https://www.googleapis.com/compute/v1/projects/tharris-demo-env/zones/europe-west2-c/disks/nginx
Plugin ID            = gcepd
Provider             = pd.csi.storage.gke.io
Version              = v0.7.0-gke.0
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 2
Nodes Healthy        = 2
Nodes Expected       = 2
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

Job spec:

job "nginx" {
  datacenters = ["london"]
  type = "service"

  group "nginx" {
    count = 1

    volume "nginx" {
      type      = "csi"
      read_only = false
      source    = "nginx"
    }


    task "nginx" {
      driver = "docker"

      volume_mount {
        volume      = "nginx"
        destination = "/etc/nginx/conf.d/default.conf"
        read_only   = false
      }

      config {
        image = "nginx"
        port_map {
          http = 8080
        }
        port_map {
          https = 443
        }
      }
    }


    service {
      name = "nginx"
      tags = ["nginx"]
      port = "http"

      check {
        type     = "tcp"
        interval = "10s"
        timeout  = "2s"
      }
    }
  }
}

### Job status:

[tomh@hashi-server-1 ~]$ nomad job status nginx
ID            = nginx
Name          = nginx
Submit Date   = 2020-05-08T14:47:43Z
Type          = service
Priority      = 50
Datacenters   = london
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
nginx       1       0         0        0       0         0

Placement Failure
Task Group "nginx":
  * Constraint "missing CSI Volume nginx": 2 nodes excluded by filter

Latest Deployment
ID          = cbadc171
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
nginx       1        0       0        0          N/A

Allocations
No allocations placed
auto-store changed the title from "GCE CSI driver not claiming" to "GCE CSI driver not claiming volumes successfully registered in Nomad" on May 8, 2020
@auto-store
Author

Referencing the ID from the nomad volume status output, rather than the volume name or the GCP ID associated with the persistent disk, made the error message go away.

[tomh@hashi-node-1 ~]$ nomad volume status
Container Storage Interface
ID        Name    Plugin ID  Schedulable  Access Mode
75652910  nginx   gcepd      true         single-node-writer
75998672  alpine  gcepd      true         single-node-writer

Excerpt from the job file:

group "nginx" {
    count = 1

    volume "nginx" {
      type      = "csi"
      read_only = false
      source    = "75652910"
    }


    task "nginx" {
      driver = "docker"

      volume_mount {
        volume      = "nginx"
        destination = "/etc/nginx/conf.d/default.conf"
        read_only   = false
      }

Closing issue.

@auto-store
Author

The above correction placed the allocation, but it is still not behaving as expected, and the logs show that the volume still cannot be found:

Client logs:

2020/05/12 15:14:31.355826 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/e64bf2cd-3ac8-4698-6a42-f07dbc80b73c/plugin/secrets/creds.json"
2020-05-12T15:14:31.383Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=cf193f36970ab2688ab6b601f4a1ea63b9a0c77681be3502d9f7eae9fb4c768b
2020-05-12T15:14:31.746Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=cf193f36970ab2688ab6b601f4a1ea63b9a0c77681be3502d9f7eae9fb4c768b
2020-05-12T15:14:40.166Z [INFO]  client: node registration complete
2020-05-12T15:17:00.675Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume not found: 75652910" rpc=CSIVolume.Claim server=10.154.0.26:4647
2020-05-12T15:17:00.675Z [ERROR] client.alloc_runner: prerun failed: alloc_id=3f755e3c-f008-0f53-ea10-56884be70169 error="pre-run hook "csi_hook" failed: claim volumes: rpc error: rpc error: volume not found: 75652910"
2020-05-12T15:17:00.682Z [INFO]  client.gc: marking allocation for GC: alloc_id=3f755e3c-f008-0f53-ea10-56884be70169
2020-05-12T15:17:38.254Z [WARN]  client.csi_client: finished client unary call: plugin.name=gcepd plugin.type=controller grpc.code=Internal duration=7.554160355s grpc.service=csi.v1.Controller grpc.method=ControllerPublishVolume
2020-05-12T15:18:38.906Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume not found: 75652910" rpc=CSIVolume.Claim server=10.154.0.36:4647
2020-05-12T15:18:38.906Z [ERROR] client.alloc_runner: prerun failed: alloc_id=6c1b383a-6137-a8c8-050d-2bf04ef9ffb3 error="pre-run hook "csi_hook" failed: claim volumes: rpc error: rpc error: volume not found: 75652910"
2020-05-12T15:18:38.913Z [INFO]  client.gc: marking allocation for GC: alloc_id=6c1b383a-6137-a8c8-050d-2bf04ef9ffb3
[tomh@hashi-server-1 nomad-csi]$ nomad job status nginx
ID            = nginx
Name          = nginx
Submit Date   = 2020-05-12T15:27:05Z
Type          = service
Priority      = 50
Datacenters   = london
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
nginx       0       0         0        1       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
nginx       f73b768c  7m4s from now

Latest Deployment
ID          = ba4c0472
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
nginx       1        1       0        1          2020-05-12T15:37:05Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
6cd9f6db  dc63068e  nginx       0        run      failed  57s ago  54s ago

Jobspec:

job "nginx" {
  datacenters = ["london"]
  type = "service"

  group "nginx" {
    count = 1

    volume "nginx" {
      type      = "csi"
      read_only = false
      source    = "75652910"
    }


    task "nginx" {
      driver = "docker"

      volume_mount {
        volume      = "nginx"
        destination = "/etc/nginx/conf.d/default.conf"
        read_only   = false
      }

      config {
        image = "nginx"
        port_map {
          http = 8080
        }
        port_map {
          https = 443
        }
      }
    }
  }
}

Changing to the volume ID under volume_mount gives this error when running the job: * Task nginx has a volume mount (0) referencing undefined volume 75652910
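
For context on that last error: in a Nomad jobspec, `volume_mount.volume` refers to the label of the group-level `volume` block, not to the ID of the registered volume, which is why putting `75652910` there is rejected as an undefined volume. A minimal fragment showing how the two names line up, reusing the labels and values from the jobspec above (a sketch only, not a tested job; the `source` value is left exactly as posted):

```hcl
group "nginx" {
  # The label "nginx" is the name that tasks in this group use
  # to refer to the volume.
  volume "nginx" {
    type      = "csi"
    read_only = false
    source    = "75652910" # unchanged from the jobspec as posted
  }

  task "nginx" {
    driver = "docker"

    volume_mount {
      # Must match the label of the volume block above,
      # not the value of its `source` field.
      volume      = "nginx"
      destination = "/etc/nginx/conf.d/default.conf"
      read_only   = false
    }
  }
}
```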

@auto-store auto-store reopened this May 12, 2020
@tgross
Member

tgross commented May 12, 2020

The jobspec has the following: source = "75652910"

But that's not the ID, it's an ID prefix. You can use the prefix of the ID in commands like nomad volume status as an ergonomic aid in the CLI (although if there's a collision it'll tell you so), but the jobspec needs to include the full ID.
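
For the volume in this thread, the full ID is the `id` value from the registration file, which is also what the `ID` field of `nomad volume status 75652910` prints. Assuming nothing else in the job changes, the corrected `volume` block would look roughly like this (a sketch built from the values posted above, not a verified jobspec):

```hcl
group "nginx" {
  count = 1

  volume "nginx" {
    type      = "csi"
    read_only = false

    # Full volume ID as registered (the `id` field of the registration
    # file), not the 8-character prefix shown in the
    # `nomad volume status` list view.
    source = "7565291015210894915"
  }
}
```

The `volume_mount` block in the task stays as it was, since it refers to the `volume "nginx"` label rather than to the ID.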

@auto-store
Author

auto-store commented May 13, 2020

That fixed it, thanks @tgross

I can now successfully attach the volume, write some data, purge the job, start the job again, and the data has persisted, which is what I have been trying to get working with this driver since #7734. Thanks for including that fix from #7734 in the 0.11.1 release and for pointing out my mistake here, much appreciated.

@github-actions

github-actions bot commented Nov 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022