nomad node status doesn't return in use CSI volumes #17923

Closed
the-nando opened this issue Jul 12, 2023 · 6 comments · Fixed by #17925
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/cli type/bug

Comments

@the-nando
Contributor

the-nando commented Jul 12, 2023

Nomad version

Nomad v1.5.6

Issue

Calling nomad node status -verbose <node_id> for a node that runs an allocation using a CSI volume shows only the header of the CSI Volumes section:

ID              = f65f89ba-1b54-71d2-bb2f-e582dac7b916
Name            = test-client
Class           = default
DC              = us-west-2b
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = aws-ebs,aws-efs
Uptime          = 150h19m3s

Host Volumes
Name         ReadOnly  Source
test-volume  true      /data/test/foo

CSI Volumes                                                                             <--------
ID  Name  Plugin ID  Schedulable  Provider  Access Mode                                 <--------

Drivers
Driver    Detected  Healthy  Message   Time
docker    true      true     Healthy   2023-07-06T08:35:02Z
[...]

Volume status output:

~ nomad volume status -namespace='*' 'my-vol[1]' 
ID                   = my-vol[1]
Name                 = my-vol[1]
External ID          = vol-xxx
Plugin ID            = aws-ebs
Provider             = ebs.csi.aws.com
Version              = v1.12.1
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 64
Nodes Expected       = 64
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = fs_type: ext4
Namespace            = my-ns

Topologies
Topology  Segments

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
e76669ab  f65f89ba  test        46       run      running  1d6h ago  10m20s ago
~ 

Reproduction steps

  1. Create a CSI volume
  2. Create and run a job that uses the volume
  3. Get the ID of the node where the allocation is scheduled
  4. Call nomad node status -verbose <node_id>

Expected Result

nomad node status -verbose shows all CSI volumes in use on the node.

Actual Result

No volumes are displayed for the node.
I believe one of the problems is that the call to /v1/volumes doesn't set namespace=* as a query parameter, but there seems to be more to it, because setting it still doesn't return the volumes.

vs, _ := client.Nodes().CSIVolumes(node.ID, &api.QueryOptions{Namespace: "*"})

I also get an empty list if I hit the API directly with a curl:

curl  -H "X-Nomad-Token: ${NOMAD_TOKEN}" "${NOMAD_ADDR}/v1/volumes?type=csi&node_id=f65f89ba-1b54-71d2-bb2f-e582dac7b916&namespace=my-ns&region=us-west-2"
@lgfa29
Contributor

lgfa29 commented Jul 12, 2023

Hi @the-nando 👋

I think your analysis is correct, and we're missing the * namespace when querying volumes so I opened #17925 to fix this.

But the fact that you get an empty result is a bit strange 🤔

I don't think that's the problem, but have you tried calling the API with a management token just to see if the problem is not related to permissions?

@lgfa29 lgfa29 added theme/cli stage/accepted Confirmed, and intend to work on. No timeline committment though. labels Jul 12, 2023
@lgfa29 lgfa29 self-assigned this Jul 12, 2023
@lgfa29 lgfa29 added this to Needs Triage in Nomad - Community Issues Triage via automation Jul 12, 2023
@lgfa29 lgfa29 moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jul 12, 2023
@the-nando
Contributor Author

the-nando commented Jul 13, 2023

Hi @lgfa29, I'm using a management token with the curl. If I try without it, I get a 403 Forbidden / Permission denied, as expected.
I ran some additional tests and I'm getting partial and inconsistent results when querying /v1/volumes?type=csi&node_id=<node_id>.
This is the script I'm using:

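#!/usr/bin/env bash
# Count the CSI volumes the API reports for each node, across all namespaces.

total=0
while read -r NODE; do
  # Each input line is "<node_id> <node_name>"; keep only the ID.
  node_id=${NODE%% *}
  # stdin is redirected so the command cannot consume the loop's input.
  volumes_count=$(nomad operator api -X GET "/v1/volumes?type=csi&node_id=${node_id}&namespace=*" < /dev/null | jq length)
  [[ "${volumes_count}" -gt 0 ]] && echo "Found ${volumes_count} volumes on ${NODE}"
  total=$((total+volumes_count))
done < <(nomad operator api -X GET '/v1/nodes' | jq -r '.[] | "\(.ID) \(.Name)"' )
echo "Total: ${total}"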

Sample output, with NOMAD_ADDR pointing to the active leader:

~ ./test_volumes.sh                                                                                                                              
Found 3 volumes on 12a72920-ccc0-713a-f922-ab667ce0d1fc client-172-10-10-22
Found 1 volumes on 4b66a4d3-b2e3-4442-e92c-bc4bafaf5be5 client-172-10-10-61
Found 1 volumes on dde42df5-5815-6daf-802d-590790487e8c client-172-10-10-51
Total: 5
~ ./test_volumes.sh                                                                                                               
Found 1 volumes on 12a72920-ccc0-713a-f922-ab667ce0d1fc client-172-10-10-22
Found 1 volumes on 4b66a4d3-b2e3-4442-e92c-bc4bafaf5be5 client-172-10-10-61
Total: 2
~ ./test_volumes.sh                                                                                                                     
Found 4 volumes on 12a72920-ccc0-713a-f922-ab667ce0d1fc client-172-10-10-22
Found 1 volumes on 4b66a4d3-b2e3-4442-e92c-bc4bafaf5be5 client-172-10-10-61
Found 1 volumes on cfb4a008-517b-4747-a8e5-67a8fe4a6609 client-172-10-10-59
Found 1 volumes on 88ff2c3e-f4e6-4b4a-bfae-5d5fd64c3dbc client-172-10-10-39
Total: 7
~

Would you have the possibility to test this? I'm looking at setting up a local environment but due to the CSI driver / controller requirements it's a bit involved.

@lgfa29
Contributor

lgfa29 commented Jul 13, 2023

That's very strange 🤔

Are your allocations stable? Looking at the code I see that only volumes for allocations that are currently running are returned:

if !(a.DesiredStatus == structs.AllocDesiredStatusRun ||
	a.ClientStatus == structs.AllocClientStatusRunning) ||
	len(tg.Volumes) == 0 {
	continue
}
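
As a rough illustration of which allocations pass that check, here is a standalone Go sketch that evaluates the same boolean expression on a few status combinations. The constants are local stand-ins for the Nomad ones and the skipped helper is hypothetical; this is not the server code.

```go
package main

import "fmt"

// Local stand-ins for the Nomad constants referenced in the condition.
const (
	desiredRun    = "run"
	clientRunning = "running"
)

// skipped mirrors the quoted condition: an allocation is skipped when it is
// neither desired to run nor reported as running, or when its task group
// declares no volumes.
func skipped(desiredStatus, clientStatus string, volumeCount int) bool {
	return !(desiredStatus == desiredRun || clientStatus == clientRunning) ||
		volumeCount == 0
}

func main() {
	cases := []struct {
		desired, client string
		volumes         int
	}{
		{"run", "running", 1},
		{"run", "pending", 1},
		{"stop", "running", 1},
		{"stop", "complete", 1},
		{"run", "running", 0},
	}
	for _, c := range cases {
		fmt.Printf("desired=%s client=%s volumes=%d skipped=%t\n",
			c.desired, c.client, c.volumes, skipped(c.desired, c.client, c.volumes))
	}
}
```

Read this way, an allocation is kept when either status matches and its task group has at least one volume, so only the last two cases are skipped.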

@the-nando
Contributor Author

The allocations are stable, and I get the same results in different federated regions and environments. I don't actively use that API endpoint, but I'll see if I can find the reason for the odd results. Thanks so far!

@lgfa29
Contributor

lgfa29 commented Jul 25, 2023

Would it be possible for you to provide us with a full list of volume results per node instead of an aggregate count?

@the-nando
Contributor Author

Hey @lgfa29 I've just opened a support request and shared the logs there :)
