Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI volume allocation count not updated properly in UI #9495

Closed
apollo13 opened this issue Dec 2, 2020 · 11 comments
Closed

CSI volume allocation count not updated properly in UI #9495

apollo13 opened this issue Dec 2, 2020 · 11 comments

Comments

@apollo13
Copy link
Contributor

apollo13 commented Dec 2, 2020

This is a follow up to #9215. The volume tab after starting a task looks like this (don't click anywhere before), just open the UI and click Storage:
image

Now when I click on the wiki link it takes me to the volume details where it properly shows the write allocation:
image

And if I then click on Volumes (in the breadcrumbs) to get back to the initial page the count is magically updated:
image

But if I now click on Storage in the left sidebar the alloc count drops back to zero.

Nomad version is: 4d6a166

@tgross
Copy link
Member

tgross commented Dec 2, 2020

Hi @apollo13! Thanks for opening this and we'll take a look.

@backspace
Copy link
Contributor

backspace commented Dec 2, 2020

I don’t know how to run CSI locally 🙃 but my guess would be that the API is returning different values from /volumes and /volumes/csi/:id 😯

ETA: specifically for readAllocations and writeAllocations, maybe the group query doesn’t return them at all?

@tgross
Copy link
Member

tgross commented Dec 3, 2020

Using the test rig I describe below, I can reproduce the exact behavior @apollo13 is reporting on current master. On 1.0.0-beta2 or 0.12.x, the volume list page always shows a count of 0.

So it looks like the fix for #9215 (PR #9377) dealt with the volume detail page correctly but didn't fix the list page. The confusing thing to me is that there's some stateful behavior here:

  • Click on Storage; page URL is updated to /ui/csi/volumes (request to /v1/volumes?type=csi). 0 allocs showing in the count for test-volume0
  • Click on test-volume0; /ui/csi/volumes/test-volume0 (request to /v1/volume/csi%2Ftest-volume0). 2 Read allocs showing in the list as expected.
  • Click on Volumes breadcrumb; page URL is updated to /us/csi/volumes (request to v1/volumes?type=csi). 2 allocs showing in the count for test-volume0.
  • Click on Storage; page URL is unchanged (but there's another request to /v1/volumes?type=csi). 0 allocs showing in the count for test-volume0!

Because my test has two volumes, I see some additional symptoms:

  • Click on Storage; page URL is updated to /ui/csi/volumes. 0 allocs showing in the count for test-volume0 and test-volume1
  • Click on test-volume0; /ui/csi/volumes/test-volume0. 2 Read allocs showing in the list as expected.
  • Click on Volumes breadcrumb; page URL is updated to /us/csi/volumes. 2 allocs showing in the count for test-volume0 and 0 for test-volume1
  • Click on test-volume0; /ui/csi/volumes/test-volume1. 2 Read allocs showing in the list as expected.
  • Click on Volumes breadcrumb; page URL is updated to /us/csi/volumes. Now 2 allocs are showing in the count for both test-volume0 and test-volume1, as expected.
  • Click on Storage; page URL is unchanged. 0 allocs showing in the count for test-volume0 and test-volume1!

So the volume list page is only showing a count when accessed via the breadcrumb from the volume detail page.

The API responses reflect what we'd expect. The list page API doesn't include a total count and neither the struct.CSIVolListStub or api.CSIVolumeListStub include the total count as a field or have ever done. In the browser's network console I don't see any calls to the detail API from the list view. (Which I think is expected because that's 1+n?)

So the confusing bit for me is not that we have a total count of 0 in the list view (which is easy enough to fix by adding CurrentReaders and CurrentWriters), but that somehow this data is getting populated when we hit the detail view and then erased when we go back to the list view. This is probably where I could use some help from @backspace because I'm past my meager web UI skills. 😀

Results of curl -s "localhost:4646/v1/volumes?type=csi" | jq .

[
  {
    "ID": "test-volume0",
    "Namespace": "default",
    "Name": "test-volume0",
    "ExternalID": "631ef179-356d-11eb-a328-0242ac110002",
    "Topologies": [],
    "AccessMode": "single-node-writer",
    "AttachmentMode": "file-system",
    "CurrentReaders": 2,
    "CurrentWriters": 0,
    "Schedulable": true,
    "PluginID": "hostpath-plugin0",
    "Provider": "csi-hostpath",
    "ControllersHealthy": 0,
    "ControllersExpected": 0,
    "NodesHealthy": 1,
    "NodesExpected": 1,
    "CreateIndex": 16,
    "ModifyIndex": 23
  },
  {
    "ID": "test-volume1",
    "Namespace": "default",
    "Name": "test-volume1",
    "ExternalID": "632709a7-356d-11eb-a328-0242ac110002",
    "Topologies": [],
    "AccessMode": "single-node-writer",
    "AttachmentMode": "file-system",
    "CurrentReaders": 2,
    "CurrentWriters": 0,
    "Schedulable": true,
    "PluginID": "hostpath-plugin0",
    "Provider": "csi-hostpath",
    "ControllersHealthy": 0,
    "ControllersExpected": 0,
    "NodesHealthy": 1,
    "NodesExpected": 1,
    "CreateIndex": 17,
    "ModifyIndex": 24
  }
]

Results of curl -s "http://localhost:4646/v1/volume/csi%2Ftest-volume0" | jq .

{
  "AccessMode": "single-node-writer",
  "Allocations": [
    {
      "ClientDescription": "Tasks are running",
      "ClientStatus": "running",
      "CreateIndex": 19,
      "CreateTime": 1607002949687635500,
      "DeploymentStatus": {
        "Canary": false,
        "Healthy": true,
        "ModifyIndex": 32,
        "Timestamp": "2020-12-03T13:42:40.465063115Z"
      },
      "DesiredDescription": "",
      "DesiredStatus": "run",
      "EvalID": "cdfbd5f5-0907-aca6-da49-19a50a7402b4",
      "FollowupEvalID": "",
      "ID": "79b30d92-0055-2086-835f-955ecfd7abc1",
      "JobID": "example",
      "JobType": "service",
      "JobVersion": 0,
      "ModifyIndex": 32,
      "ModifyTime": 1607002960587864800,
      "Name": "example.cache[0]",
      "Namespace": "default",
      "NodeID": "eb894f13-d782-6d1e-fdb4-1088db394fba",
      "NodeName": "devmode",
      "PreemptedAllocations": null,
      "PreemptedByAllocation": "",
      "RescheduleTracker": null,
      "TaskGroup": "cache",
      "TaskStates": {
        "redis": {
          "Events": [
            {
              "Details": {},
              "DiskLimit": 0,
              "DiskSize": 0,
              "DisplayMessage": "Task received by client",
              "DownloadError": "",
              "DriverError": "",
              "DriverMessage": "",
              "ExitCode": 0,
              "FailedSibling": "",
              "FailsTask": false,
              "GenericSource": "",
              "KillError": "",
              "KillReason": "",
              "KillTimeout": 0,
              "Message": "",
              "RestartReason": "",
              "SetupError": "",
              "Signal": 0,
              "StartDelay": 0,
              "TaskSignal": "",
              "TaskSignalReason": "",
              "Time": 1607002949690282500,
              "Type": "Received",
              "ValidationError": "",
              "VaultError": ""
            },
            {
              "Details": {
                "message": "Building Task Directory"
              },
              "DiskLimit": 0,
              "DiskSize": 0,
              "DisplayMessage": "Building Task Directory",
              "DownloadError": "",
              "DriverError": "",
              "DriverMessage": "",
              "ExitCode": 0,
              "FailedSibling": "",
              "FailsTask": false,
              "GenericSource": "",
              "KillError": "",
              "KillReason": "",
              "KillTimeout": 0,
              "Message": "Building Task Directory",
              "RestartReason": "",
              "SetupError": "",
              "Signal": 0,
              "StartDelay": 0,
              "TaskSignal": "",
              "TaskSignalReason": "",
              "Time": 1607002949711526100,
              "Type": "Task Setup",
              "ValidationError": "",
              "VaultError": ""
            },
            {
              "Details": {},
              "DiskLimit": 0,
              "DiskSize": 0,
              "DisplayMessage": "Task started by client",
              "DownloadError": "",
              "DriverError": "",
              "DriverMessage": "",
              "ExitCode": 0,
              "FailedSibling": "",
              "FailsTask": false,
              "GenericSource": "",
              "KillError": "",
              "KillReason": "",
              "KillTimeout": 0,
              "Message": "",
              "RestartReason": "",
              "SetupError": "",
              "Signal": 0,
              "StartDelay": 0,
              "TaskSignal": "",
              "TaskSignalReason": "",
              "Time": 1607002950464377300,
              "Type": "Started",
              "ValidationError": "",
              "VaultError": ""
            }
          ],
          "Failed": false,
          "FinishedAt": null,
          "LastRestart": null,
          "Restarts": 0,
          "StartedAt": "2020-12-03T13:42:30.464381504Z",
          "State": "running"
        }
      }
    },
    {
      "ClientDescription": "Tasks are running",
      "ClientStatus": "running",
      "CreateIndex": 19,
      "CreateTime": 1607002949687635500,
      "DeploymentStatus": {
        "Canary": false,
        "Healthy": true,
        "ModifyIndex": 32,
        "Timestamp": "2020-12-03T13:42:40.436082297Z"
      },
      "DesiredDescription": "",
      "DesiredStatus": "run",
      "EvalID": "cdfbd5f5-0907-aca6-da49-19a50a7402b4",
      "FollowupEvalID": "",
      "ID": "9f7932c3-4193-f6a5-bb81-c1150c39c3d5",
      "JobID": "example",
      "JobType": "service",
      "JobVersion": 0,
      "ModifyIndex": 32,
      "ModifyTime": 1607002960587864800,
      "Name": "example.cache[1]",
      "Namespace": "default",
      "NodeID": "eb894f13-d782-6d1e-fdb4-1088db394fba",
      "NodeName": "devmode",
      "PreemptedAllocations": null,
      "PreemptedByAllocation": "",
      "RescheduleTracker": null,
      "TaskGroup": "cache",
      "TaskStates": {
        "redis": {
          "Events": [
            {
              "Details": {},
              "DiskLimit": 0,
              "DiskSize": 0,
              "DisplayMessage": "Task received by client",
              "DownloadError": "",
              "DriverError": "",
              "DriverMessage": "",
              "ExitCode": 0,
              "FailedSibling": "",
              "FailsTask": false,
              "GenericSource": "",
              "KillError": "",
              "KillReason": "",
              "KillTimeout": 0,
              "Message": "",
              "RestartReason": "",
              "SetupError": "",
              "Signal": 0,
              "StartDelay": 0,
              "TaskSignal": "",
              "TaskSignalReason": "",
              "Time": 1607002949689950500,
              "Type": "Received",
              "ValidationError": "",
              "VaultError": ""
            },
            {
              "Details": {
                "message": "Building Task Directory"
              },
              "DiskLimit": 0,
              "DiskSize": 0,
              "DisplayMessage": "Building Task Directory",
              "DownloadError": "",
              "DriverError": "",
              "DriverMessage": "",
              "ExitCode": 0,
              "FailedSibling": "",
              "FailsTask": false,
              "GenericSource": "",
              "KillError": "",
              "KillReason": "",
              "KillTimeout": 0,
              "Message": "Building Task Directory",
              "RestartReason": "",
              "SetupError": "",
              "Signal": 0,
              "StartDelay": 0,
              "TaskSignal": "",
              "TaskSignalReason": "",
              "Time": 1607002949712533000,
              "Type": "Task Setup",
              "ValidationError": "",
              "VaultError": ""
            },
            {
              "Details": {},
              "DiskLimit": 0,
              "DiskSize": 0,
              "DisplayMessage": "Task started by client",
              "DownloadError": "",
              "DriverError": "",
              "DriverMessage": "",
              "ExitCode": 0,
              "FailedSibling": "",
              "FailsTask": false,
              "GenericSource": "",
              "KillError": "",
              "KillReason": "",
              "KillTimeout": 0,
              "Message": "",
              "RestartReason": "",
              "SetupError": "",
              "Signal": 0,
              "StartDelay": 0,
              "TaskSignal": "",
              "TaskSignalReason": "",
              "Time": 1607002950435578600,
              "Type": "Started",
              "ValidationError": "",
              "VaultError": ""
            }
          ],
          "Failed": false,
          "FinishedAt": null,
          "LastRestart": null,
          "Restarts": 0,
          "StartedAt": "2020-12-03T13:42:30.435582395Z",
          "State": "running"
        }
      }
    }
  ],
  "AttachmentMode": "file-system",
  "Context": {},
  "ControllerRequired": false,
  "ControllersExpected": 0,
  "ControllersHealthy": 0,
  "CreateIndex": 16,
  "ExternalID": "631ef179-356d-11eb-a328-0242ac110002",
  "ID": "test-volume0",
  "ModifyIndex": 23,
  "MountOptions": null,
  "Name": "test-volume0",
  "Namespace": "default",
  "NodesExpected": 1,
  "NodesHealthy": 1,
  "Parameters": {},
  "PluginID": "hostpath-plugin0",
  "Provider": "csi-hostpath",
  "ProviderVersion": "v1.2.0-0-g83590990",
  "ReadAllocs": {
    "79b30d92-0055-2086-835f-955ecfd7abc1": null,
    "9f7932c3-4193-f6a5-bb81-c1150c39c3d5": null
  },
  "ResourceExhausted": null,
  "Schedulable": true,
  "Secrets": null,
  "Topologies": [],
  "WriteAllocs": {}
}

My test environment for this is Nomad running in dev mode on Linux, with csc installed to send the create volume commands to the hostpath CSI plugin that Nomad doesn't yet support. Below are the files I use (sorry that there's no simpler set of reproductions for this sort of thing yet... we need #8212 before we can create volumes directly.) Also, @backspace are we missing Mirage mocks for CSI? Can I help out with that?

test.sh
#!/bin/bash

DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

nomad job run "${DIR}/plugin.nomad"

while :
do
    nomad plugin status hostpath | grep "Nodes Healthy        = 1" && break
    sleep 2
done

nomad plugin status hostpath

# create the volume in the "external provider"
VOLUME_NAME=test-volume

# non-dev mode
#CSI_ENDPOINT="/var/nomad/client/csi/monolith/hostpath-plugin0/csi.sock"

# dev mode path is going to be in a tempdir
PLUGIN_DOCKER_ID=$(docker ps | grep hostpath | awk -F' +' '{print $1}')
CSI_ENDPOINT=$(docker inspect $PLUGIN_DOCKER_ID | jq -r '.[0].Mounts[] | select(.Destination == "/csi") | .Source')/csi.sock

echo "creating volume..."
UUID=$(sudo csc --endpoint "$CSI_ENDPOINT" controller create-volume "${VOLUME_NAME}0" --cap 1,2,ext4 | grep -o '".*"' | tr -d '"')

echo "registering volume $UUID..."
sed -e "s/EXTERNAL_ID/$UUID/" \
    -e "s/VOLUME_NAME/${VOLUME_NAME}0/" \
    "${DIR}/volume.hcl" | nomad volume register -

echo "creating second volume..."
UUID=$(sudo csc --endpoint "$CSI_ENDPOINT" controller create-volume "${VOLUME_NAME}1" --cap 1,2,ext4 | grep -o '".*"' | tr -d '"')

echo "registering volume $UUID..."
sed -e "s/EXTERNAL_ID/$UUID/" \
    -e "s/VOLUME_NAME/${VOLUME_NAME}1/" \
    "${DIR}/volume.hcl" | nomad volume register -

echo "claiming volume..."
nomad job run "${DIR}/example.nomad"
plugin.nomad
job "csi-plugin" {
  datacenters = ["dc1"]

  spread {
    attribute = "${node.unique.id}"
  }

  group "csi" {

    constraint {
      operator = "distinct_hosts"
      value    = true
    }

    task "plugin" {
      driver = "docker"

      config {
        image = "quay.io/k8scsi/hostpathplugin:v1.2.0"

        args = [
          "--drivername=csi-hostpath",
          "--v=5",
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=node-${NOMAD_ALLOC_INDEX}",
        ]

        privileged = true
      }

      meta {
        vers = "1"
      }

      csi_plugin {
        id = "hostpath-plugin0"
        #type = "monolith"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 256
        memory = 128
      }
    }
  }
}
example.nomad
variables {
  volume_ids = ["volume0", "volume1"]
}

job "example" {
  datacenters = ["dc1"]

  group "cache" {

    count = 2

    dynamic "volume" {
      for_each = var.volume_ids
      labels   = [volume.value]
      content {
        type      = "csi"
        source    = "test-${volume.value}"
        read_only = true
      }
    }

    network {
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
        ports = ["db"]
      }

      dynamic "volume_mount" {
        for_each = var.volume_ids
        content {
          volume      = "${volume_mount.value}"
          destination = "${NOMAD_ALLOC_DIR}/${volume_mount.value}"
        }
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
volume.hcl
id = "VOLUME_NAME"
name = "VOLUME_NAME"
type = "csi"
external_id = "EXTERNAL_ID"
plugin_id = "hostpath-plugin0"
access_mode = "single-node-writer"
attachment_mode = "file-system"

@backspace
Copy link
Contributor

I’ll look more into this but there is a volumes serialiser

  serialize() {
    var json = ApplicationSerializer.prototype.serialize.apply(this, arguments);
    if (json instanceof Array) {
      json.forEach(serializeVolume);
    } else {
      serializeVolume(json);
    }
    return json;
  },
});

function serializeVolume(volume) {
  volume.WriteAllocs = groupBy(volume.WriteAllocs, 'ID');
  volume.ReadAllocs = groupBy(volume.ReadAllocs, 'ID');
}

To accurately reflect the API, serializeVolume shouldn’t be called when there’s an array 😞

I’m not sure if the tests actually make assertions on this, I’ll look into that.

@backspace
Copy link
Contributor

There is indeed an assertion of the allocation count.

As for why fetching the collection after fetching an item from it clears the loaded allocations, I believe it’s these lines in the serialiser. Ember Data is a “single source of truth” so when the collection is loaded and no allocations are included, the relationships are deleted. It might be possible to avoid this by bypassing extraction relationships when the ReadAllocs and WriteAllocs are absent but it wouldn’t at all with count being 0 on the initial load, so it doesn’t seem worth it to me 🤔

Is it prohibitive to include ReadAllocs and WriteAllocs in the API response for the collection? If it is, maybe removing # Allocs from the table is the answer because otherwise we’d be N+1-ing.

@tgross
Copy link
Member

tgross commented Dec 3, 2020

Is it prohibitive to include ReadAllocs and WriteAllocs in the API response for the collection?

We already have the fields CurrentReaders and CurrentWriters with the count. I'd rather not include the whole allocation as ReadAllocs and WriteAllocs if we can avoid it.

For example:

[
  {
    "ID": "test-volume0",
    "Namespace": "default",
    "Name": "test-volume0",
    "ExternalID": "631ef179-356d-11eb-a328-0242ac110002",
    "Topologies": [],
    "AccessMode": "single-node-writer",
    "AttachmentMode": "file-system",
    "CurrentReaders": 2,
    "CurrentWriters": 0,
    "Schedulable": true,
    "PluginID": "hostpath-plugin0",
    "Provider": "csi-hostpath",
    "ControllersHealthy": 0,
    "ControllersExpected": 0,
    "NodesHealthy": 1,
    "NodesExpected": 1,
    "CreateIndex": 16,
    "ModifyIndex": 23
  }
]

@backspace
Copy link
Contributor

oh, interesting, so CurrentReaders + CurrentWriters = ReadAllocs.length + WriteAllocs.length? I am quite ignorant of this realm but if that’s true I could make a quick PR.

@tgross
Copy link
Member

tgross commented Dec 3, 2020

oh, interesting, so CurrentReaders + CurrentWriters = ReadAllocs.length + WriteAllocs.length? I am quite ignorant of this realm but if that’s true I could make a quick PR.

Yup! Sorry, I could have been more clear about that! 😀

(CurrentReaders and CurrentWriters only show on the list API, whereas ReadAllocs and WriteAllocs are on the detail API.)

backspace added a commit that referenced this issue Dec 3, 2020
The API doesn’t return the ReadAllocs and WriteAllocs for the
index query, as discussed here:
#9495 (comment)

So this removes that, which will cause the allocation count in the
table to be 0, as in the bug report.
@tgross
Copy link
Member

tgross commented Dec 7, 2020

Looks like we missed the merge window for rc2 on this one so if there's no rc3 this will land in 1.0.1.

@tgross tgross added this to the 1.0.1 milestone Dec 7, 2020
tgross added a commit that referenced this issue Dec 8, 2020
@tgross tgross modified the milestones: 1.0.1, 1.0 Dec 8, 2020
@tgross
Copy link
Member

tgross commented Dec 8, 2020

Correction: this did land in the merge window for 1.0 GA! Changelog entry added in #9563

tgross added a commit that referenced this issue Dec 8, 2020
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants