
Constraint "CSI volume has exhausted its available writer claims": 1 nodes excluded by filter #10927

Closed · gregory112 opened this issue Jul 22, 2021 · 74 comments · Fixed by #11932

Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · theme/storage · type/bug


gregory112 commented Jul 22, 2021

Nomad version

Nomad v1.1.2 (60638a0)

Operating system and Environment details

Ubuntu 20.04 LTS

Issue

Cannot re-plan jobs due to CSI volumes being claimed.
I have seen many variations of this issue and I don't know how to debug it. I use the ceph-csi plugin, deployed as a system job on my two Nomad nodes. This results in two controllers and two ceph-csi node plugins. I then create a few volumes using the nomad volume create command, and a job with three task groups that use these volumes. Sometimes the job fails after a while and I stop it. After that, when I try to re-plan the exact same job, I get that error.

What confuses me is the warning. It differs every time I run job plan. First I saw

- WARNING: Failed to place all allocations.
  Task Group "zookeeper1" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper1-data has exhausted its available writer claims": 2 nodes excluded by filter

  Task Group "zookeeper2" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper2-data has exhausted its available writer claims": 2 nodes excluded by filter

Then, running job plan again a few seconds later, I got

- WARNING: Failed to place all allocations.
  Task Group "zookeeper1" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper1-datalog has exhausted its available writer claims": 2 nodes excluded by filter

  Task Group "zookeeper2" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper2-datalog has exhausted its available writer claims": 2 nodes excluded by filter

Then again,

- WARNING: Failed to place all allocations.
  Task Group "zookeeper1" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper1-data has exhausted its available writer claims": 1 nodes excluded by filter
    * Constraint "CSI volume zookeeper1-datalog has exhausted its available writer claims": 1 nodes excluded by filter

  Task Group "zookeeper2" (failed to place 1 allocation):
    * Constraint "CSI volume zookeeper2-datalog has exhausted its available writer claims": 2 nodes excluded by filter

I have three groups: zookeeper1, zookeeper2, and zookeeper3, each using two volumes (data and datalog). I will just assume from this log that all volumes are non-reclaimable.

This is the output of nomad volume status.

Container Storage Interface
ID                  Name                Plugin ID  Schedulable  Access Mode
zookeeper1-data     zookeeper1-data     ceph-csi   true         single-node-writer
zookeeper1-datalog  zookeeper1-datalog  ceph-csi   true         single-node-writer
zookeeper2-data     zookeeper2-data     ceph-csi   true         single-node-writer
zookeeper2-datalog  zookeeper2-datalog  ceph-csi   true         single-node-writer
zookeeper3-data     zookeeper3-data     ceph-csi   true         <none>
zookeeper3-datalog  zookeeper3-datalog  ceph-csi   true         <none>

It says that they are schedulable. This is the output of nomad volume status zookeeper1-datalog:

ID                   = zookeeper1-datalog
Name                 = zookeeper1-datalog
External ID          = 0001-0024-72f28a72-0434-4045-be3a-b5165287253f-0000000000000003-72ec315b-e9f5-11eb-8af7-0242ac110002
Plugin ID            = ceph-csi
Provider             = cephfs.nomad.example.com
Version              = v3.3.1
Schedulable          = true
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 2
Nodes Expected       = 2
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

It says there are no allocations placed.

Reproduction steps

This is unfortunately flaky, but it most likely happens when a job fails, is stopped, and is then re-planned. The problem persists even after I purge the job with nomad job stop -purge. Running nomad system gc, nomad system reconcile summaries, or restarting Nomad does not help either.

Expected Result

I should be able to claim the volume again without having to detach it, or deregister -force and register it again. I created the volumes using nomad volume create, so their external IDs are all generated. With 6 volumes and 2 nodes, I don't want to type nomad volume detach 12 times every time this happens (and it happens frequently). A sketch of that detach loop is below.
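
A minimal shell sketch of that detach loop, assuming the six volume IDs shown in the nomad volume status output above; NODE1 and NODE2 are placeholders for the two client node IDs (see nomad node status):

# Detach every volume from every client node so the stale claims are released.
# Errors for nodes a volume was never attached to are ignored.
for vol in zookeeper1-data zookeeper1-datalog \
           zookeeper2-data zookeeper2-datalog \
           zookeeper3-data zookeeper3-datalog; do
  for node in NODE1 NODE2; do
    nomad volume detach "$vol" "$node" || true
  done
done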

Actual Result

See error logs above.

Job file (if appropriate)

I have three groups (zookeeper1, zookeeper2, zookeeper3), each with volume blocks like this (each group uses its own volumes; this one is for zookeeper2):

    volume "data" {
      type = "csi"
      read_only = false
      source = "zookeeper2-data"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }
    volume "datalog" {
      type = "csi"
      read_only = false
      source = "zookeeper2-datalog"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

All groups have count = 1.


bfqrst commented Jul 26, 2021

[Nomad 1.1.2]

I have similar issues using the AWS EBS plugin, and I wholeheartedly share the sentiment that it's really hard to reproduce or debug, which is why I've been reluctant to open an issue. In my case the Nomad workers are spot instances which come and go. Non-CSI jobs get rescheduled fine, but those with CSI volumes attached tend to go red on the first reschedule and then eventually succeed. It's almost like the CSI mechanism needs more time to do its thing before the job gets restarted...


gregory112 commented Jul 26, 2021

I have been investigating it for a while.
A manual solution that works is to deregister the volume with -force, or at least to detach it with nomad volume detach. Deregistering means you need to register the volume again, and if you used nomad volume create as I did, you have to look up the external ID and re-register the volume with the same external ID. nomad volume detach is probably the most straightforward approach. I learned that what it actually does is unmount the mount point of that volume on that node, and this sometimes hangs. When it hangs, I see an umount process on the node hanging indefinitely. Network connectivity is never reliable, so I think cases like this need attention.

I don't know what Nomad does with unused volumes after the job is stopped. Does Nomad instruct the plugin to unmount the volumes from the nodes? What happens if that fails? Maybe a timeout, more log messages, or anything at all should be surfaced in nomad alloc status.

umount often hangs, especially when network file systems like this are involved, and when umount hangs, nomad volume detach hangs too. The solution in my case is to kill the umount process on the node and retry with umount -f, but I don't know whether Nomad can do this, as I think it is managed by the CSI plugin. Given the variation among CSI plugins, though, I think cases like this should be handled by Nomad somehow.
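
A rough sketch of that manual cleanup, run on the affected client node; the mount path is a placeholder, so locate the real CSI mount point first:

mount | grep csi                     # find the stuck CSI mount point
pkill -9 -x umount                   # kill the hanging umount process, if any
umount -f /path/to/stuck/csi/mount   # force-unmount; umount -l (lazy) as a last resort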


konsti commented Aug 14, 2021

We have the very same issue with Nomad 1.1.3 and AWS EBS CSI Plugin v1.2.0.
We need to revert to host_volumes as this issue threatens our cluster health when important stateful services like Prometheus can't restart correctly.

The volume is currently unmounted in AWS and shows state = "Available", so it's clearly an internal Nomad state issue.

@gregory112

Well, this is bad, as I don't think the issue is solved by the Nomad garbage collector. When a volume hangs on one Nomad node, Nomad will just allocate the job to another node, and when it hangs there, it will allocate the job to yet another node. Imagine having 100 Nomad nodes with the volumes stuck on some of them.


JanMa commented Aug 16, 2021

I am seeing the same issues using the gcp-compute-persistent-disk-csi-driver plugin and Nomad v1.1.1. Interestingly, the Read Volume API endpoint shows the correct number of writers, but the List Volumes endpoint shows too many CurrentWriters. I guess this is what's blocking the reassignment of the volume.
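
A hedged sketch for comparing the two endpoints with curl and jq; the address and volume ID are placeholders, and the field names follow the behavior described above:

NOMAD_ADDR=${NOMAD_ADDR:-http://127.0.0.1:4646}
VOL_ID="prometheus-disk[0]"   # placeholder volume ID

# List Volumes endpoint: per-volume stubs with CurrentReaders/CurrentWriters
curl -s "$NOMAD_ADDR/v1/volumes?type=csi" \
  | jq --arg id "$VOL_ID" '.[] | select(.ID == $id) | {ID, CurrentReaders, CurrentWriters}'

# Read Volume endpoint: the individual read/write claims for the same volume
curl -s "$NOMAD_ADDR/v1/volume/csi/$(printf %s "$VOL_ID" | jq -sRr @uri)" \
  | jq '{ID, readClaims: (.ReadAllocs | length), writeClaims: (.WriteAllocs | length)}'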

@Thunderbottom

We've faced this issue as well. It seems like Nomad fails to realise that there are no allocations present for the volume.

From what we could make out, both AWS and the CSI plugin report that the volume is available to mount:

$ nomad volume status -verbose -namespace=workspace workspace
ID                   = workspace
Name                 = workspace
External ID          = vol-xxxxxxxxxxxxx
Plugin ID            = aws-ebs0
Provider             = ebs.csi.aws.com
Version              = v0.10.1
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 2
Nodes Expected       = 2
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = workspace

And the AWS console reports the volume is "Available".

In the Nomad UI, however, the "Storage" tab reports multiple #Allocs for the aforementioned volume, while clicking on the volume shows neither read nor write mounts. This supports my hunch that Nomad somehow isn't aware that the volume has been detached by the driver and is no longer in use by any allocation. Here's a gist of the CSI controller logs for the same volume:

I0816 06:26:49.482851       1 controller.go:329] ControllerUnpublishVolume: called with args {VolumeId:vol-XXXXXXXXXX NodeId:i-YYYYYYYYYYY Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0816 06:26:56.194278       1 controller.go:346] ControllerUnpublishVolume: volume vol-XXXXXXXXXX detached from node i-YYYYYYYYYYY

One thing I have noticed is that on a newly registered CSI volume, the metadata shows up as:

Access Mode          = <none>
Attachment Mode      = <none>

And on the volume that is reported as "Schedulable" (not really, as far as Nomad is concerned), it shows up as:

Access Mode          = single-node-writer
Attachment Mode      = file-system

Could be pointing to something, perhaps?

The only solution that has worked for us so far is to -force deregister the volume and re-register it. Would love to have this issue solved. Let me know if I can help with any more information.


lgfa29 commented Aug 17, 2021

Thank you all for the detailed information (keep them coming if you have more!).

It seems like this issue happens sporadically when allocations that use a volume are restarted. I will try to create a high churn environment and see if I can reproduce it.


lgfa29 commented Aug 19, 2021

Hi everyone, just a quick update. I've been running a periodic job for a couple of days now, and so far I haven't seen any issues.

In hindsight I should probably have used a service job, since they are likely more common and have somewhat different scheduling logic compared to batch jobs.

For those who have seen this issue, do you have any kind of update block set? This could affect how new allocations are created and I want to make sure I have a good reproduction environment.

Thanks!


JanMa commented Aug 19, 2021

Hey @lgfa29, you can try out the following job. It's a basic Prometheus deployment without any special update block.
I regularly have issues when trying to update this job, for example to add more resources.

job "prometheus" {
  datacenters = ["dc1"]
  type = "service"

  group "monitoring" {
    count = 2

    constraint {
      operator = "distinct_hosts"
      value = "true"
    }

    volume "data" {
      type            = "csi"
      source          = "prometheus-disk"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"
      per_alloc       = true
    }

    network {
      port "http" {
        static = 9090
      }
    }

    service {
      name = "prometheus2"
      tags = ["prometheus2"]
      task = "prometheus"
      port = "http"

      check {
        type = "http"
        port = "http"
        path = "/-/ready"
        interval = "10s"
        timeout = "5s"
      }
    }

    task "prometheus" {
      driver = "docker"
      user = "root"

      volume_mount {
        volume      = "data"
        destination = "/prometheus"
      }

      resources {
        memory = 1024
        cpu = 1024
      }

      template {
        data = <<EOT
---
global:
  scrape_interval: 10s
  external_labels:
    __replica__: "{{ env "NOMAD_ALLOC_ID" }}"
scrape_configs:
  - job_name: "prometheus"
    scrape_interval: 10s
    consul_sd_configs:
      - server: "{{ env "attr.unique.network.ip-address" }}:8500"
        services:
          - prometheus2
    relabel_configs:
      - source_labels: ["__meta_consul_node"]
        regex: "(.*)"
        target_label: "node"
        replacement: "$1"
      - source_labels: ["__meta_consul_service_id"]
        regex: "(.*)"
        target_label: "instance"
        replacement: "$1"
      - source_labels: ["__meta_consul_dc"]
        regex: "(.*)"
        target_label: "datacenter"
        replacement: "$1"
  # Nomad metrics
  - job_name: "nomad_metrics"
    consul_sd_configs:
      - server: "{{ env "attr.unique.network.ip-address" }}:8500"
        services: ["nomad-client", "nomad"]
    relabel_configs:
      - source_labels: ["__meta_consul_tags"]
        regex: "(.*)http(.*)"
        action: keep
      - source_labels: ["__meta_consul_service"]
        regex: "(.*)"
        target_label: "job"
        replacement: "$1"
      - source_labels: ["__meta_consul_node"]
        regex: "(.*)"
        target_label: "node"
        replacement: "$1"
      - source_labels: ["__meta_consul_service_id"]
        regex: "(.*)"
        target_label: "instance"
        replacement: "$1"
      - source_labels: ["__meta_consul_dc"]
        regex: "(.*)"
        target_label: "datacenter"
        replacement: "$1"
    scrape_interval: 5s
    metrics_path: /v1/metrics
    params:
      format: ["prometheus"]
  # Consul metrics
  - job_name: "consul_metrics"
    consul_sd_configs:
      - server: "{{ env "attr.unique.network.ip-address" }}:8500"
        services: ["consul-agent"]
    relabel_configs:
      - source_labels: ["__meta_consul_tags"]
        regex: "(.*)http(.*)"
        action: keep
      - source_labels: ["__meta_consul_service"]
        regex: "(.*)"
        target_label: "job"
        replacement: "$1"
      - source_labels: ["__meta_consul_node"]
        regex: "(.*)"
        target_label: "node"
        replacement: "$1"
      - source_labels: ["__meta_consul_service_id"]
        regex: "(.*)"
        target_label: "instance"
        replacement: "$1"
      - source_labels: ["__meta_consul_dc"]
        regex: "(.*)"
        target_label: "datacenter"
        replacement: "$1"
    scrape_interval: 5s
    metrics_path: /v1/agent/metrics
    params:
      format: ["prometheus"]
EOT
        destination = "local/prometheus.yml"
      }
      config {
        image = "quay.io/prometheus/prometheus"
        ports = ["http"]
        args = [
          "--config.file=${NOMAD_TASK_DIR}/prometheus.yml",
          "--log.level=info",
          "--storage.tsdb.retention.time=1d",
          "--storage.tsdb.path=/prometheus",
          "--web.console.libraries=/usr/share/prometheus/console_libraries",
          "--web.console.templates=/usr/share/prometheus/consoles"
        ]
      }
    }
  }
}


gregory112 commented Aug 19, 2021

For those who have seen this issue, do you have any kind of update block set? This could affect how new allocations are created and I want to make sure I have a good reproduction environment.

Mine has.

  update {
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "15m"
    progress_deadline = "20m"
  }

You can also try failing the job, for example through an image pull failure or a runtime failure that makes the deployment exceed its deadline. This is what really makes the error show up frequently. I also have a service block set up to provide health checks.

Yes, please try with a service job. What made the error show up in my case was task failure, followed by stopping and re-planning the job. Re-planning without stopping can sometimes trigger the issue too, for example when updating the image version used by the job.

Also, I have two master nodes. I don't know if this contributes, but does anyone else seeing this also run two (or more) master nodes? Maybe a race condition between masters? Split brain, perhaps?

Also, here's mine

variables {
  mycluster_zookeeper_image = "zookeeper:3.6"
}

job "mycluster-zookeeper" {
  region = "global"
  datacenters = ["dc1"]
  type = "service"

  update {
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "15m"
    progress_deadline = "20m"
  }

  group "zookeeper1" {
    count = 1

    restart {
      interval = "10m"
      attempts = 2
      delay    = "10s"
      mode     = "fail"
    }

    network {
      mode = "cni/weave"
    }

    service {
      name = "mycluster-zookeeper1-peer"
      port = "2888"
      address_mode = "alloc"
    }

    service {
      name = "mycluster-zookeeper1-leader"
      port = "3888"
      address_mode = "alloc"
    }

    service {
      name = "mycluster-zookeeper1-client"
      port = "2181"
      address_mode = "alloc"
    }

    volume "data" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper1-data"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }
    volume "datalog" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper1-datalog"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "zookeeper" {
      driver = "docker"

      volume_mount {
        volume = "data"
        destination = "/data"
        read_only = false
      }
      volume_mount {
        volume = "datalog"
        destination = "/datalog"
        read_only = false
      }

      config {
        image = var.mycluster_zookeeper_image
        image_pull_timeout = "10m"
      }

      template {
        destination = "local/zookeeper.env"
        data = <<EOH
ZOO_SERVERS=server.1:0.0.0.0:2888:3888;2181 server.2:{{ with service "mycluster-zookeeper2-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.3:{{ with service "mycluster-zookeeper3-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181
EOH
        env = true
      }

      env {
        ZOO_MY_ID = "1"
        ZOO_CFG_EXTRA = "minSessionTimeout=10000"
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }
  }

  group "zookeeper2" {
    count = 1

    restart {
      interval = "10m"
      attempts = 2
      delay    = "10s"
      mode     = "fail"
    }

    network {
      mode = "cni/weave"
    }

    service {
      name = "mycluster-zookeeper2-peer"
      port = "2888"
      address_mode = "alloc"
    }

    service {
      name = "mycluster-zookeeper2-leader"
      port = "3888"
      address_mode = "alloc"
    }

    service {
      name = "mycluster-zookeeper2-client"
      port = "2181"
      address_mode = "alloc"
    }

    volume "data" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper2-data"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }
    volume "datalog" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper2-datalog"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "zookeeper" {
      driver = "docker"

      volume_mount {
        volume = "data"
        destination = "/data"
        read_only = false
      }
      volume_mount {
        volume = "datalog"
        destination = "/datalog"
        read_only = false
      }

      config {
        image = var.mycluster_zookeeper_image
        image_pull_timeout = "10m"
      }

      template {
        destination = "local/zookeeper.env"
        data = <<EOH
ZOO_SERVERS=server.1:{{ with service "mycluster-zookeeper1-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.2:0.0.0.0:2888:3888;2181 server.3:{{ with service "mycluster-zookeeper3-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181
EOH
        env = true
      }

      env {
        ZOO_MY_ID = "2"
        ZOO_CFG_EXTRA = "minSessionTimeout=10000"
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }
  }

  group "zookeeper3" {
    count = 1

    restart {
      interval = "10m"
      attempts = 2
      delay    = "10s"
      mode     = "fail"
    }

    network {
      mode = "cni/weave"
    }

    service {
      name = "mycluster-zookeeper3-peer"
      port = "2888"
      address_mode = "alloc"

      #connect {
      #  sidecar_service {
      #    disable_default_tcp_check = true
      #
      #    proxy {
      #      upstreams {
      #        destination_name = "mycluster-zookeeper1-peer"
      #        local_bind_port = "12888"
      #      }
      #      upstreams {
      #        destination_name = "mycluster-zookeeper2-peer"
      #        local_bind_port = "22888"
      #      }
      #    }
      #  }
      #}
    }

    service {
      name = "mycluster-zookeeper3-leader"
      port = "3888"
      address_mode = "alloc"

      #connect {
      #  sidecar_service {
      #    disable_default_tcp_check = true
      #
      #    proxy {
      #      upstreams {
      #        destination_name = "mycluster-zookeeper1-leader"
      #        local_bind_port = "13888"
      #      }
      #      upstreams {
      #        destination_name = "mycluster-zookeeper2-leader"
      #        local_bind_port = "23888"
      #      }
      #    }
      #  }
      #}
    }

    service {
      name = "mycluster-zookeeper3-client"
      port = "2181"
      address_mode = "alloc"

      #connect {
      #  sidecar_service {
      #    disable_default_tcp_check = true
      #    #proxy {
      #    #  upstreams {
      #    #    destination_name = "mycluster-zookeeper1-client"
      #    #    local_bind_port = "12181"
      #    #  }
      #    #  upstreams {
      #    #    destination_name = "mycluster-zookeeper2-client"
      #    #    local_bind_port = "22181"
      #    #  }
      #    #  upstreams {
      #    #    destination_name = "mycluster-zookeeper3-client"
      #    #    local_bind_port = "32181"
      #    #  }
      #    #}
      #  }
      #}
    }

    volume "data" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper3-data"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }
    volume "datalog" {
      type = "csi"
      read_only = false
      source = "mycluster-zookeeper3-datalog"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "zookeeper" {
      driver = "docker"

      volume_mount {
        volume = "data"
        destination = "/data"
        read_only = false
      }
      volume_mount {
        volume = "datalog"
        destination = "/datalog"
        read_only = false
      }

      config {
        image = var.mycluster_zookeeper_image
        image_pull_timeout = "10m"
      }

      template {
        destination = "local/zookeeper.env"
        data = <<EOH
ZOO_SERVERS=server.1:{{ with service "mycluster-zookeeper1-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.2:{{ with service "mycluster-zookeeper2-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.3:0.0.0.0:2888:3888;2181
EOH
        env = true
      }

      env {
        ZOO_MY_ID = "3"
        ZOO_CFG_EXTRA = "minSessionTimeout=10000"
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }
  }
}

And here is the HCL file I used to create all the volumes with nomad volume create:

id = "mycluster-zookeeper1-data"
name = "zookeeper1-data"
type = "csi"
plugin_id = "ceph-csi"
capacity_min = "1G"
capacity_max = "10G"

capability {
  access_mode = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type = "ext4"
  mount_flags = ["noatime"]
}

secrets {
  adminID = "redacted"
  adminKey = "redacted"
}

parameters {
  clusterID = "redacted"
  fsName = "myfsname"
  mounter = "kernel"
}


lgfa29 commented Aug 19, 2021

Thanks for the sample jobs @JanMa and @gregory112.

It seems like update is not part of the problem, so I created this job, which has a 10% chance of failing on each iteration. I will leave it running and see if the issue is triggered.

job "random-fail" {
  datacenters = ["dc1"]
  type        = "service"

  group "random-fail" {
    volume "ebs-vol" {
      type            = "csi"
      read_only       = false
      source          = "ebs-vol"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "random-fail" {
      driver = "docker"

      config {
        image   = "alpine:3.14"
        command = "/bin/ash"
        args    = ["/local/script.sh"]
      }

      template {
        data = <<EOF
#!/usr/bin/env bash

while true;
do
  echo "Rolling the dice..."

  n=$(($RANDOM % 10))
  echo "Got ${n}!"

  if  [[ 0 -eq ${n} ]];
  then
    echo "Bye :wave:"
    exit 1;
  fi

  echo "'Til the next round."
  sleep 10;
done
EOF

        destination = "local/script.sh"
      }

      volume_mount {
        volume      = "ebs-vol"
        destination = "/volume"
        read_only   = false
      }
    }
  }
}

@raven-oscar

For some reason Nomad thinks the volume is still in use when it is not. nomad volume deregister returns "Error deregistering volume: Unexpected response code: 500 (rpc error: volume in use: nessus)". Running nomad volume deregister -force followed by nomad system gc and then registering the volume again seems to help.
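
For completeness, a sketch of that recovery sequence; volume.hcl is assumed to be the original volume specification, and if the volume was created with nomad volume create, the External ID from the status output needs to be carried over as external_id before re-registering:

nomad volume status -verbose nessus    # note the External ID before dropping state
nomad volume deregister -force nessus  # drop Nomad's stale claim state
nomad system gc                        # clean up garbage-collected allocations
nomad volume register volume.hcl       # re-register with the same external_id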


lgfa29 commented Aug 24, 2021

I think I was able to reproduce this with a self-updating job:

job "update" {
  datacenters = ["dc1"]
  type        = "service"

  group "update" {
    volume "ebs-vol" {
      type            = "csi"
      read_only       = false
      source          = "ebs-vol2"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"

      mount_options {
        fs_type     = "ext4"
        mount_flags = ["noatime"]
      }
    }

    task "update" {
      driver = "docker"

      config {
        image   = "alpine:3.14"
        command = "/bin/sh"
        args    = ["/local/script.sh"]
      }

      template {
        data = <<EOF
#!/usr/bin/env bash

apk add curl jq

while true;
do
  sleep 10
  jobspec=$(curl http://172.17.0.1:4646/v1/job/${NOMAD_JOB_ID})
  cpu=$(echo $jobspec | jq '.TaskGroups[0].Tasks[0].Resources.CPU')
  if [ $cpu -eq 500 ]
  then
    cpu=600
  else
    cpu=500
  fi
  new_jobspec=$(echo $jobspec | jq ".TaskGroups[0].Tasks[0].Resources.CPU = ${cpu}")
  echo $new_jobspec | jq '{"Job":.}' | curl -H "Content-Type: application/json" -X POST --data @- http://172.17.0.1:4646/v1/job/${NOMAD_JOB_ID}
done
EOF

        destination = "local/script.sh"
      }

      volume_mount {
        volume      = "ebs-vol"
        destination = "/volume"
        read_only   = false
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

This job will usually fail first, and then work.

It's not exactly the same error message, but maybe it's related:

failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: CSI.ControllerAttachVolume: controller plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = Could not attach volume "vol-0fa6e88e4c3b55d29" to node "i-02396ab34ed318944": attachment of disk "vol-0fa6e88e4c3b55d29" failed, expected device to be attached but was detaching

Looking at the volume details page, I see the previous allocation still listed under Write Allocations for a while, even after it's marked as Complete. I don't know if this is related, since another alloc is able to start, but it may be a pointer in the right direction.

@NathanFlurry

This nasty workaround seems to be working for DigitalOcean. If your task restarts frequently, it will spam your cluster with jobs, so be careful with that.

your-job.nomad.tpl:

# ...
		task "reregister_volume" {
			lifecycle {
				hook = "poststop"
				sidecar = false
			}

			driver = "docker"

			config {
				image = "alpine:3.14"
				entrypoint = [
					"/bin/sh",
					"-eufc",
					<<-EOF
					apk add curl

					curl --fail --data '@-' -X POST \
						"http://$${attr.unique.network.ip-address}:4646/v1/job/reregister-volume/dispatch" <<-EndOfData
						{
							"Meta": {
								"csi_id": "${csi_id_prefix}[$${NOMAD_ALLOC_INDEX}]",
								"csi_id_uri_component": "${csi_id_prefix}%5B$${NOMAD_ALLOC_INDEX}%5D",
								"volume_name": "${volume_name_prefix}$${NOMAD_ALLOC_INDEX}",
								"volume_plugin_id": "${volume_plugin_id}"
							}
						}
						EndOfData
					EOF
				]
			}
		}
# ...

register-volume.nomad.tpl:

job "reregister-volume" {
	type = "batch"
	
	parameterized {
		payload = "forbidden"
		meta_required = ["csi_id", "csi_id_uri_component", "volume_name", "volume_plugin_id"]
	}

	group "reregister" {
		task "reregister" {
			driver = "docker"

			config {
				image = "alpine:3.14"
				entrypoint = [
					"/bin/sh",
					"-eufc",
					<<-EOF
					sleep 5 # Wait for job to stop

					apk add jq curl

					echo "CSI_ID=$${NOMAD_META_CSI_ID}"
					echo "CSI_ID_URI_COMPONENT=$${NOMAD_META_CSI_ID_URI_COMPONENT}"
					echo "VOLUME_NAME=$${NOMAD_META_VOLUME_NAME}"
					echo "VOLUME_PLUGIN_ID=$${NOMAD_META_VOLUME_PLUGIN_ID}"

					n=0
					until [ "$n" -ge 15 ]; do
						echo "> Checking if volume exists (attempt $n)"
						curl --fail -X GET \
							"http://$${attr.unique.network.ip-address}:4646/v1/volumes?type=csi" \
							| jq -e '. | map(.ID == "$${NOMAD_META_CSI_ID}") | any | not' && break
						n=$((n+1))

						sleep 1
						echo
						echo '> Force detaching volume'
						curl --fail -X DELETE \
							"http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}?force=true" \
							|| echo '  Detaching failed'
					done
					if [ "$n" -ge 15 ]; then
						echo '  Deregister failed too many times, giving up'
						exit 0
					else
						echo '  Deregister complete'
					fi

					echo
					echo '> Fetching external volume ID'
					VOLUME_JSON=$(
						curl --fail -X GET \
							-H "Authorization: Bearer ${digitalocean_token_persistent}" \
							"https://api.digitalocean.com/v2/volumes?name=$${NOMAD_META_VOLUME_NAME}" \
							| jq '.volumes[0]'
					)
					VOLUME_ID=$(
						echo "$VOLUME_JSON" | jq -r '.id'
					)
					VOLUME_REGION=$(
						echo "$VOLUME_JSON" | jq -r '.region.slug'
					)
					VOLUME_DROPLET_ID=$(
						echo "$VOLUME_JSON" | jq -r '.droplet_ids[0] // empty'
					)
					echo "VOLUME_ID=$VOLUME_ID"
					echo "VOLUME_REGION=$VOLUME_ID"
					echo "VOLUME_DROPLET_ID=$VOLUME_DROPLET_ID"

					if [ ! -z "$VOLUME_DROPLET_ID" ]; then
						echo
						echo '> Detaching volume on DigitalOcean'
						curl --fail -X POST \
							-H "Authorization: Bearer ${digitalocean_token_persistent}" \
							-d "{\"type\": \"detach\", \"droplet_id\": \"$VOLUME_DROPLET_ID\", \"region\": \"$VOLUME_REGION\"}" \
							"https://api.digitalocean.com/v2/volumes/$VOLUME_ID/actions"
					fi

					echo
					echo '> Re-registering volume'
					curl --fail --data '@-' -X PUT \
						"http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}" <<-EndOfVolume
						{
							"Volumes": [
								{
									"ID": "$${NOMAD_META_CSI_ID}",
									"Name": "$${NOMAD_META_VOLUME_NAME}",
									"ExternalID": "$VOLUME_ID",
									"PluginID": "$${NOMAD_META_VOLUME_PLUGIN_ID}",
									"RequestedCapabilities": [{
										"AccessMode": "single-node-writer",
										"AttachmentMode": "file-system"
									}]
								}
							]
						}
						EndOfVolume

					echo
					echo '> Reading volume'
					curl --fail -X GET \
						"http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}"

					echo
					echo 'Finished'
					EOF
				]
			}
		}
	}
}

@latifrons

Same issue on my cluster with the EFS plugin (docker: amazon/aws-efs-csi-driver:v1.3.3).
It usually happens after a restart of some node: no Write Allocations, no Read Allocations, yet mounting fails because the CSI volume has exhausted its available writer claims.

One thing worth mentioning: right after force-deregistering the volume, my Node Health recovered from (4/5) to (5/5). The EFS plugin could not be deployed to that node before deregistering.


aroundthfur commented Sep 1, 2021

[Nomad v1.1.3]

We have the exact same issue with the Linstor CSI driver (https://linbit.com/linstor/).
Our use case is to fail over a job when a server or a whole DC dies. We are still in the implementation phase, so we were intentionally killing servers where Nomad jobs are running to reproduce the fail-over scenarios.

It works a few times and Nomad does a proper job, but after a random number of reschedules it fails with the same error message and the same behavior described in this issue. In the Nomad UI volume list the volume shows an allocation, but the detailed view shows no allocations present. Basically the same behavior that @Thunderbottom reported.


MagicRB commented Sep 12, 2021

Yep, I'm running a really basic setup with just one server/client and one client, and this happens to me too. I need to -force deregister the volume and then re-register it to fix it.

@latifrons

Just a kind reminder for those who are using AWS EFS:
Make sure you are using the suggested mount options given by AWS.
https://docs.aws.amazon.com/efs/latest/ug/efs-mount-helper.html

especially the noresvport flag.

Otherwise you get infinite IO wait (100% wa in top) once the TCP connection breaks.

We haven't observed any more problems after correcting the mount options. An example of the recommended options is below.
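
For reference, a sketch of the NFS options AWS documents for mounting EFS without the mount helper; the file system ID and mount path are placeholders, and the same flags would go into the volume's mount_flags when registering it in Nomad:

sudo mount -t nfs4 \
  -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport \
  fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /mnt/efs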



tomiles commented Sep 15, 2021

Same issue on our v1.1.4 cluster using the Ceph RBD and CephFS CSI plugins.


tgross commented Jan 25, 2022

Ok, the race described above turns out to be pretty straightforward. In the client's (*csiHook) Postrun() method, we make an unpublish RPC that includes a claim in the CSIVolumeClaimStateUnpublishing state, using the mode from the client. But then in the (*CSIVolume) Unpublish RPC handler, we query the volume from the state store (because we only get an ID from the client). And when we make the client RPC for the node unpublish step, we use the current volume's view of the mode (ref nodeUnpublishVolumeImpl). If the volume's mode has been changed before the old allocations can have their claims released, then we end up making a CSI RPC that will never succeed.

Why does this code path get the mode from the volume and not the claim? Because the claim written by the GC job in (*CoreScheduler) csiVolumeClaimGC doesn't have a mode. Instead it just writes a claim in the unpublishing state to ensure the volumewatcher detects a "past claim" change and reaps all the claims on the volumes.

  • With the patch in CSI: resolve invalid claim states #11890 we should be able to safely ignore the case of the GC claim because all the past claims will be correct, so it should be safe to update nodeUnpublishVolumeImpl to always use the claims mode (maybe falling back to the volume's mode if unset?).
  • We'll also fix most cases of this via CSI: node unmount from the client before unpublish RPC #11892 because we'll drive the node unpublish step from the client which has the "real" view of what mode the volume was actually mounted with.
  • I'll take a quick pass at auditing other uses of vol.AccessMode to ensure we're not making this mistake elsewhere.


tgross commented Jan 26, 2022

I've done some testing of #11892 and that's in good shape as well, and I've walked some team mates through the rest of the PRs so that we can get them reviewed and merged one-by-one. Unfortunately I've run into a non-CSI issue around the stop_after_client_disconnect flag #11943 which is going to impact a lot of CSI use cases (we created that flag specifically for stateful workloads). But we should be able to ship an incremental patch release with a bunch of the work done so far. Stay tuned for updates on when that may land as the PRs get merged.


tgross commented Jan 28, 2022

The 4 patches for this issue (which also cover #10052, #10833, and #8734) have been merged to main. We're working up a plan for a patch release shortly, at which point I'll close out this issue. Note this won't yet fix the issues in #8609, but it does make that an automatically recoverable situation. I'm going to focus heavily on knocking out the rest of the major CSI issues over the next month or two after we've shipped this.

Thanks again for your patience, folks.


MagicRB commented Jan 28, 2022

We thank you, dude! I can finally recommend Nomad to people without a BUT; I've always wanted to be able to do that.

@RickyGrassmuck

This is awesome work @tgross! I highly appreciate the transparency and the depth at which you kept the community updated; this thread should become the gold standard for Organization <--> Community interactions!


tgross commented Feb 1, 2022

Nomad 1.2.5 has shipped with the patches described here. We still have a few important CSI issues to close out before we can call CSI GA-ready, but it should be safe to close this issue now. Please let us know if you run into the issue with the current versions of Nomad. Thanks!


m1keil commented Mar 16, 2022

@tgross I recently upgraded to v1.2.6 and I'm still hitting the issue.

After the job has been running for a few days, attempting to deploy ends up with the dreaded "CSI volume has exhausted its available writer claims and is claimed by a garbage collected allocation", followed by the nomad volume deregister --force ritual.


tgross commented Mar 16, 2022

@m1keil the problem isn't entirely eliminated until 1.3.0 (which will get backported). #12113 should cover the remaining cases.


iSchluff commented Apr 25, 2022

@tgross I have still managed to reproduce this with v1.3.0-beta.1.
I think this could be related to a firewall issue I had. The node plugins couldn't reach the storage backend anymore and so the mount was still around but the task allocation was long gone. What's the expected behavior here? Maybe the volume allocation should be released after a long timeout has expired?

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
registry    1       0         0        0       0         0     0

Placement Failure
Task Group "registry":
  * Constraint "CSI volume ceph-registry has exhausted its available writer claims and is claimed by a garbage collected allocation 45c884a0-e629-b180-eb44-c3c7a5d5e7b7; waiting for claim to be released": 3 nodes excluded by filter

Latest Deployment
ID          = 687ebec8
Status      = running
Description = Deployment is running
# nomad volume status ceph-registry
ID                   = ceph-registry
Name                 = ceph-registry
External ID          = 0001-0024-22b9524c-4a2d-4a56-9f3d-221ee05905a2-0000000000000001-af535f2b-9f9a-11ec-bf1c-0242ac110003
Plugin ID            = ceph-csi
Provider             = cephfs.csi.ceph.com
Version              = v3.6.0
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 4
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed


tgross commented Apr 25, 2022

The node plugins couldn't reach the storage backend anymore and so the mount was still around but the task allocation was long gone. What's the expected behavior here?

A bit more detail would be helpful. Which task allocation do you mean, the plugin or the allocation that mounted the volume? I would not have expected the allocation that mounted the volume to actually be gone, because we block for a very long time retrying the unmount, unless the client node was also shut down.

Did the node plugin for that node get restored to service? If so, the expectation is that the volumewatcher loop will eventually be able to detach it. Or is the client host itself gone too?

Maybe the volume allocation should be released after a long timeout has expired?

Unfortunately many storage providers will return errors if we do this. For example, imagine an AWS EBS volume attached to an EC2 instance where the Nomad client is shut down. We can never physically detach the volume until the client is restored, so all attempts to attach it elsewhere will fail unrecoverably. That's why we provide a nomad volume detach and nomad volume delete command, so that the operator can signal that the volume was detached out-of-band.
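
A brief usage sketch of those commands; the volume ID and node ID are placeholders:

nomad volume detach ebs-vol 4b3f7a12   # unpublish the volume from that client node and release the claim
nomad volume delete ebs-vol            # deregister the volume and delete it in the storage provider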


iSchluff commented Apr 25, 2022

@tgross I mean the allocation that mounted the volume. It was gone because of a node restart, and the rescheduled task remained in the stuck state since Thursday, so approx. 4 days.

During the whole time the plugin state was shown as healthy in the UI, so Nomad was not detecting that there was no actual connection to the backend. I guess ceph-csi just handles all Probe and GetCapabilities calls locally.

The network issue was eventually fixed, but the volume remained stuck.

That's why we provide a nomad volume detach and nomad volume delete command, so that the operator can signal that the volume was detached out-of-band.

I haven't had much success with detach yet, because for me it normally just blocks and doesn't give out any information. For a case like the one I described it is also hard to know which node the volume was previously attached to once the allocation has been GCed. Would it even work if the node has been restarted?

But yes, if such a case can't be handled automatically, it would be good if a volume could be unconditionally detached by the operator, even if the node has been restarted or is gone altogether.
In my case I needed to deregister and re-register the volume.

Also, the deregister/re-register dance is impossible if the volume was created with Terraform, as the volume will be deleted automatically when the resource is removed.


tgross commented Apr 25, 2022

I mean the allocation that mounted the volume. It was gone because of a node restart, and the rescheduled task remained in the stuck state since Thursday, so approx. 4 days.

During the whole time the plugin state was shown as healthy in the UI, so Nomad was not detecting that there was no actual connection to the backend. I guess ceph-csi just handles all Probe and GetCapabilities calls locally.

Ok, so that would have restarted the Node plugin tasks as well. When the node restarted, did the Node plugin tasks come back? The UI is maybe a little misleading here -- it shows the plugin as healthy if we can schedule volumes with it, not when 100% of plugins are healthy (otherwise we'd block volume operations during plugin deployments).

As for being stuck for 4 days... the evaluation for the rescheduled task hits a retry limit fairly quickly, so I suspect it's not "stuck" so much as "gave up." Once we've got the claim taken care of, re-running the job (or using nomad job eval) should force a retry. But let's figure out the claim problem first. 😀
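
As a sketch, once the claim is released the blocked placement can be retried like this (the job name is a placeholder):

nomad job eval registry     # create a new evaluation to retry the placement
# or simply re-submit the job file
nomad job run registry.nomad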

I haven't yet had much success with detach, because for me it normally just blocks and doesn't give out any information.

It blocks because it's a synchronous operation; it has to make RPC calls to the Node plugin and to the Controller plugin. You should see logs on the Node plugin (first) and then the Controller plugin (second). If both the Node plugin and Controller plugin are live, that should work and you should be able to look at logs in both plugins (and likely the Nomad leader as well) to see what's going on there.

For a case such as the one I described it is also hard to know which node the volume was attached to previously after the allocation has been gced. Would it even work if the node has been restarted?

So long as it restarted and has a running Node plugin, yes that should still work just fine. But fair point about knowing which node the volume was attached to previously.


tgross commented Apr 25, 2022

(hit submit too soon 😀 )

But yes, if such a case can't be handled automatically it would be good if a volume could be unconditionally detached by the operator even if the node has been restarted or is gone altogether.

It's not possible for us to unconditionally detach but in theory we could unconditionally free a claim. That's what deregister -force is designed to do, but yeah I realize that's not going to work out so well with Terraform.
