
CSI Storage Monolith Providers show incorrect running status (after stopping) #8948

Closed
hongkongkiwi opened this issue Sep 23, 2020 · 3 comments

hongkongkiwi commented Sep 23, 2020

Nomad version

Nomad v0.12.5 (514b0d6)

Operating system and Environment details

Linux builder0 4.15.0-112-generic 113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Issue

After starting a system job that provides CSI storage for DigitalOcean Block Volumes, I cannot actually stop it. The containers disappear from Docker (confirmed with `docker ps`), but the allocations still appear to be running on the machine and never signal that they have died.

After stopping the job (even when I do a purge), it looks like this:

ID            = csi_digitalocean
Name          = csi_digitalocean
Submit Date   = 2020-09-23T02:15:30Z
Type          = system
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = dead (stopped)
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
monolith    0       0         5        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
f289b00a  35750b4b  monolith    15       stop     running  10h19m ago  12m12s ago
e689cbd1  35750b4b  monolith    7        stop     running  10h22m ago  10h21m ago
80431cab  cc765454  monolith    1        stop     running  19h49m ago  19h2m ago
c3363552  f82affc3  monolith    1        stop     running  19h53m ago  19h2m ago
308960b3  f82affc3  monolith    0        stop     running  19h56m ago  19h53m ago

These allocations remain forever, even after a `nomad system gc`. A call to `nomad status` shows the following:

ID                Type     Priority  Status          Submit Date
csi_digitalocean  system   50        dead (stopped)  2020-09-23T02:15:30Z

Reproduction steps

Run the following job spec, then try `nomad stop csi_digitalocean` or `nomad stop -purge csi_digitalocean`.

Job file (if appropriate)

job "csi_digitalocean" {
  region = "global"
  datacenters = ["dc1"]
  type = "system"
  group "monolith" {
    constraint {
      operator  = "distinct_hosts"
      value     = "true"
    }
    constraint {
      attribute = "${attr.cpu.arch}"
      operator = "="
      value = "amd64"
    }
    constraint {
      attribute = "${attr.kernel.name}"
      operator = "="
      value     = "linux"
    }
    # Only run this on DigitalOcean droplets
    # e.g. droplets with a droplet_id
    constraint {
      attribute = "${meta.droplet_id}"
      operator = "is_set"
    }
    # Use nomad_storage_drivers list to control which servers these are applied to
    constraint {
      attribute = "${meta.nomad_storage_drivers}"
      operator = "is_set"
    }
    constraint {
      attribute = "${meta.nomad_storage_drivers}"
      operator = "set_contains"
      value = "digitalocean"
    }
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }
    task "plugin" {
      driver = "docker"
      config {
        image = "digitalocean/do-csi-plugin:v2.0.0"
        privileged = true
        args = [
          "--endpoint=unix:///var/run/csi.sock",
          "--token=<MY_DO_TOKEN>",
          "--url=https://api.digitalocean.com/"
        ]
      }
      csi_plugin {
        id        = "digitalocean"
        type      = "monolith"
        mount_dir = "/var/run"
      }
      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Further investigation of a specific allocation with `nomad status f289b00a` shows the following:

ID                  = f289b00a-b97c-4e5d-5a2c-abec4380ad5c
Eval ID             = 992c239c
Name                = csi_digitalocean.monolith[0]
Node ID             = 35750b4b
Node Name           = <node_name>
Job ID              = csi_digitalocean
Job Version         = 15
Client Status       = running
Client Description  = Tasks are running
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created             = 10h23m ago
Modified            = 17m7s ago

Task "plugin" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/500 MHz  7.7 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2020-09-22T16:48:47Z
Finished At    = N/A
Total Restarts = 1
Last Restart   = 2020-09-22T16:48:18Z

Recent Events:
Time                  Type        Description
2020-09-23T02:23:04Z  Killing     Sent interrupt. Waiting 5s before force killing
2020-09-22T16:48:47Z  Started     Task started by client
2020-09-22T16:48:18Z  Restarting  Task restarting in 27.06382292s
2020-09-22T16:48:18Z  Terminated  Exit Code: 0
2020-09-22T16:16:18Z  Started     Task started by client
2020-09-22T16:16:15Z  Driver      Downloading image
2020-09-22T16:16:15Z  Task Setup  Building Task Directory
2020-09-22T16:16:15Z  Received    Task received by client

Stopping an allocation directly with `nomad alloc stop f289b00a` shows the same behaviour: even after this, the alloc still remains, shows as running (it is not actually running), and never disappears.

@hongkongkiwi hongkongkiwi changed the title CSI Storage Monolith Providers will not stop CSI Storage Monolith Providers show incorrect running status (after stopping) Sep 23, 2020

tgross commented Sep 23, 2020

Hi @hongkongkiwi sorry to hear about that.

If the Docker container is stopped, but the allocation is left running, that suggests that something is preventing the allocation from being cleaned up on the host. And given that we're talking about CSI, it's probably a mount. Some information that would help debug this:

  • Can you get the client logs from the time the allocation was initially stopped?
  • Can you get the allocation logs for f289b00a? (Especially from when the Docker container was stopped, which should probably be the last logs we saw.)
  • Can you check mount on the client for mount points for the allocation?
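For the last item, one quick way to look for leftover mount points is to grep the host's mount table under Nomad's allocation directory. A minimal sketch, assuming the default `data_dir` of `/var/lib/nomad` (adjust the path and alloc ID for your clients):

```shell
#!/bin/sh
# Look for mount points left behind by a stuck allocation.
# ALLOC_ID is the allocation from this issue; NOMAD_DATA_DIR assumes the
# default Nomad data_dir and may differ in your client config.
ALLOC_ID="f289b00a-b97c-4e5d-5a2c-abec4380ad5c"
NOMAD_DATA_DIR="/var/lib/nomad"

# Any matching lines are mounts Nomad failed to unmount when stopping the alloc.
mount | grep "${NOMAD_DATA_DIR}/alloc/${ALLOC_ID}" || echo "no leftover mounts found"
```

If this prints any mount lines, that would support the theory that a lingering CSI mount is blocking the allocation's cleanup.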


tgross commented Nov 25, 2020

In lieu of more data, closing this with #9438, which will ship in Nomad 1.0.

@tgross tgross closed this as completed Nov 25, 2020
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 28, 2022