
CSI Storage Monolith Providers show incorrect running status (after stopping) #8948

Closed
hongkongkiwi opened this issue Sep 23, 2020 · 3 comments

hongkongkiwi commented Sep 23, 2020

Nomad version

Nomad v0.12.5 (514b0d6)

Operating system and Environment details

Linux builder0 4.15.0-112-generic 113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Issue

After starting a system job that provides CSI storage for DigitalOcean Block Volumes, I cannot actually stop it. The containers disappear from Docker (confirmed with `docker ps`), but the allocations still appear to be running on the machine and never signal that they have died.

After stopping the job (even when I do a purge), it looks like this:

ID            = csi_digitalocean
Name          = csi_digitalocean
Submit Date   = 2020-09-23T02:15:30Z
Type          = system
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = dead (stopped)
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
monolith    0       0         5        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
f289b00a  35750b4b  monolith    15       stop     running  10h19m ago  12m12s ago
e689cbd1  35750b4b  monolith    7        stop     running  10h22m ago  10h21m ago
80431cab  cc765454  monolith    1        stop     running  19h49m ago  19h2m ago
c3363552  f82affc3  monolith    1        stop     running  19h53m ago  19h2m ago
308960b3  f82affc3  monolith    0        stop     running  19h56m ago  19h53m ago

These allocations remain forever, even after a `nomad system gc`. A call to `nomad status` shows the following:

ID                Type     Priority  Status          Submit Date
csi_digitalocean  system   50        dead (stopped)  2020-09-23T02:15:30Z

Reproduction steps

Run the following job spec, then try `nomad stop csi_digitalocean` or `nomad stop -purge csi_digitalocean`.

Job file (if appropriate)

job "csi_digitalocean" {
  region = "global"
  datacenters = ["dc1"]
  type = "system"
  group "monolith" {
    constraint {
      operator  = "distinct_hosts"
      value     = "true"
    }
    constraint {
      attribute = "${attr.cpu.arch}"
      operator = "="
      value = "amd64"
    }
    constraint {
      attribute = "${attr.kernel.name}"
      operator = "="
      value     = "linux"
    }
    # Only run this on DigitalOcean droplets
    # e.g. droplets with a droplet_id
    constraint {
      attribute = "${meta.droplet_id}"
      operator = "is_set"
    }
    # Use nomad_storage_drivers list to control which servers these are applied to
    constraint {
      attribute = "${meta.nomad_storage_drivers}"
      operator = "is_set"
    }
    constraint {
      attribute = "${meta.nomad_storage_drivers}"
      operator = "set_contains"
      value = "digitalocean"
    }
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }
    task "plugin" {
      driver = "docker"
      config {
        image = "digitalocean/do-csi-plugin:v2.0.0"
        privileged = true
        args = [
          "--endpoint=unix:///var/run/csi.sock",
          "--token=<MY_DO_TOKEN>",
          "--url=https://api.digitalocean.com/"
        ]
      }
      csi_plugin {
        id        = "digitalocean"
        type      = "monolith"
        mount_dir = "/var/run"
      }
      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Further investigation of a specific allocation with `nomad status f289b00a` shows the following:

ID                  = f289b00a-b97c-4e5d-5a2c-abec4380ad5c
Eval ID             = 992c239c
Name                = csi_digitalocean.monolith[0]
Node ID             = 35750b4b
Node Name           = <node_name>
Job ID              = csi_digitalocean
Job Version         = 15
Client Status       = running
Client Description  = Tasks are running
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created             = 10h23m ago
Modified            = 17m7s ago

Task "plugin" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/500 MHz  7.7 MiB/256 MiB  300 MiB

Task Events:
Started At     = 2020-09-22T16:48:47Z
Finished At    = N/A
Total Restarts = 1
Last Restart   = 2020-09-22T16:48:18Z

Recent Events:
Time                  Type        Description
2020-09-23T02:23:04Z  Killing     Sent interrupt. Waiting 5s before force killing
2020-09-22T16:48:47Z  Started     Task started by client
2020-09-22T16:48:18Z  Restarting  Task restarting in 27.06382292s
2020-09-22T16:48:18Z  Terminated  Exit Code: 0
2020-09-22T16:16:18Z  Started     Task started by client
2020-09-22T16:16:15Z  Driver      Downloading image
2020-09-22T16:16:15Z  Task Setup  Building Task Directory
2020-09-22T16:16:15Z  Received    Task received by client

Stopping an allocation directly with `nomad alloc stop f289b00a` shows the same behaviour: even after this, the alloc still remains, shows as running (it is not actually running), and never disappears.

@hongkongkiwi hongkongkiwi changed the title CSI Storage Monolith Providers will not stop CSI Storage Monolith Providers show incorrect running status (after stopping) Sep 23, 2020

tgross commented Sep 23, 2020

Hi @hongkongkiwi sorry to hear about that.

If the Docker container is stopped, but the allocation is left running, that suggests that something is preventing the allocation from being cleaned up on the host. And given that we're talking about CSI, it's probably a mount. Some information that would help debug this:

  • Can you get the client logs from the time the allocation was initially stopped?
  • Can you get the allocation logs for f289b00a? (Especially from when the Docker container was stopped, which should probably be the last logs we saw.)
  • Can you check mount on the client for mount points for the allocation?
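For the last item, one quick way to look for leftover mount points is to grep the host's mount table under Nomad's allocation directory. A minimal sketch, assuming the default `data_dir` of `/var/lib/nomad` (adjust the path and alloc ID for your clients):

```shell
#!/bin/sh
# Look for mount points left behind by a stuck allocation.
# ALLOC_ID is the allocation from this issue; NOMAD_DATA_DIR assumes the
# default Nomad data_dir and may differ in your client config.
ALLOC_ID="f289b00a-b97c-4e5d-5a2c-abec4380ad5c"
NOMAD_DATA_DIR="/var/lib/nomad"

# Any matching lines are mounts Nomad failed to unmount when stopping the alloc.
mount | grep "${NOMAD_DATA_DIR}/alloc/${ALLOC_ID}" || echo "no leftover mounts found"
```

If this prints any mount lines, that would support the theory that a lingering CSI mount is blocking the allocation's cleanup.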


tgross commented Nov 25, 2020

In lieu of more data, closing this with #9438, which will ship in Nomad 1.0.

@tgross tgross closed this as completed Nov 25, 2020
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 28, 2022