Unable to normally stop and purge system job with csi plugin #11758

Closed
ygersie opened this issue Jan 3, 2022 · 7 comments · Fixed by #12114
Labels: stage/accepted, theme/storage, type/bug
Milestone: 1.3.0

ygersie commented Jan 3, 2022

Nomad version

v1.2.3

Operating system and Environment details

macOS, nomad agent -dev setup

Issue

Stopping and purging a failed system job that has a csi_plugin stanza does not work normally, and passing -purge can unexpectedly restart the job.

Reproduction steps

Run the example job below.

job "example" {
  datacenters = ["dc1"]
  type        = "system"

  group "example" {
    task "example" {
      driver = "docker"
      config {
        image = "alpine"
        args  = ["/bin/sh", "-c", "exit 1"]
      }

      restart {
        attempts = 1
        interval = "10s"
        delay    = "5s"
        mode     = "fail"
      }

      csi_plugin {
        id        = "example"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}
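
Assuming the spec above is saved as example.nomad (the filename is just for illustration), it can be submitted with:

nomad job run example.nomad

Because the alpine task exits 1 immediately and the restart stanza allows only one attempt with mode = "fail", the allocation reaches the failed state within seconds.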

Wait until the job transitions to the failed state, then stop and purge it:

nomad job stop -purge example

Now check the status of the job:

$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2022-01-03T09:33:10+01:00
Type          = system
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
example     0       0         0        1       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created    Modified
86f91d4c  73681f8d  example     0        run      failed  2m52s ago  2m42s ago

This should have returned a not-found error, but the job is still there and the Desired column reads run. Re-running nomad job stop -purge example doesn't change the outcome until a GC has run. Now trigger a GC with nomad system gc and rerun the stop -purge; the result becomes:

$ nomad job stop -purge example
==> 2022-01-03T09:44:23+01:00: Monitoring evaluation "ed1f4546"
    2022-01-03T09:44:23+01:00: Evaluation triggered by job "example"
    2022-01-03T09:44:23+01:00: Allocation "c8c61822" created: node "73681f8d", group "example"
==> 2022-01-03T09:44:24+01:00: Monitoring evaluation "ed1f4546"
    2022-01-03T09:44:24+01:00: Evaluation status changed: "pending" -> "complete"
==> 2022-01-03T09:44:24+01:00: Evaluation "ed1f4546" finished with status "complete"

Instead of stopping the job, the purge actually recreates the allocation.
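
For comparison, a successful purge should leave no trace of the job; the status check would then return a not-found error along these lines (exact wording may differ between Nomad versions):

$ nomad job status example
No job(s) with prefix or id "example" found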

jrasell commented Jan 3, 2022

Hi @ygersie and thanks for providing such a detailed reproduction. I ran through this locally and got the same results as you detailed. These results are very unexpected.

@jrasell jrasell added the stage/accepted and theme/storage labels Jan 3, 2022
@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Jan 3, 2022
@jrasell jrasell moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Jan 3, 2022
tgross commented Jan 3, 2022

I suspect this is related to another CSI plugin counts issue; I'm taking a pass through our open CSI issues over the next few weeks and will look at this as part of that work.

tgross commented Feb 3, 2022

Noting here that I've marked #11114 as a duplicate of this one. #10073 may also ultimately be a duplicate but I'll leave that open for the time being as the cause is subtly different.

It looks like there are two parts to this:

  • The plugin counts don't accurately match the state of allocations. I'm working up a patch that resolves plugin counts in a similar fashion to what we did for volume claims in CSI: resolve invalid claim states #11890.
  • There may also be a race between how we trigger allocation stops from a job purge, which in turn triggers the plugin GC, and how we purge the job. We can't GC the plugin until the allocation is terminal, but perhaps (incorrectly) can't GC the job until the plugin is GC'd. I'll be looking into this as well; a sketch of this ordering follows the list.
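
To make the suspected ordering constraint concrete, here is a minimal Go sketch. All type and function names (state, pluginGC, jobGC) are hypothetical illustrations for this issue, not Nomad's actual internals:

package main

import "fmt"

// Hypothetical states for the purge flow; names are illustrative only.
type state struct {
    allocsTerminal bool // allocations stopped by the purge eval
    pluginGCed     bool // CSI plugin entry removed from state
    jobGCed        bool // job removed from state
}

// pluginGC may only run once every allocation backing the plugin is terminal.
func pluginGC(s *state) error {
    if !s.allocsTerminal {
        return fmt.Errorf("plugin GC blocked: allocations still running")
    }
    s.pluginGCed = true
    return nil
}

// jobGC (incorrectly, per the comment above) refuses to run until the plugin
// is GC'd, so if it races ahead of the plugin GC the job gets stuck.
func jobGC(s *state) error {
    if !s.pluginGCed {
        return fmt.Errorf("job GC blocked: plugin still registered")
    }
    s.jobGCed = true
    return nil
}

func main() {
    s := &state{}
    fmt.Println(jobGC(s))    // blocked: plugin not GC'd yet
    s.allocsTerminal = true  // allocs eventually become terminal
    fmt.Println(pluginGC(s)) // now succeeds
    fmt.Println(jobGC(s))    // now succeeds
}

If the purge path runs job GC before the plugin GC has caught up, the job lingers in the purgeable-but-running limbo seen in the repro above.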

tgross commented Feb 8, 2022

I've opened #12027 which is targeting #9810 and #10073 but may be a partial fix for this issue. Once I've got that merged I'll be digging into this.

@tgross tgross removed this from In Progress in Nomad - Community Issues Triage Feb 9, 2022
@tgross tgross added this to the 1.3.0 milestone Feb 17, 2022
tgross commented Feb 23, 2022

Ok, so following #12027, #10073, and #12078 we've almost got this one resolved. There's just one bug left: we can't deregister the job because it tries to delete a plugin that no longer exists:

2022-02-23T20:51:35.777Z [ERROR] nomad.fsm: deregistering job failed: error="DeleteJob failed: deleting job from plugin: plugin missing: example " job=badplugin namespace=default
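
A plausible shape for the fix, purely as a hedged sketch (stateStore and deleteJobFromPlugin here are stand-ins, not Nomad's actual state-store code), is to treat a missing plugin as a no-op during job deregistration rather than a hard failure:

package main

import "fmt"

// stateStore stands in for the server's state store; purely illustrative.
type stateStore struct {
    plugins map[string]struct{}
}

// deleteJobFromPlugin sketches the tolerant behavior: a plugin that was
// already GC'd should not fail the whole job deregistration with
// "plugin missing".
func (s *stateStore) deleteJobFromPlugin(pluginID string) error {
    if _, ok := s.plugins[pluginID]; !ok {
        return nil // plugin already gone; deregistration can proceed
    }
    delete(s.plugins, pluginID)
    return nil
}

func main() {
    s := &stateStore{plugins: map[string]struct{}{}}
    // Plugin "example" was GC'd earlier; deleting the job must still succeed.
    fmt.Println(s.deleteJobFromPlugin("example")) // <nil>
}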

tgross commented Feb 23, 2022

Fixed in #12114! That'll ship in Nomad 1.3.0.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022