System job keeps running after I try to remove it from a DC #11373

Closed
mikehardenize opened this issue Oct 22, 2021 · 4 comments · Fixed by #11391
Comments

@mikehardenize

Nomad version

Nomad v1.1.5 (117a23d)

Operating system and Environment details

Centos 7

Issue

I have two nomad agents in different DCs. One in us-east4-a and another in us-east4-b.
I created a system job, but it only had datacenters = ["us-east4-a"] so it only ran on one of the agents.
I then updated the job to contain datacenters = ["us-east4-a", "us-east4-b"] and re-ran it. It then started running on both agents (as expected).
However, I then switched it back to datacenters = ["us-east4-a"] and re-ran the job, and it unexpectedly continued running on the us-east4-b agent.
When I do a "nomad status jobname" it has "Datacenters = us-east4-a" in the output, but it also lists an allocation for each agent:

# nomad status traefik
ID            = traefik
Name          = traefik
Submit Date   = 2021-10-22T09:26:15Z
Type          = system
Priority      = 50
Datacenters   = us-east4-a
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
traefik     0       0         2        7       25        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
aa4cea90  004fd7be  traefik     42       run      running  5m52s ago   5m48s ago
629a220d  51bd0d56  traefik     43       run      running  14d18h ago  3m17s ago
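
For anyone trying to confirm the same behavior: the stray allocation can be cross-checked against the job spec with the Nomad Go API, by comparing each allocation's node datacenter with the job's datacenters list. The snippet below is only an illustrative, untested sketch (the "traefik" job ID comes from this issue; the client just uses the default NOMAD_ADDR configuration):

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Connect using the default configuration (NOMAD_ADDR etc. from the environment).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Read the job's configured datacenters.
	job, _, err := client.Jobs().Info("traefik", nil)
	if err != nil {
		log.Fatal(err)
	}
	wantDC := map[string]bool{}
	for _, dc := range job.Datacenters {
		wantDC[dc] = true
	}

	// List the job's allocations and flag any running alloc whose node
	// sits in a datacenter the job no longer targets.
	allocs, _, err := client.Jobs().Allocations("traefik", false, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, alloc := range allocs {
		node, _, err := client.Nodes().Info(alloc.NodeID, nil)
		if err != nil {
			log.Fatal(err)
		}
		if alloc.ClientStatus == "running" && !wantDC[node.Datacenter] {
			fmt.Printf("alloc %s runs in %s, not in job datacenters %v\n",
				alloc.ID, node.Datacenter, job.Datacenters)
		}
	}
}
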
@DerekStrickland DerekStrickland self-assigned this Oct 22, 2021
@DerekStrickland DerekStrickland added this to Needs Triage in Nomad - Community Issues Triage via automation Oct 22, 2021
@DerekStrickland DerekStrickland moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Oct 22, 2021
@DerekStrickland
Contributor

Hi @mikehardenize,

Thanks for using Nomad! Would you mind posting your full job file (without any secrets) for me to take a look at?

@mikehardenize
Author

job "traefik" {

    type = "system"

    datacenters = ["us-east4-a"]
    
    constraint {
        attribute = "${node.class}"
        value     = "job"
    }

    group "traefik" {
        
        network {
            port "http" {
                static = 80
            }
            port "https" {
                static = 443
            }
        }

        volume "traefik" {
            type      = "host"
            read_only = false
            source    = "traefik"
        }

        task "traefik" {
            driver = "docker"

            service {
                name = "traefik-http"
                port = "http"
                check {
                    type     = "http"
                    path     = "/ping"
                    interval = "5s"
                    timeout  = "2s"
                }
            }

            volume_mount {
                volume      = "traefik"
                destination = "/host"
                read_only   = false
            }

            config {
                image          = "traefik:2.5"
                cap_add        = ["net_raw"]
                ports          = ["http", "https"]
                network_mode   = "host"
                dns_servers    = ["127.0.0.1"]
                auth_soft_fail = true
            }

        }
    }
}

@notnoop
Contributor

notnoop commented Oct 25, 2021

Thanks @mikehardenize for reporting the bug. I was able to reproduce it and identify the causes. We'll have a fix PR soon.

notnoop pushed a commit that referenced this issue Oct 27, 2021
The system scheduler should leave allocs on draining nodes as-is, but
stop allocs on nodes that are no longer part of the job's datacenters.

Previously, the scheduler did not make that distinction and left system
job allocs intact if they were already running.

I've added a failing test first, which you can see in https://app.circleci.com/jobs/github/hashicorp/nomad/179661.

Fixes #11373
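
For context, the behavior change can be sketched roughly as follows. This is an illustrative example only, not the actual code from #11391: the nodeInfo type, the decision constants, and reconcileExistingAlloc are made-up names standing in for the scheduler's real reconciliation logic.

package main

import "fmt"

// nodeInfo is a stand-in for the node attributes the scheduler looks at.
type nodeInfo struct {
	Datacenter string
	Draining   bool
}

type decision string

const (
	ignoreAlloc decision = "ignore" // leave the alloc alone; the drain handles it
	keepAlloc   decision = "keep"   // node still eligible, keep the alloc
	stopAlloc   decision = "stop"   // node no longer targeted, stop the alloc
)

// reconcileExistingAlloc decides what to do with a system-job alloc that is
// already running on a node.
func reconcileExistingAlloc(node nodeInfo, jobDatacenters []string) decision {
	// Allocs on draining nodes are left as-is.
	if node.Draining {
		return ignoreAlloc
	}
	// The fix: if the node's datacenter is no longer in the job's datacenter
	// list, the alloc must be stopped instead of being left running just
	// because it already exists.
	for _, dc := range jobDatacenters {
		if dc == node.Datacenter {
			return keepAlloc
		}
	}
	return stopAlloc
}

func main() {
	// The scenario from this issue: the job was switched back to
	// datacenters = ["us-east4-a"], but an alloc is still running on a
	// node in us-east4-b.
	fmt.Println(reconcileExistingAlloc(nodeInfo{Datacenter: "us-east4-b"}, []string{"us-east4-a"}))
	// Prints: stop
}
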
@tgross tgross added this to the 1.2.0 milestone Nov 8, 2021
@tgross tgross removed this from Triaging in Nomad - Community Issues Triage Nov 8, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 14, 2022