System jobs don't start on new agents #6089

Closed
spuder opened this issue Aug 7, 2019 · 6 comments

spuder (Contributor) commented Aug 7, 2019

Nomad version 0.9.4

I have fabio running as a system job in my nomad cluster.

job "fabio" {
  datacenters = ["dc1"]
  type = "system"

  group "fabio" {
    task "fabio" {
      driver = "docker"
      config {
        image = "fabiolb/fabio"
        network_mode = "host"
      }

      resources {
        cpu    = 200
        memory = 128
        network {
          mbits = 20
          port "lb" {
            static = 9999
          }
          port "ui" {
            static = 9998
          }
        }
      }
    }
  }
}

When I add new nomad agents, the system job does not start automatically.

According to the docs:

The system scheduler is used to register jobs that should be run on all clients that meet the job's constraints. The system scheduler is also invoked when clients join the cluster or transition into the ready state.

The Nomad agent is in the 'ready' state, and there are no errors in the logs:

Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:00.067Z [INFO ] client.fingerprint_mgr.consul: consul agent is available
Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:00.074Z [WARN ] client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=ens3
Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:00.169Z [INFO ] client.fingerprint_mgr.vault: Vault is available
Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:04.170Z [INFO ] client.plugin: starting plugin manager: plugin-type=driver
Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:04.170Z [INFO ] client.plugin: starting plugin manager: plugin-type=device
Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:04.233Z [INFO ] client: started client: node_id=xxxxxxxxx
Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:04.245Z [INFO ] client: node registration complete
Aug 07 23:24:12 agent2 nomad[7296]:     2019-08-07T23:24:12.916Z [INFO ] client: node registration complete
Aug 07 23:24:34 agent2 nomad[7296]:     2019-08-07T23:24:34.217Z [INFO ] client.driver_mgr: driver health state has changed: driver=docker previous=undetected current=healthy description=Healthy
Aug 07 23:24:40 agent2 nomad[7296]:     2019-08-07T23:24:40.372Z [INFO ] client: node registration complete

Is this expected behavior? Or should the job automatically start?

langmartin (Contributor) commented

The job should automatically start if the resources are available on the new client node and the system job isn't filtered by constraints. When you check the status of the job, does it show blocked allocations?
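
For anyone hitting the same question, one way to check this from the command line (these are standard Nomad CLI commands; the job name is the one from this issue, and the `<node-id>`/`<alloc-id>` placeholders are just illustrative):

```
# Show the job's allocations per node; a missing or "blocked"
# allocation on the new client is the symptom to look for.
nomad job status fabio

# Confirm the new client is in the ready state and not being
# filtered out by the scheduler.
nomad node status <node-id>

# Drill into any failed or blocked allocation reported above.
nomad alloc status <alloc-id>
```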

spuder (Contributor, Author) commented Aug 9, 2019

I resubmitted the Fabio job and it began to run on all nodes.

This may have been a Consul ACL issue, since I forgot to add a Consul role to the Nomad worker.

Whatever the root cause was, I can no longer reproduce the issue.
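
For context on the ACL theory above: a minimal Consul ACL policy for a Nomad client looks roughly like the following. This is a sketch based on HashiCorp's Consul integration docs, not the poster's actual configuration; the exact rules needed depend on the cluster setup.

```
# Allow the Nomad client to read its local Consul agent state.
agent_prefix "" {
  policy = "read"
}

# Allow reading node information from the catalog.
node_prefix "" {
  policy = "read"
}

# Allow registering and deregistering services (e.g. fabio's
# lb/ui ports) in Consul.
service_prefix "" {
  policy = "write"
}
```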

spuder closed this as completed Aug 9, 2019
spuder (Contributor, Author) commented Apr 8, 2020

I've upgraded to Nomad 0.10.4 and added a new Nomad agent, and the 'fabio' job does not get scheduled to it. Several days have passed, and other batch jobs are running successfully on that agent, but the system job fabio still hasn't started.

spuder reopened this Apr 8, 2020
langmartin (Contributor) commented

Hey! Sorry this has come up again. There's nothing obvious in that job file that would prevent it from running.

Could you provide the output of nomad job status fabio and, if that status shows failed allocations, the output of nomad alloc status <id>? Make sure there's no identifying data in the output.

tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
tgross added the stage/needs-verification label (issue needs verifying it still exists) Mar 3, 2021
tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 3, 2021
mikenomitch (Contributor) commented

I think this may have been fixed by PR #11054.

Going to close this out, as it's been a while and this PR, or others that have shipped in the meantime, has likely fixed it.

Please re-open if that is not the case!

github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this issue as resolved and limited the conversation to collaborators Oct 15, 2022