System jobs don't start on new agents #6089

Closed
spuder opened this issue Aug 7, 2019 · 6 comments

spuder (Contributor) commented Aug 7, 2019

Nomad version 0.9.4

I have fabio running as a system job in my nomad cluster.

job "fabio" {
  datacenters = ["dc1"]
  type = "system"

  group "fabio" {
    task "fabio" {
      driver = "docker"
      config {
        image = "fabiolb/fabio"
        network_mode = "host"
      }

      resources {
        cpu    = 200
        memory = 128
        network {
          mbits = 20
          port "lb" {
            static = 9999
          }
          port "ui" {
            static = 9998
          }
        }
      }
    }
  }
}

When I add new nomad agents, the system job does not start automatically.

According to the docs:

The system scheduler is used to register jobs that should be run on all clients that meet the job's constraints. The system scheduler is also invoked when clients join the cluster or transition into the ready state.

The Nomad agent is in the 'ready' state, and there are no errors in the logs:

Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:00.067Z [INFO ] client.fingerprint_mgr.consul: consul agent is available
Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:00.074Z [WARN ] client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=ens3
Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:00.169Z [INFO ] client.fingerprint_mgr.vault: Vault is available
Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:04.170Z [INFO ] client.plugin: starting plugin manager: plugin-type=driver
Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:04.170Z [INFO ] client.plugin: starting plugin manager: plugin-type=device
Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:04.233Z [INFO ] client: started client: node_id=xxxxxxxxx
Aug 07 23:24:04 agent2 nomad[7296]:     2019-08-07T23:24:04.245Z [INFO ] client: node registration complete
Aug 07 23:24:12 agent2 nomad[7296]:     2019-08-07T23:24:12.916Z [INFO ] client: node registration complete
Aug 07 23:24:34 agent2 nomad[7296]:     2019-08-07T23:24:34.217Z [INFO ] client.driver_mgr: driver health state has changed: driver=docker previous=undetected current=healthy description=Healthy
Aug 07 23:24:40 agent2 nomad[7296]:     2019-08-07T23:24:40.372Z [INFO ] client: node registration complete

Is this expected behavior? Or should the job automatically start?

langmartin (Contributor) commented

The job should automatically start if the resources are available on the new client node and the system job isn't filtered by constraints. When you check the status of the job, does it show blocked allocations?
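
For anyone hitting the same question, one way to check this from the command line (these are standard Nomad CLI commands; the job name is the one from this issue, and the `<node-id>`/`<alloc-id>` placeholders are just illustrative):

```
# Show the job's allocations per node; a missing or "blocked"
# allocation on the new client is the symptom to look for.
nomad job status fabio

# Confirm the new client is in the ready state and not being
# filtered out by the scheduler.
nomad node status <node-id>

# Drill into any failed or blocked allocation reported above.
nomad alloc status <alloc-id>
```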

spuder (Contributor, Author) commented Aug 9, 2019

I resubmitted the Fabio job and it began to run on all nodes.

This may have been a Consul ACL issue, since I forgot to add a Consul role to the Nomad worker.

Whatever the root cause was, I can no longer reproduce the issue.
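
For context on the ACL theory above: a minimal Consul ACL policy for a Nomad client looks roughly like the following. This is a sketch based on HashiCorp's Consul integration docs, not the poster's actual configuration; the exact rules needed depend on the cluster setup.

```
# Allow the Nomad client to read its local Consul agent state.
agent_prefix "" {
  policy = "read"
}

# Allow reading node information from the catalog.
node_prefix "" {
  policy = "read"
}

# Allow registering and deregistering services (e.g. fabio's
# lb/ui ports) in Consul.
service_prefix "" {
  policy = "write"
}
```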

spuder closed this as completed Aug 9, 2019
spuder (Contributor, Author) commented Apr 8, 2020

I've upgraded to Nomad 0.10.4 and added a new Nomad agent, and the 'fabio' job does not get scheduled to it. Several days have passed, and other batch jobs are running successfully on that agent, but the system job fabio still hasn't started.

spuder reopened this Apr 8, 2020
langmartin (Contributor) commented

Hey! Sorry this has come up again. There's nothing obvious in that job file that would prevent it from running.

Could you provide the output of nomad job status fabio and, if that status shows failed allocations, the output of nomad alloc status <id>? Make sure there's no identifying data in the output.

tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
tgross added the stage/needs-verification label (issue needs verifying it still exists) Mar 3, 2021
tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 3, 2021
mikenomitch (Contributor) commented

I think this may have been fixed by PR #11054.

Going to close this out, as it's been a while and this PR, or others that have shipped in the meantime, has likely fixed it.

Please re-open if that is not the case!

github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this issue as resolved and limited the conversation to collaborators Oct 15, 2022