
System jobs may not be allocated on newly initialized nodes #5579

Closed
notnoop opened this issue Apr 18, 2019 · 1 comment · Fixed by #5585 or #5900



notnoop commented Apr 18, 2019

Nomad version

Nomad v0.9.0 (18dd59056ee1d7b2df51256fe900a98460d3d6b9)

Operating system and Environment details

Linux, amd64 on Ubuntu 16.04

Issue

Nodes that spin up after a system job registration may never run the system job.

Looks like there is a race between node placement and the node-update hooks: the scheduler evaluates the placement of the job immediately on node registration, and because the node isn't fully initialized yet, the scheduler fails to place the allocation.

The generic schedulers mitigate this scenario with blocked evaluations that get re-evaluated on node updates. The system scheduler doesn't create blocked evals, though, as reported in #4072, so the job never runs on that node.
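
To make that contrast concrete, here is a minimal Go sketch, with hypothetical types and names rather than Nomad's actual scheduler code, of the two behaviors: the generic scheduler parks the failed placement as a blocked eval that a later node update re-triggers, while the system scheduler completes the eval without one.

```go
package main

import "fmt"

// Eval is a stand-in for Nomad's evaluation struct (hypothetical shape).
type Eval struct {
	ID     string
	Status string // "complete" or "blocked"
}

// genericSchedule: on a placement failure, create a blocked eval so that a
// later node update (e.g. drivers finishing fingerprinting) re-evaluates it.
func genericSchedule(placementFailed bool, blockedEvals chan<- *Eval) *Eval {
	e := &Eval{ID: "eval-generic", Status: "complete"}
	if placementFailed {
		e.Status = "blocked"
		blockedEvals <- e // re-dequeued on the next node update
	}
	return e
}

// systemSchedule: the failure lands in FailedTGAllocs but no blocked eval is
// created (the behavior reported in #4072), so nothing re-triggers placement
// once the node's drivers become healthy.
func systemSchedule(placementFailed bool) *Eval {
	_ = placementFailed // recorded as a failed alloc, but otherwise dropped
	return &Eval{ID: "eval-system", Status: "complete"}
}

func main() {
	blockedEvals := make(chan *Eval, 1)
	fmt.Println(genericSchedule(true, blockedEvals).Status) // blocked
	fmt.Println(systemSchedule(true).Status)                // complete: alloc never placed
}
```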

Initially, I thought it was due to the driver manager being async, but on closer look that might not be the issue. I added a 30-second sleep in exec fingerprinting, and comparing the client and server logs, the server only dequeues evals after the node has detected all drivers, rather than immediately when the node comes up, so the race is probably occurring somewhere else.
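
For reference, a rough Go sketch of that delay experiment; the names are illustrative, not Nomad's real driver plugin interface:

```go
package main

import (
	"fmt"
	"time"
)

// Fingerprint is a stand-in for a driver health report (hypothetical shape).
type Fingerprint struct {
	Driver string
	Health string // "healthy" or "undetected"
}

// fingerprintExec simulates the exec driver's fingerprint with the injected
// 30-second sleep used in the experiment.
func fingerprintExec(out chan<- Fingerprint) {
	time.Sleep(30 * time.Second) // artificial delay before first health report
	out <- Fingerprint{Driver: "exec", Health: "healthy"}
}

func main() {
	ch := make(chan Fingerprint, 1)
	go fingerprintExec(ch)

	// The node registers here, before the driver reports healthy; the
	// question is whether the server schedules against it in this window.
	fmt.Println("node registered; exec driver still undetected")

	fp := <-ch
	fmt.Printf("driver %s now %s\n", fp.Driver, fp.Health)
}
```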

[1] #4072

Reproduction steps

  1. Spin up a Nomad server with some clients.
  2. Submit the job file below; note that it runs on all existing clients.
  3. Spin up a new client.
  4. Run nomad node status and nomad job status example.

Expected a new alloc on all nodes, but the new node doesn't have one:

vagrant@nomad-server-01:~$ nomad node status
ID        DC   Name             Class   Drain  Eligibility  Status
451d1b74  dc1  nomad-client-04  <none>  false  eligible     ready
f02fc6f9  dc1  nomad-client-03  <none>  false  eligible     ready
e2ec56fd  dc1  nomad-client-02  <none>  false  eligible     ready
13720d7f  dc1  nomad-client-02  <none>  false  eligible     down
03ce9e96  dc1  nomad-client-01  <none>  false  eligible     ready
vagrant@nomad-server-01:~$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2019-04-18T14:15:21Z
Type          = system
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         3        0       0         1

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
75601224  f02fc6f9  cache       0        run      running  37m30s ago  37m27s ago
10f01367  e2ec56fd  cache       0        run      running  47m52s ago  47m51s ago
65a89311  13720d7f  cache       0        stop     lost     50m19s ago  49m7s ago
87c23673  03ce9e96  cache       0        run      running  53m5s ago   53m3s ago

Note that node 451d1b74 doesn't have a corresponding alloc. Below you can find the failed evaluation for the job, triggered by the node update.

Job file (if appropriate)

job "example" {
  datacenters = ["dc1"]
  type = "system"

  group "cache" {
    task "redis" {
      driver = "exec"

      config {
        command = "/bin/sleep"
        args = ["50000"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Failed to Place Evaluation

I found the following evaluation for node `451d1b74` and the job, showing the placement failure:

{
  "AnnotatePlan": false,
  "BlockedEval": "",
  "ClassEligibility": null,
  "CreateIndex": 76,
  "DeploymentID": "",
  "EscapedComputedClass": false,
  "FailedTGAllocs": {
    "cache": {
      "AllocationTime": 6493,
      "ClassExhausted": null,
      "ClassFiltered": null,
      "CoalescedFailures": 0,
      "ConstraintFiltered": {
        "missing drivers": 1
      },
      "DimensionExhausted": null,
      "NodesAvailable": {
        "dc1": 4
      },
      "NodesEvaluated": 1,
      "NodesExhausted": 0,
      "NodesFiltered": 1,
      "QuotaExhausted": null,
      "ScoreMetaData": null,
      "Scores": null
    }
  },
  "ID": "a6bfbdfc-94c4-ac26-7b90-7f3c5d5711c9",
  "JobID": "example",
  "JobModifyIndex": 0,
  "LeaderACL": "",
  "ModifyIndex": 77,
  "Namespace": "default",
  "NextEval": "",
  "NodeID": "451d1b74-107e-ba71-7386-9cdc3265094d",
  "NodeModifyIndex": 75,
  "PreviousEval": "",
  "Priority": 50,
  "QueuedAllocations": {
    "cache": 0
  },
  "QuotaLimitReached": "",
  "SnapshotIndex": 76,
  "Status": "complete",
  "StatusDescription": "",
  "TriggeredBy": "node-update",
  "Type": "system",
  "Wait": 0,
  "WaitUntil": "0001-01-01T00:00:00Z"
}


Nomad Client logs (if appropriate)

Apr 18 14:37:09 nomad-client-04 nomad[13539]: 2019-04-18T14:37:05.267Z [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Apr 18 14:37:09 nomad-client-04 nomad[13539]: 2019-04-18T14:37:05.267Z [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Apr 18 14:37:09 nomad-client-04 nomad[13539]: 2019-04-18T14:37:09.293Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=raw_exec health=undetected description=disabled
Apr 18 14:37:09 nomad-client-04 nomad[13539]: 2019-04-18T14:37:09.294Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=rkt health=undetected description="Failed to execute rkt version: exec: "rkt": executable file not found in $PATH"
Apr 18 14:37:14 nomad-client-04 nomad[13539]: 2019-04-18T14:37:14.293Z [DEBUG] client.driver_mgr: detected drivers: drivers="[mock_driver exec]"
Apr 18 14:37:39 nomad-client-04 nomad[13539]: 2019-04-18T14:37:39.294Z [DEBUG] client.driver_mgr: initial driver fingerprint: driver=exec health=healthy description=Healthy


Nomad Server logs (if appropriate)

Apr 18 14:37:39 nomad-server-01 nomad[13543]: 2019-04-18T14:37:39.663Z [DEBUG] worker: dequeued evaluation: eval_id=a6bfbdfc-94c4-ac26-7b90-7f3c5d5711c9
Apr 18 14:37:39 nomad-server-01 nomad[13543]: 2019-04-18T14:37:39.664Z [DEBUG] worker.system_sched: reconciled current state with desired state: eval_id=a6bfbdfc-94c4-ac26-7b90-7f3c5d5711c9 job_id=example namespace=default place=1 update=0 migrate=0 stop=0 ignore=3 lost=0
Apr 18 14:37:39 nomad-server-01 nomad[13543]: 2019-04-18T14:37:39.664Z [DEBUG] worker.system_sched: setting eval status: eval_id=a6bfbdfc-94c4-ac26-7b90-7f3c5d5711c9 job_id=example namespace=default status=complete
Apr 18 14:37:39 nomad-server-01 nomad[13543]: 2019-04-18T14:37:39.667Z [DEBUG] worker: updated evaluation: eval="<Eval "a6bfbdfc-94c4-ac26-7b90-7f3c5d5711c9" JobID: "example" Namespace: "default">"
Apr 18 14:37:39 nomad-server-01 nomad[13543]: 2019-04-18T14:37:39.667Z [DEBUG] worker: ack evaluation: eval_id=a6bfbdfc-94c4-ac26-7b90-7f3c5d5711c9

notnoop pushed a commit that referenced this issue Apr 19, 2019
Here we retain the 0.8.7 behavior of waiting for driver fingerprints before registering a node, with some timeout. This is needed for system jobs, as system job scheduling for a node occurs at node registration, and the race might mean that a system job doesn't get placed on the node because of missing drivers.

The timeout isn't strictly necessary, but I'm raising it to 1 minute as that is closer to blocking indefinitely than 1 second is. We need to keep the value high enough to capture as many drivers/devices as possible, but low enough that it doesn't risk blocking too long on a misbehaving plugin.

Fixes #5579
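
A minimal Go sketch of the wait-with-timeout pattern the commit message describes (assumed names, not the actual Nomad client code):

```go
package main

import (
	"fmt"
	"time"
)

// waitForFingerprints blocks until driver fingerprinting signals completion,
// or until the timeout elapses, whichever comes first.
func waitForFingerprints(done <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-done:
		return true // all drivers fingerprinted; register with full info
	case <-time.After(timeout):
		return false // don't block forever on a misbehaving plugin
	}
}

func main() {
	done := make(chan struct{})
	go func() {
		time.Sleep(100 * time.Millisecond) // simulated fingerprinting work
		close(done)
	}()

	// The commit raises the timeout to 1 minute, closer to "wait indefinitely"
	// than the old 1-second value, while still bounding a stuck plugin.
	if waitForFingerprints(done, time.Minute) {
		fmt.Println("registering node with complete driver fingerprints")
	} else {
		fmt.Println("timed out; registering node with partial fingerprints")
	}
}
```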
preetapan pushed a commit that referenced this issue Apr 22, 2019 (same commit message as above)
github-actions bot commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2022