
Output of "nomad run" seems wrong for system job with constraints. #2381

Closed
cyrilgdn opened this issue Mar 1, 2017 · 16 comments · Fixed by #5631

Comments

@cyrilgdn

cyrilgdn commented Mar 1, 2017

Reference: https://groups.google.com/forum/#!topic/nomad-tool/t3bFTwSVgdQ

Nomad version

Nomad v0.5.4

Issue

  • When we run a system job, a placement-failure error is raised (status code 2) if one node has been excluded by a constraint, so it's impossible to know programmatically whether at least one allocation was placed successfully.
  • The output displays wrong counts for filtered nodes (see below).

Quoted from the mailing list:

We have a system job that runs on an auto scaling group (on AWS).
The instances of this group have a nomad class "foo" so the job definition is like:

job "test" {
  datacenters = ["dc1"]

  type = "system"

  constraint {
    attribute = "${node.class}"
    value     = "foo"
  }

  [...]

}

So the job will be deployed on all servers in the autoscaling group and if we scale up the group,
Nomad automatically deploys the job on the newly instantiated server.

It's really cool but at the job submission, we have strange output.

Here is our (simplified) cluster nodes:

  • A: Instance with class="bar"
  • B: Instance with class="bar"
  • C: Instance with class="baz"

  Autoscaling group:

  • D1: Instance with class="foo"

When we run the job above we have the following output:

==> Monitoring evaluation "d1e000cd"
Evaluation triggered by job "test"
Allocation "51b3d960" modified: node "a45700d3", group "test"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "d1e000cd" finished with status "complete" but failed to place all allocations:
Task Group "test" (failed to place 3 allocations):
* Class "bar" filtered 1 nodes
* Constraint "${node.class} = foo" filtered 1 nodes

I think it's because a system job has only one evaluation, but these numbers are weird:

  • Class "bar" actually filtered 2 nodes
  • The node.class constraint filtered 3 nodes (or indeed 1, if we subtract the previous line)

The output contains a specific line for class "bar" but not for class "baz", which is pretty weird.

And, our main problem is that the status code of the "nomad run" command is 2.
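For a CI pipeline hit by this, the decision logic can be sketched as below. This is a minimal illustration, not official Nomad behavior: the exit-code mapping (0 = success, 1 = error, 2 = scheduling failure) is taken from this thread, and the helper name `run_is_acceptable` is hypothetical.

```python
import subprocess  # used only in the commented usage sketch below

def run_is_acceptable(returncode: int, allow_placement_failures: bool = True) -> bool:
    """Decide whether a `nomad run` exit code should fail a CI pipeline.

    Exit-code convention as described in this thread (assumed, for
    illustration): 0 = success, 1 = error, 2 = scheduling failure.
    """
    if returncode == 0:
        return True
    if returncode == 2 and allow_placement_failures:
        # For system jobs with constraints, "scheduling failure" may just
        # mean some nodes were filtered out, which is expected.
        return True
    return False

# Usage sketch (not executed here):
# rc = subprocess.run(["nomad", "run", "job.nomad"]).returncode
# if not run_is_acceptable(rc):
#     raise SystemExit(rc)
```

Whether exit code 2 should ever be treated as success is exactly the open question in this issue; the flag above just makes the choice explicit.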

@cyrilgdn cyrilgdn changed the title Output of "nomad run" seems wrong for system job. Output of "nomad run" seems wrong for system job with constraints. Mar 1, 2017
@dansteen

dansteen commented Jun 8, 2017

+1 to this. The non-zero exit status is the real issue for us.

@tgross
Member

tgross commented Feb 23, 2018

I'm running into this problem on our Nomad clusters at Density with Nomad 0.7. Our CI/CD pipeline attempts to plan and run jobs via the Nomad API and reports failure with system jobs. The Nomad CLI's exit code 2 appears to reflect the failed allocations coming back from the API:

nomad/command/plan.go, lines 184 to 186 at 6a783e9:

if d.Stop+d.Place+d.Migrate+d.DestructiveUpdate+d.Canary > 0 {
    return 1
}

nomad/command/monitor.go, lines 317 to 321 at 6a783e9:

// Treat scheduling failures specially using a dedicated exit code.
// This makes it easier to detect failures from the CLI.
if schedFailure {
    return 2
}

I'd be happy to contribute a fix for this, but it's not totally clear what the correct behavior should be. Should there simply be more exit codes to reflect different kinds of warnings?

@stefan-caraiman

@dadgar any updates regarding this issue? We're encountering the same issue with Nomad 0.7.1.

@gerilya

gerilya commented Mar 6, 2018

We are experiencing this as well.
Server: Nomad v0.7.0-rc3
Client: Nomad v0.7.1 (0b295d3)

@jippi
Contributor

jippi commented Mar 6, 2018

A "quick" workaround is to submit the job over the HTTP API rather than the CLI and inspect the evaluation yourself.

I would expect any placement failure due to lack of resources for a system job to fail like it does today, though.

@Crypto89
Contributor

Crypto89 commented Mar 7, 2018

I ran into the same issue today as well; it looks like this is more than an exit-code issue. The scheduler reports failed allocations over the HTTP API too (so you get the same behaviour submitting over HTTP). The allocations do get scheduled properly, but the filtered nodes are reported as failed allocations.

curl -s localhost:4646/v1/evaluation/5d16340b-1ac6-625f-db46-b59d5f8534d6 | jq -r .
{
  "ID": "5d16340b-1ac6-625f-db46-b59d5f8534d6",
  "Type": "system",
  "TriggeredBy": "job-register",
  "JobID": "foo",
...
  "FailedTGAllocs": {
    "tg-foo": {
      "NodesEvaluated": 1,
      "NodesFiltered": 1,
      "NodesAvailable": {
        "zone2": 14,
        "zone3": 14,
        "zone1": 14
      },
      "ClassFiltered": {
        "class-a": 1
      },
      "ConstraintFiltered": {
        "${node.class} = class-b": 1
      },
      "NodesExhausted": 0,
      "ClassExhausted": null,
      "DimensionExhausted": null,
      "QuotaExhausted": null,
      "Scores": null,
      "AllocationTime": 30605,
      "CoalescedFailures": 38
    }
  },
...
}
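The workaround suggested earlier in this thread — inspecting the evaluation yourself — could look like the sketch below. It is only an illustration: the "benign" heuristic (failures consisting solely of class/constraint filtering, with nothing exhausted) is this reading of the JSON above, not official Nomad semantics, and `placement_failures_look_benign` is a hypothetical helper name. The sample is a trimmed-down version of the evaluation shown in the previous comment.

```python
import json

def placement_failures_look_benign(evaluation: dict) -> bool:
    """Return True if every failed task group was only filtered by
    class/constraint rules, with no resources actually exhausted."""
    failed = evaluation.get("FailedTGAllocs") or {}
    for metrics in failed.values():
        exhausted = (
            metrics.get("NodesExhausted", 0)
            or metrics.get("ClassExhausted")
            or metrics.get("DimensionExhausted")
            or metrics.get("QuotaExhausted")
        )
        if exhausted:
            return False
    return True

# Trimmed-down version of the evaluation JSON shown above.
evaluation = json.loads("""
{
  "ID": "5d16340b-1ac6-625f-db46-b59d5f8534d6",
  "Type": "system",
  "FailedTGAllocs": {
    "tg-foo": {
      "NodesEvaluated": 1,
      "NodesFiltered": 1,
      "ClassFiltered": {"class-a": 1},
      "ConstraintFiltered": {"${node.class} = class-b": 1},
      "NodesExhausted": 0,
      "ClassExhausted": null,
      "DimensionExhausted": null,
      "QuotaExhausted": null
    }
  }
}
""")

print(placement_failures_look_benign(evaluation))  # → True
```

In a pipeline this would be combined with a check that the allocations you expect were actually placed, since "benign" here only means the failures look like constraint filtering.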

@danlsgiga
Contributor

Same thing here... Running Nomad 0.7.1, whenever I use constraints with a system job in the same workflow as described in this issue, I get placement errors even though the allocations are successful. It's like Nomad is treating a constrained node as a placement failure on system jobs when it actually is not!

@SomKen

SomKen commented Apr 25, 2018

For the record, this error still appears on 0.8.1.

Example code: https://pastebin.com/raw/f7yH5Q4U

@SomKen

SomKen commented Apr 26, 2018

Follow up:

After doing a fresh install of the Nomad server and running the same job above, no errors appear in the UI. Errors still persist when running the job via the CLI.

@dmartin-isp

I've just run into this as well. I launch my jobs from Ansible, and now I have to tell Ansible that exit code 2 is OK, which is suboptimal.

@subvillion

subvillion commented Nov 2, 2018

Same with 0.8.4, but exit code 1...
So constraints don't work in system jobs through CI for me.

@jsaintro

Come on guys, this is really a bug and should be dealt with. Many, if not most, people who run a service will constrain it to a subset of nodes, and having it throw an error for such a common use case isn't good. Here's my workaround/backflip to take care of this in Ansible. At least it'll let some errors get trapped.

failed_when: 'jobresult.rc != 0 and not jobresult.stdout.find("finished with status \"complete\" but failed to place all allocations:") > -1'

@SomKen

SomKen commented Jan 15, 2019

Any news on this?

@preetapan
Member

@SomKen @jsaintro we will tackle this in the next minor release (0.9.1) after 0.9 is out. Sorry for the delay but our highest priority now is to finish the large 0.9 release which brings in GPU support, runtime plugins and more advanced scheduling improvements.

@skyrocknroll

We are getting hit by this as well. Hope you fix this fast.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2022