
Output of "nomad run" seems wrong for system job with constraints. #2381

Closed
cyrilgdn opened this issue Mar 1, 2017 · 16 comments · Fixed by #5631

Comments

@cyrilgdn

cyrilgdn commented Mar 1, 2017

Reference: https://groups.google.com/forum/#!topic/nomad-tool/t3bFTwSVgdQ

Nomad version

Nomad v0.5.4

Issue

  • When we run a system job, a placement-failure error is raised (status code 2) if one node has been excluded by a constraint, so it's impossible to know programmatically whether at least one allocation was placed successfully.
  • The output displays wrong counts for filtered nodes (see below).

Quoted from the mailing list:

We have a system job that runs on an auto scaling group (on AWS).
The instances of this group have a nomad class "foo" so the job definition is like:

job "test" {
  datacenters = ["dc1"]

  type = "system"

  constraint {
    attribute = "${node.class}"
    value     = "foo"
  }

  [...]

}

So the job will be deployed on all servers in the autoscaling group and if we scale up the group,
Nomad automatically deploys the job on the newly instantiated server.

It's really cool but at the job submission, we have strange output.

Here is our (simplified) cluster nodes:

  • A: Instance with class="bar"
  • B: Instance with class="bar"
  • C: Instance with class="baz"

  Autoscaling group:

  • D1: Instance with class="foo"

When we run the job above we have the following output:

==> Monitoring evaluation "d1e000cd"
Evaluation triggered by job "test"
Allocation "51b3d960" modified: node "a45700d3", group "test"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "d1e000cd" finished with status "complete" but failed to place all allocations:
Task Group "test" (failed to place 3 allocations):
* Class "bar" filtered 1 nodes
* Constraint "${node.class} = foo" filtered 1 nodes

I think it's because a system job has only one evaluation, but these numbers are weird:

  • Class "bar" actually filtered 2 nodes
  • The node.class constraint filtered 3 nodes (or indeed 1, if we subtract the previous line)

The output contains a specific line for class "bar" but not for class "baz", which is pretty weird.

And, our main problem is that the status code of the "nomad run" command is 2.
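For a CI pipeline hit by this, the decision logic can be sketched as below. This is a minimal illustration, not official Nomad behavior: the exit-code mapping (0 = success, 1 = error, 2 = scheduling failure) is taken from this thread, and the helper name `run_is_acceptable` is hypothetical.

```python
import subprocess  # used only in the commented usage sketch below

def run_is_acceptable(returncode: int, allow_placement_failures: bool = True) -> bool:
    """Decide whether a `nomad run` exit code should fail a CI pipeline.

    Exit-code convention as described in this thread (assumed, for
    illustration): 0 = success, 1 = error, 2 = scheduling failure.
    """
    if returncode == 0:
        return True
    if returncode == 2 and allow_placement_failures:
        # For system jobs with constraints, "scheduling failure" may just
        # mean some nodes were filtered out, which is expected.
        return True
    return False

# Usage sketch (not executed here):
# rc = subprocess.run(["nomad", "run", "job.nomad"]).returncode
# if not run_is_acceptable(rc):
#     raise SystemExit(rc)
```

Whether exit code 2 should ever be treated as success is exactly the open question in this issue; the flag above just makes the choice explicit.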

@cyrilgdn cyrilgdn changed the title Output of "nomad run" seems wrong for system job. Output of "nomad run" seems wrong for system job with constraints. Mar 1, 2017
@dansteen

dansteen commented Jun 8, 2017

+1 to this. The non-zero exit status is the real issue for us.

@tgross
Member

tgross commented Feb 23, 2018

I'm running into this problem on our Nomad clusters at Density with Nomad 0.7. Our CI/CD pipeline attempts to plan and run jobs via the Nomad API and reports failure with system jobs. The Nomad CLI's exit code 2 appears to reflect the failed allocations coming back from the API:

nomad/command/plan.go, lines 184 to 186 at 6a783e9:

if d.Stop+d.Place+d.Migrate+d.DestructiveUpdate+d.Canary > 0 {
    return 1
}

nomad/command/monitor.go, lines 317 to 321 at 6a783e9:

// Treat scheduling failures specially using a dedicated exit code.
// This makes it easier to detect failures from the CLI.
if schedFailure {
    return 2
}

I'd be happy to contribute a fix for this, but it's not totally clear what the correct behavior should be. Should there simply be more exit codes to reflect different kinds of warnings?

@stefan-caraiman

@dadgar any updates regarding this issue? We're encountering the same issue with Nomad 0.7.1.

@gerilya

gerilya commented Mar 6, 2018

We are experiencing this as well.
Server: Nomad v0.7.0-rc3
Client: Nomad v0.7.1 (0b295d3)

@jippi
Contributor

jippi commented Mar 6, 2018

A "quick" workaround is to submit the job over the HTTP API rather than the CLI and inspect the evaluation yourself.

I would expect any placement failure due to lack of resources for a system job to fail like it does today, though.

@Crypto89
Contributor

Crypto89 commented Mar 7, 2018

I ran into the same issue today as well; it looks like this is more than an exit-code issue. The scheduler reports failed allocations over the HTTP API too (so you get the same behaviour submitting over HTTP). The allocations do get scheduled properly, but the filtered nodes are reported as failed allocations.

curl -s localhost:4646/v1/evaluation/5d16340b-1ac6-625f-db46-b59d5f8534d6 | jq -r .
{
  "ID": "5d16340b-1ac6-625f-db46-b59d5f8534d6",
  "Type": "system",
  "TriggeredBy": "job-register",
  "JobID": "foo",
...
  "FailedTGAllocs": {
    "tg-foo": {
      "NodesEvaluated": 1,
      "NodesFiltered": 1,
      "NodesAvailable": {
        "zone2": 14,
        "zone3": 14,
        "zone1": 14
      },
      "ClassFiltered": {
        "class-a": 1
      },
      "ConstraintFiltered": {
        "${node.class} = class-b": 1
      },
      "NodesExhausted": 0,
      "ClassExhausted": null,
      "DimensionExhausted": null,
      "QuotaExhausted": null,
      "Scores": null,
      "AllocationTime": 30605,
      "CoalescedFailures": 38
    }
  },
...
}
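The workaround suggested earlier in this thread — inspecting the evaluation yourself — could look like the sketch below. It is only an illustration: the "benign" heuristic (failures consisting solely of class/constraint filtering, with nothing exhausted) is this reading of the JSON above, not official Nomad semantics, and `placement_failures_look_benign` is a hypothetical helper name. The sample is a trimmed-down version of the evaluation shown in the previous comment.

```python
import json

def placement_failures_look_benign(evaluation: dict) -> bool:
    """Return True if every failed task group was only filtered by
    class/constraint rules, with no resources actually exhausted."""
    failed = evaluation.get("FailedTGAllocs") or {}
    for metrics in failed.values():
        exhausted = (
            metrics.get("NodesExhausted", 0)
            or metrics.get("ClassExhausted")
            or metrics.get("DimensionExhausted")
            or metrics.get("QuotaExhausted")
        )
        if exhausted:
            return False
    return True

# Trimmed-down version of the evaluation JSON shown above.
evaluation = json.loads("""
{
  "ID": "5d16340b-1ac6-625f-db46-b59d5f8534d6",
  "Type": "system",
  "FailedTGAllocs": {
    "tg-foo": {
      "NodesEvaluated": 1,
      "NodesFiltered": 1,
      "ClassFiltered": {"class-a": 1},
      "ConstraintFiltered": {"${node.class} = class-b": 1},
      "NodesExhausted": 0,
      "ClassExhausted": null,
      "DimensionExhausted": null,
      "QuotaExhausted": null
    }
  }
}
""")

print(placement_failures_look_benign(evaluation))  # → True
```

In a pipeline this would be combined with a check that the allocations you expect were actually placed, since "benign" here only means the failures look like constraint filtering.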

@danlsgiga
Contributor

Same thing here... Running Nomad 0.7.1, whenever I use constraints with a system job in the same workflow as described in this issue, I get placement errors even though the allocations are successful. It's like Nomad is treating a constrained node as a placement failure on system jobs when it actually is not!

@SomKen

SomKen commented Apr 25, 2018

For the record, this error still appears on 0.8.1.

Example code: https://pastebin.com/raw/f7yH5Q4U

@SomKen

SomKen commented Apr 26, 2018

Follow up:

After doing a fresh install of the Nomad server and running the same job above, no errors appear in the UI. Errors still persist when running the job via the CLI.

@dmartin-isp

I've just run into this as well. I launch my jobs from Ansible, and now I have to tell Ansible that exit code 2 is OK, which is suboptimal.

@subvillion

subvillion commented Nov 2, 2018

Same with 0.8.4, but exit code 1...
So constraints don't work in system jobs through CI for me.

@jsaintro

Come on guys, this is really a bug and should be dealt with. Many, if not most, people who run a service will constrain it to a subset of nodes, and having it throw an error for such a common use case isn't good. Here's my workaround/backflip to take care of this in Ansible. At least it'll let some errors get trapped.

failed_when: 'jobresult.rc != 0 and not jobresult.stdout.find("finished with status \"complete\" but failed to place all allocations:") > -1'

@SomKen

SomKen commented Jan 15, 2019

Any news on this?

@preetapan
Member

@SomKen @jsaintro we will tackle this in the next minor release (0.9.1) after 0.9 is out. Sorry for the delay but our highest priority now is to finish the large 0.9 release which brings in GPU support, runtime plugins and more advanced scheduling improvements.

@skyrocknroll

We are getting hit by this as well. Hope you fix this fast.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2022