JSON job with ParentID set does not run #10422

Closed
angrycub opened this issue Apr 21, 2021 · 2 comments · Fixed by #10424
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. type/bug

Comments

@angrycub
Contributor

Issue

When a job is submitted with ParentID set to a non-empty value, Nomad v0.12.1 and later do not create an evaluation, and consequently the job does not run. This is a change in behavior from Nomad v0.12.0 and earlier. Removing the ParentID from the job causes the job to work as expected.
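
The workaround described above (removing ParentID before submitting) could be sketched in Python as follows. This is only an illustration and not part of the original report: the function name `strip_parent_id` is hypothetical, and it assumes the payload uses the `{"job": {...}}` wrapper shown in the job file below (a capitalized `Job` key is also handled as a convenience).

```python
import copy
import json


def strip_parent_id(payload: dict) -> dict:
    """Return a copy of a job registration payload without ParentID.

    Hypothetical helper, not a Nomad API. Assumes the job body sits under
    a "job" (or "Job") key, as in the jobspec attached to this issue.
    """
    cleaned = copy.deepcopy(payload)  # leave the caller's dict untouched
    for key in ("job", "Job"):
        job = cleaned.get(key)
        if isinstance(job, dict):
            job.pop("ParentID", None)  # no error if the field is absent
    return cleaned


if __name__ == "__main__":
    example = {
        "job": {
            "ID": "lightningCollector-lightningCollector",
            "ParentID": "daf6f552-0f38-42b4-8b54-82bab3588892",
        }
    }
    # The cleaned payload can then be written to a file and sent with curl.
    print(json.dumps(strip_parent_id(example), indent=2))
```

The helper copies the payload instead of mutating it, so the original jobspec stays available for comparison.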

Reproduction steps

Expected Result (pre-v0.12.1)

Run a Nomad v0.12.0 dev agent with the included configuration

$ ./nomad_0.12.0 agent -config=nomad.hcl -data-dir=$(pwd)/good

Submit the job to the agent

$ curl -X PUT -d @jobspec.json http://127.0.0.1:4646/v1/jobs
{"EvalCreateIndex":11,"EvalID":"129d7bb2-a028-553b-b986-b9e82806eb9a","Index":11,"JobModifyIndex":10,"KnownLeader":false,"LastContact":0,"Warnings":""}

Server logs around the submit event:

    2021-04-20T19:17:01.501-0400 [TRACE] nomad.job: job mutate results: mutator=canonicalize warnings=[] error=<nil>
    2021-04-20T19:17:01.502-0400 [TRACE] nomad.job: job mutate results: mutator=connect warnings=[] error=<nil>
    2021-04-20T19:17:01.502-0400 [TRACE] nomad.job: job mutate results: mutator=expose-check warnings=[] error=<nil>
    2021-04-20T19:17:01.502-0400 [TRACE] nomad.job: job mutate results: mutator=constraints warnings=[] error=<nil>
    2021-04-20T19:17:01.502-0400 [TRACE] nomad.job: job validate results: validator=connect warnings=[] error=<nil>
    2021-04-20T19:17:01.502-0400 [TRACE] nomad.job: job validate results: validator=expose-check warnings=[] error=<nil>
    2021-04-20T19:17:01.502-0400 [TRACE] nomad.job: job validate results: validator=validate warnings=[] error=<nil>
    2021-04-20T19:17:01.504-0400 [DEBUG] worker: dequeued evaluation: eval_id=f049d66e-1ae9-d83c-1eab-cb4c0e2b9003

Check the status of the job lightningCollector-lightningCollector. Note that it has been evaluated and produced a blocked eval based on the constraint.

$ nomad job status lightningCollector-lightningCollector 
ID            = lightningCollector-lightningCollector
Name          = lightningCollector-lightningCollector
Submit Date   = 2021-04-21T10:09:22-04:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group                               Queued  Starting  Running  Failed  Complete  Lost
lightningCollector-lightningCollector-0  1       0         0        0       0         0

Placement Failure
Task Group "lightningCollector-lightningCollector-0":
  * Constraint "${attr.unique.hostname} = blade1.lab.bulb.hr": 1 nodes excluded by filter

Latest Deployment
ID          = 254886f6
Status      = running
Description = Deployment is running

Deployed
Task Group                               Desired  Placed  Healthy  Unhealthy  Progress Deadline
lightningCollector-lightningCollector-0  1        0       0        0          N/A

Allocations
No allocations placed

Fetch the evaluation via the API

$ curl -v http://127.0.0.1:4646/v1/evaluation/129d7bb2-a028-553b-b986-b9e82806eb9a
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 4646 (#0)
> GET /v1/evaluation/129d7bb2-a028-553b-b986-b9e82806eb9a HTTP/1.1
> Host: 127.0.0.1:4646
> User-Agent: curl/7.64.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json
< Vary: Accept-Encoding
< X-Nomad-Index: 14
< X-Nomad-Knownleader: true
< X-Nomad-Lastcontact: 0
< Date: Wed, 21 Apr 2021 14:09:54 GMT
< Content-Length: 896
< 
* Connection #0 to host 127.0.0.1 left intact
{"BlockedEval":"1f2ab78e-8128-4cf1-42bd-290773d2c4c4","CreateIndex":11,"CreateTime":1619014162988133000,"DeploymentID":"254886f6-d67c-b137-10fb-20775a4da377","FailedTGAllocs":{"lightningCollector-lightningCollector-0":{"AllocationTime":45925,"ClassExhausted":null,"ClassFiltered":null,"CoalescedFailures":0,"ConstraintFiltered":{"${attr.unique.hostname} = blade1.lab.bulb.hr":1},"DimensionExhausted":null,"NodesAvailable":{"dc1":1},"NodesEvaluated":1,"NodesExhausted":0,"NodesFiltered":1,"QuotaExhausted":null,"ScoreMetaData":null,"Scores":null}},"ID":"129d7bb2-a028-553b-b986-b9e82806eb9a","JobID":"lightningCollector-lightningCollector","JobModifyIndex":10,"ModifyIndex":14,"ModifyTime":1619014163094593000,"Namespace":"default","Priority":50,"QueuedAllocations":{"lightningCollector-lightningCollector-0":1},"SnapshotIndex":11,"Status":"complete","TriggeredBy":"job-register","Type":"service"}* Closing connection 0
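
The evaluation JSON above is dense, so here is a minimal Python sketch of how the constraint-based placement failure could be pulled out of it. The function name `constraint_failures` is hypothetical; the field names (`FailedTGAllocs`, `ConstraintFiltered`) are taken from the response shown above.

```python
def constraint_failures(evaluation: dict) -> dict:
    """Map each failed task group to its ConstraintFiltered counts.

    Hypothetical helper for reading an evaluation document like the one
    returned by /v1/evaluation/<id> in this issue. Null or missing fields
    are treated as empty.
    """
    failed = evaluation.get("FailedTGAllocs") or {}
    return {
        task_group: dict((metrics or {}).get("ConstraintFiltered") or {})
        for task_group, metrics in failed.items()
    }


if __name__ == "__main__":
    # Trimmed copy of the v0.12.0 response from this issue.
    evaluation = {
        "BlockedEval": "1f2ab78e-8128-4cf1-42bd-290773d2c4c4",
        "FailedTGAllocs": {
            "lightningCollector-lightningCollector-0": {
                "ConstraintFiltered": {
                    "${attr.unique.hostname} = blade1.lab.bulb.hr": 1
                },
                "NodesEvaluated": 1,
                "NodesFiltered": 1,
            }
        },
        "Status": "complete",
    }
    for group, constraints in constraint_failures(evaluation).items():
        for constraint, count in constraints.items():
            print(f"{group}: {constraint} filtered {count} node(s)")
```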

Actual Result (v0.12.1 and beyond)

Run a Nomad v0.12.1 dev agent with the included configuration

$ rm -rf bad; ./nomad_0.12.1 agent -config=nomad.hcl -data-dir=$(pwd)/bad

Submit the job to the agent

$ curl -X PUT -d @jobspec.json http://127.0.0.1:4646/v1/jobs 
{"EvalCreateIndex":10,"EvalID":"f6a82cb2-2ef4-232c-c69a-305f0c8a1b76","Index":10,"JobModifyIndex":10,"KnownLeader":false,"LastContact":0,"Warnings":""}

Server logs around the submit event:

    2021-04-20T19:20:04.487-0400 [TRACE] nomad.job: job mutate results: mutator=canonicalize warnings=[] error=<nil>
    2021-04-20T19:20:04.487-0400 [TRACE] nomad.job: job mutate results: mutator=connect warnings=[] error=<nil>
    2021-04-20T19:20:04.487-0400 [TRACE] nomad.job: job mutate results: mutator=expose-check warnings=[] error=<nil>
    2021-04-20T19:20:04.487-0400 [TRACE] nomad.job: job mutate results: mutator=constraints warnings=[] error=<nil>
    2021-04-20T19:20:04.487-0400 [TRACE] nomad.job: job validate results: validator=connect warnings=[] error=<nil>
    2021-04-20T19:20:04.487-0400 [TRACE] nomad.job: job validate results: validator=expose-check warnings=[] error=<nil>
    2021-04-20T19:20:04.487-0400 [TRACE] nomad.job: job validate results: validator=validate warnings=[] error=<nil>

Check the status of the job lightningCollector-lightningCollector. Note that it doesn't seem to have been evaluated, because there isn't a blocked eval based on the constraint.

$ nomad job status lightningCollector-lightningCollector 
ID            = lightningCollector-lightningCollector
Name          = lightningCollector-lightningCollector
Submit Date   = 2021-04-21T10:05:06-04:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group                               Queued  Starting  Running  Failed  Complete  Lost
lightningCollector-lightningCollector-0  0       0         0        0       0         0

Allocations
No allocations placed

Fetch the evaluation via the API

$ curl -v http://127.0.0.1:4646/v1/evaluation/f6a82cb2-2ef4-232c-c69a-305f0c8a1b76                                    
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 4646 (#0)
> GET /v1/evaluation/f6a82cb2-2ef4-232c-c69a-305f0c8a1b76 HTTP/1.1
> Host: 127.0.0.1:4646
> User-Agent: curl/7.64.1
> Accept: */*
> 
< HTTP/1.1 404 Not Found
< Vary: Accept-Encoding
< X-Nomad-Index: 0
< X-Nomad-Knownleader: true
< X-Nomad-Lastcontact: 0
< Date: Wed, 21 Apr 2021 14:06:26 GMT
< Content-Length: 14
< Content-Type: text/plain; charset=utf-8
< 
* Connection #0 to host 127.0.0.1 left intact
eval not found* Closing connection 0
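
The check performed by hand above (take the EvalID from the register response, then fetch that evaluation) could be scripted roughly like this. It is a sketch under assumptions: the function names are hypothetical, the default address is the dev agent address used in this report, and only the standard library is used. A 404 is treated as "evaluation missing", which is the symptom reported here.

```python
import json
import urllib.error
import urllib.request


def evaluation_url(register_response: str,
                   base_url: str = "http://127.0.0.1:4646") -> str:
    """Build the /v1/evaluation/<id> URL from a job register response body."""
    eval_id = json.loads(register_response)["EvalID"]
    return f"{base_url}/v1/evaluation/{eval_id}"


def fetch_evaluation(url: str):
    """Return the evaluation as a dict, or None if the server answers 404."""
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # the "eval not found" case from this issue
        raise


if __name__ == "__main__":
    # Register response copied from the v0.12.1 run in this issue.
    response = (
        '{"EvalCreateIndex":10,'
        '"EvalID":"f6a82cb2-2ef4-232c-c69a-305f0c8a1b76",'
        '"Index":10,"JobModifyIndex":10}'
    )
    url = evaluation_url(response)
    print(url)
    # Requires a running agent; uncomment to perform the lookup.
    # print(fetch_evaluation(url))
```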

Job file (if appropriate)

{"job":{
  "Stop": false,
  "Region": "global",
  "Namespace": "default",
  "ID": "lightningCollector-lightningCollector",
  "ParentID": "daf6f552-0f38-42b4-8b54-82bab3588892",
  "Name": "lightningCollector-lightningCollector",
  "Type": "service",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "dc1"
  ],
  "Constraints": [
    {
      "LTarget": "",
      "RTarget": "true",
      "Operand": "distinct_hosts"
    }
  ],
  "Affinities": null,
  "Spreads": null,
  "TaskGroups": [
    {
      "Name": "lightningCollector-lightningCollector-0",
      "Count": 1,
      "Update": {
        "Stagger": 30000000000,
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000,
        "ProgressDeadline": 600000000000,
        "AutoRevert": false,
        "AutoPromote": false,
        "Canary": 0
      },
      "Migrate": {
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000
      },
      "Constraints": [
        {
          "LTarget": "",
          "RTarget": "true",
          "Operand": "distinct_hosts"
        },
        {
          "LTarget": "${attr.unique.hostname}",
          "RTarget": "blade1.lab.bulb.hr",
          "Operand": "="
        }
      ],
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 960,
        "Interval": 14400000000000,
        "Delay": 15000000000,
        "Mode": "delay"
      },
      "Tasks": [
        {
          "Name": "lightningCollector-lightningCollector-0",
          "Driver": "raw_exec",
          "User": "",
          "Config": {
            "args": [
              "-Dserver",
              "-Djava.awt.headless=true",
              "-XX:+UseG1GC",
              "-XX:GCTimeRatio=2",
              "-XX:MaxGCPauseMillis=1000",
              "-XX:G1HeapRegionSize=1",
              "-XX:ParallelGCThreads=6",
              "-XX:+ScavengeBeforeFullGC",
              "-XX:+CMSScavengeBeforeRemark",
              "-XX:+CMSClassUnloadingEnabled",
              "-Xms64m",
              "-Xmx256m",
              "-Dcom.sun.management.jmxremote",
              "-Dcom.sun.management.jmxremote.ssl=false",
              "-Dcom.sun.management.jmxremote.authenticate=false",
              "-Dcom.sun.management.jmxremote.port=0",
              "-jar",
              "local/websocket-source-1.0.0.jar"
            ],
            "command": "/usr/bin/java"
          },
          "Env": {
            "SPRING_APPLICATION_JSON": "{\"spring.metrics.export.triggers.application.includes\":\"integration**\",\"spring.cloud.stream.metrics.key\":\"lightningCollector.lightningCollector.${spring.cloud.application.guid}\",\"spring.cloud.dataflow.stream.app.label\":\"lightningCollector\",\"spring.cloud.stream.metrics.properties\":\"spring.application.name,spring.application.index,spring.cloud.application.*,spring.cloud.dataflow.*\",\"spring.application.name\":\"lightning/lightningCollector/LightningCollector\",\"server.port\":\"0\",\"spring.cloud.dataflow.stream.name\":\"lightningCollector\",\"spring.cloud.stream.bindings.output.destination\":\"LIGHTNING_INPUT\",\"spring.cloud.dataflow.stream.app.type\":\"source\"}",
            "SPRING_CLOUD_APPLICATION_GROUP": "lightningCollector",
            "INSTANCE_INDEX": "0",
            "SPRING_APPLICATION_INDEX": "0"
          },
          "Services": null,
          "Vault": null,
          "Templates": null,
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 100,
            "MemoryMB": 256,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null
          },
          "RestartPolicy": {
            "Attempts": 960,
            "Interval": 14400000000000,
            "Delay": 15000000000,
            "Mode": "delay"
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": {
            "uniqueId": "daf6f552-0f38-42b4-8b54-82bab3588892"
          },
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 2,
            "MaxFileSizeMB": 10
          },
          "Artifacts": [
            {
              "GetterSource": "http://localhost:9393/resources/maven/hr.bulb.ai.a1/websocket-source-1.0.0.jar",
              "GetterOptions": {
                "checksum": "md5:bdf0e2146031dcd0150fa9c3943a004f"
              },
              "GetterHeaders": null,
              "GetterMode": "any",
              "RelativeDest": "local"
            }
          ],
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 300,
        "Migrate": false
      },
      "Meta": {
        "uniqueId": "daf6f552-0f38-42b4-8b54-82bab3588892"
      },
      "ReschedulePolicy": {
        "Attempts": 0,
        "Interval": 0,
        "Delay": 30000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 3600000000000,
        "Unlimited": true
      },
      "Affinities": null,
      "Spreads": null,
      "Networks": null,
      "Services": null,
      "Volumes": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null
    }
  ],
  "Update": {
    "Stagger": 30000000000,
    "MaxParallel": 1,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "Payload": null,
  "Meta": {
    "uniqueId": "daf6f552-0f38-42b4-8b54-82bab3588892"
  },
  "ConsulToken": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "pending",
  "StatusDescription": "",
  "Stable": false,
  "Version": 0,
  "SubmitTime": 1608300369096493600,
  "CreateIndex": 38,
  "ModifyIndex": 38,
  "JobModifyIndex": 38
}}
@notnoop notnoop self-assigned this Apr 21, 2021
@notnoop notnoop added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label Apr 21, 2021
notnoop pushed a commit that referenced this issue Apr 23, 2021
ParentID is an internal field that Nomad sets for dispatched or parameterized jobs. Job submitters should not be able to set it directly, as that messes up children tracking.

Fixes #10422. It specifically stops the scheduler from honoring the ParentID. The reason for the failure, and why the scheduler didn't schedule the job once it was created, is very interesting and requires a follow-up in a more technical issue.
@ivanprostran

I can confirm that job deployment works after removing the ParentID property.
Thank you for your time and effort.

schmichael pushed a commit that referenced this issue May 14, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 19, 2022