Batch Jobs Do Not Run Correctly (Problem with max-plan-attempt Logic...?) #1324

Closed
cigdono opened this issue Jun 20, 2016 · 3 comments

cigdono commented Jun 20, 2016

Nomad version

Nomad v0.4.0-rc1 ('3c578fccde793a515bd1640c530f8df888a63b45')

Operating System

CentOS 7

Issue

I stood up 3 server nodes and 10 (16-core) client nodes. Once the nodes came up, I submitted 5 jobs, each with a task group count of 100. Two of the jobs changed to the running state and ran to completion (ending up in the dead state). The other 3 jobs switched to the dead state without ever going to the running state. I have included the verbose status for two of the stuck jobs below.

It looks to me as though there is an issue in the max-plan-attempt logic that was released in 0.4.0-rc1. Have you thought about the use case where a Nomad cluster running in a cloud provider is set up to auto-scale and start client nodes as jobs are queued in the system? How should the max-plan-attempt logic work when jobs are submitted to a cluster that has no client nodes?

I can "kick start" the jobs that are dead with no completions (stuck in a blocked max-plan-attempt state) by using the HTTP API and forcing an evaluation (.../v1/job/[job-id]/evaluation). Once I do this the jobs will switch from dead to running and eventually complete.

Nomad Status Output

nomad status -verbose test6
ID = test6
Name = test-06
Type = batch
Priority = 50
Datacenters = dc1
Status = dead
Periodic = false

Evaluations
ID                                    Priority  Triggered By       Status    Placement Failures
7441818d-0391-b0f2-99a9-afaae43fa0b5  50        max-plan-attempts  failed    false
acc43f15-32c3-0817-cd79-f70d67bb7523  50        max-plan-attempts  canceled  false
fa6a22fa-d59c-da86-99c1-0b45a1bfe6c6  50        job-register       failed    true
5347dbb1-650c-6493-673a-8abd8a166093  50        job-register       failed    true

Allocations
No allocations placed

nomad status -verbose test8
ID = test8
Name = test-08
Type = batch
Priority = 50
Datacenters = dc1
Status = dead
Periodic = false

Evaluations
ID                                    Priority  Triggered By       Status    Placement Failures
ad4616eb-5ad0-6d21-af97-2740eebfd710  50        max-plan-attempts  failed    false
d58adc64-557f-8145-a83e-6d1f2aa40d37  50        job-register       failed    true
9b73cda3-084c-08af-50b9-e0a102a05fce  50        job-register       complete  true

Allocations
No allocations placed

Job file

{
    "Job": {
        "Region": "global",
        "ID": "test-0N",
        "Name": "test-0N",
        "Type": "batch",
        "Priority": 50,
        "Datacenters": [
            "dc1"
        ],
        "TaskGroups": [
            {
                "Name": "test-group",
                "Count": 100,
                "Tasks": [
                    {
                        "Name": "hello-world",
                        "Driver": "docker",
                        "Config": {
                            "image": "https://docker-cache.service.consul:5000/cdi/nomad-test:v0.0.9",
                            "command": "/opt/test/bin/test_batch.py",
                            "args": ["-t","120"],
                            "network_mode": "host"
                        },
                        "Resources": {
                            "CPU": 2500,
                            "MemoryMB": 256,
                            "DiskMB": 300,
                            "IOPS": 0
                        },
                        "LogConfig": {
                           "MaxFiles": 10,
                           "MaxFileSizeMB": 10
                        }
                    }
                ]
            }
        ]
    }
}
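
A payload like the one above can be registered through the HTTP API as well; a minimal sketch, assuming the job register endpoint PUT /v1/job/<job-id> with a {"Job": ...} body and the default agent address 127.0.0.1:4646 (the file name test-0N.json and the address are illustrative):

# Register the job by PUTing the {"Job": ...} payload above to the job endpoint
curl -X PUT -d @test-0N.json http://127.0.0.1:4646/v1/job/test-0N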

dadgar commented Jun 20, 2016

Can you share the logs of the servers?

A max-plan-attempts failure occurs when the schedulers try to make placements that are rejected by the leader several times in a row. This prevents the schedulers from spinning endlessly when there is heavy cluster contention.

We retry those failed evaluations at one-minute intervals so that, if contention is reduced, those jobs can make progress again.


cigdono commented Jun 20, 2016

Here are the logs...
svr-logs.tar.gz

github-actions bot commented Dec 21, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Dec 21, 2022