Batch Jobs Do Not Run Correctly (Problem with max-plan-attempt Logic...?) #1324

Closed
cigdono opened this issue Jun 20, 2016 · 3 comments

cigdono commented Jun 20, 2016

Nomad version

Nomad v0.4.0-rc1 ('3c578fccde793a515bd1640c530f8df888a63b45')

Operating System

CentOS 7

Issue

I stood up 3 server nodes and 10 (16-core) client nodes. Once the nodes came up, I submitted 5 jobs, each with a task group count of 100. Two of the jobs changed to the running state and ran to completion (ending up in the dead state). The other 3 jobs switched to the dead state without ever going to the running state. I have included the verbose status for two of the stuck jobs below.

It looks to me as though there is an issue in the max-plan-attempt logic that was released in 0.4.0-rc1. Have you thought about the use case where a Nomad cluster running in a cloud provider is set up to auto-scale and start client nodes as jobs are queued in the system? How should the max-plan-attempt logic work when jobs are submitted to a cluster that has no client nodes?

I can "kick start" the jobs that are dead with no completions (stuck in a blocked max-plan-attempt state) by using the HTTP API and forcing an evaluation (.../v1/job/[job-id]/evaluation). Once I do this the jobs will switch from dead to running and eventually complete.

Nomad Status Output

nomad status -verbose test6
ID = test6
Name = test-06
Type = batch
Priority = 50
Datacenters = dc1
Status = dead
Periodic = false

Evaluations
ID                                    Priority  Triggered By       Status    Placement Failures
7441818d-0391-b0f2-99a9-afaae43fa0b5  50        max-plan-attempts  failed    false
acc43f15-32c3-0817-cd79-f70d67bb7523  50        max-plan-attempts  canceled  false
fa6a22fa-d59c-da86-99c1-0b45a1bfe6c6  50        job-register       failed    true
5347dbb1-650c-6493-673a-8abd8a166093  50        job-register       failed    true

Allocations
No allocations placed

nomad status -verbose test8
ID = test8
Name = test-08
Type = batch
Priority = 50
Datacenters = dc1
Status = dead
Periodic = false

Evaluations
ID                                    Priority  Triggered By       Status    Placement Failures
ad4616eb-5ad0-6d21-af97-2740eebfd710  50        max-plan-attempts  failed    false
d58adc64-557f-8145-a83e-6d1f2aa40d37  50        job-register       failed    true
9b73cda3-084c-08af-50b9-e0a102a05fce  50        job-register       complete  true

Allocations
No allocations placed

Job file

{
    "Job": {
        "Region": "global",
        "ID": "test-0N",
        "Name": "test-0N",
        "Type": "batch",
        "Priority": 50,
        "Datacenters": [
            "dc1"
        ],
        "TaskGroups": [
            {
                "Name": "test-group",
                "Count": 100,
                "Tasks": [
                    {
                        "Name": "hello-world",
                        "Driver": "docker",
                        "Config": {
                            "image": "https://docker-cache.service.consul:5000/cdi/nomad-test:v0.0.9",
                            "command": "/opt/test/bin/test_batch.py",
                            "args": ["-t","120"],
                            "network_mode": "host"
                        },
                        "Resources": {
                            "CPU": 2500,
                            "MemoryMB": 256,
                            "DiskMB": 300,
                            "IOPS": 0
                        },
                        "LogConfig": {
                           "MaxFiles": 10,
                           "MaxFileSizeMB": 10
                        }
                    }
                ]
            }
        ]
    }
}
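
A payload like the one above can be registered through the HTTP API as well; a minimal sketch, assuming the job register endpoint PUT /v1/job/<job-id> with a {"Job": ...} body and the default agent address 127.0.0.1:4646 (the file name test-0N.json and the address are illustrative):

# Register the job by PUTing the {"Job": ...} payload above to the job endpoint
curl -X PUT -d @test-0N.json http://127.0.0.1:4646/v1/job/test-0N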

dadgar commented Jun 20, 2016

Can you share the logs of the servers?

A max-plan-attempts failure occurs when the schedulers try to make placements that are rejected by the leader several times in a row. This prevents the schedulers from spinning endlessly when there is heavy cluster contention.

We retry those failed evaluations at one-minute intervals so that, if contention is reduced, those jobs can make progress again.


cigdono commented Jun 20, 2016

Here are the logs...
svr-logs.tar.gz

github-actions bot commented Dec 21, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Dec 21, 2022