
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true #11052

Closed
madsholden opened this issue Aug 13, 2021 · 12 comments · Fixed by #16681
@madsholden

Nomad version

Nomad v1.1.0 (2678c36)

Operating system and Environment details

Ubuntu 20.04.2 on EC2

Issue

I have a periodic job set to run every half hour, with prohibit_overlap set to true. It usually finishes much faster, but this time it has been running longer, which is expected. However, Nomad has started three instances of this job. Checking the Nomad server logs, I noticed that both times it started a new instance, the servers were in the middle of a leader election.

Reproduction steps

I can't reproduce it; a new instance usually isn't started while one is already running.

Expected Result

Nomad should never run two instances of the same periodic job at the same time.

Actual Result

Three instances of the job are running.

Job file (if appropriate)

Only relevant parts included. Please let me know if you want more of the job file.

job "data-remover" {
  type = "batch"

  periodic {
    cron = "30 * * * *"
    prohibit_overlap = true
  }

  group "data-remover" {
    task "data-remover" {
      driver = "docker"

Nomad Server logs (if appropriate)

From the Nomad UI:
Aug 03, '21 01:33:58 +0200 | Received | Task received by client
Aug 12, '21 03:11:02 +0200 | Received | Task received by client

Server logs around those times:

2021-08-12 01:11:02.673	ip-10-49-15-7
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.673964Z","peer":{"Suffrage":0,"ID":"674dedad-4ea5-6e68-d59e-abac98e1c353","Address":"10.49.11.101:4647"}}
2021-08-12 01:11:02.672	ip-10-49-15-7
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.672838Z","peer":{"Suffrage":0,"ID":"f720e37e-203b-0d55-af95-19783d1663b5","Address":"10.49.13.249:4647"}}
2021-08-12 01:11:02.666	ip-10-49-15-7
{"@level":"info","@message":"cluster leadership acquired","@module":"nomad","@timestamp":"2021-08-12T01:11:02.666016Z"}
2021-08-12 01:11:02.665	ip-10-49-15-7
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.665901Z","peer":"674dedad-4ea5-6e68-d59e-abac98e1c353"}
2021-08-12 01:11:02.660	ip-10-49-15-7
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.660677Z","peer":"f720e37e-203b-0d55-af95-19783d1663b5"}
2021-08-12 01:11:02.659	ip-10-49-15-7
{"@level":"info","@message":"entering leader state","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.659669Z","leader":{}}
2021-08-12 01:11:02.658	ip-10-49-15-7
{"@level":"info","@message":"election won","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.658584Z","tally":2}
2021-08-12 01:11:02.642	ip-10-49-11-101
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-12T01:11:02.642805Z","error":"rpc error: eval broker disabled"}
2021-08-12 01:11:02.642	ip-10-49-11-101
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-12T01:11:02.642743Z","error":"rpc error: eval broker disabled"}
2021-08-12 01:11:02.641	ip-10-49-13-249
{"@level":"info","@message":"cluster leadership lost","@module":"nomad","@timestamp":"2021-08-12T01:11:02.641358Z"}
2021-08-12 01:11:02.639	ip-10-49-13-249
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-12T01:11:02.639323Z","error":"eval broker disabled"}
2021-08-12 01:11:02.637	ip-10-49-13-249
{"@level":"info","@message":"entering follower state","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.637602Z","follower":{},"leader":""}
2021-08-12 01:11:02.637	ip-10-49-13-249
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.637694Z","peer":{"Suffrage":0,"ID":"674dedad-4ea5-6e68-d59e-abac98e1c353","Address":"10.49.11.101:4647"}}
2021-08-12 01:11:02.637	ip-10-49-11-101
{"@level":"warn","@message":"rejecting vote request since we have a leader","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.637706Z","from":"10.49.15.7:4647","leader":"10.49.13.249:4647"}
2021-08-12 01:11:02.636	ip-10-49-13-249
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.636849Z","peer":{"Suffrage":0,"ID":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916","Address":"10.49.15.7:4647"}}
2021-08-12 01:11:02.635	ip-10-49-13-249
{"@level":"error","@message":"peer has newer term, stopping replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.635523Z","peer":{"Suffrage":0,"ID":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916","Address":"10.49.15.7:4647"}}
2021-08-12 01:11:02.632	ip-10-49-15-7
{"@level":"info","@message":"entering candidate state","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.632505Z","node":{},"term":194}
2021-08-12 01:11:02.612	ip-10-49-15-7
{"@level":"warn","@message":"heartbeat timeout reached, starting election","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.612948Z","last-leader":"10.49.13.249:4647"}
2021-08-12 01:11:02.330	ip-10-49-13-249
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.330176Z","server-id":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916","time":1454797605}
2021-08-12 01:11:01.835	ip-10-49-13-249
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:01.835577Z","server-id":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916","time":960200871}
2021-08-12 01:11:01.375	ip-10-49-13-249
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:01.375827Z","server-id":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916","time":500448920}

...

2021-08-02 23:33:58.492	ip-10-64-7-115
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.492920Z","peer":{"Suffrage":0,"ID":"242db6b7-88b3-0f8e-effe-6fb5777b3b3e","Address":"10.64.9.56:4647"}}
2021-08-02 23:33:58.475	ip-10-64-7-115
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.475325Z","peer":{"Suffrage":0,"ID":"f0d873fc-ec51-2d9c-558c-f387ebbdafb4","Address":"10.64.10.90:4647"}}
2021-08-02 23:33:58.464	ip-10-49-15-7
{"@level":"info","@message":"cluster leadership lost","@module":"nomad","@timestamp":"2021-08-02T23:33:58.464405Z"}
2021-08-02 23:33:58.463	ip-10-49-13-249
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.463595Z","peer":{"Suffrage":0,"ID":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916","Address":"10.49.15.7:4647"}}
2021-08-02 23:33:58.461	ip-10-49-15-7
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-02T23:33:58.461114Z","error":"eval broker disabled"}
2021-08-02 23:33:58.458	ip-10-49-15-7
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.458961Z","peer":{"Suffrage":0,"ID":"674dedad-4ea5-6e68-d59e-abac98e1c353","Address":"10.49.11.101:4647"}}
2021-08-02 23:33:58.456	ip-10-49-15-7
{"@level":"info","@message":"entering follower state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.456486Z","follower":{},"leader":""}
2021-08-02 23:33:58.452	ip-10-49-15-7
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.452808Z","peer":{"Suffrage":0,"ID":"f720e37e-203b-0d55-af95-19783d1663b5","Address":"10.49.13.249:4647"}}
2021-08-02 23:33:58.448	ip-10-49-15-7
{"@level":"error","@message":"peer has newer term, stopping replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.448268Z","peer":{"Suffrage":0,"ID":"674dedad-4ea5-6e68-d59e-abac98e1c353","Address":"10.49.11.101:4647"}}
2021-08-02 23:33:58.447	ip-10-49-15-7
{"@level":"warn","@message":"failed to contact quorum of nodes, stepping down","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.447808Z"}
2021-08-02 23:33:58.443	ip-10-49-15-7
{"@level":"error","@message":"peer has newer term, stopping replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.443291Z","peer":{"Suffrage":0,"ID":"f720e37e-203b-0d55-af95-19783d1663b5","Address":"10.49.13.249:4647"}}
2021-08-02 23:33:58.430	ip-10-49-13-249
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.430479Z","peer":{"Suffrage":0,"ID":"674dedad-4ea5-6e68-d59e-abac98e1c353","Address":"10.49.11.101:4647"}}
2021-08-02 23:33:58.430	ip-10-49-13-249
{"@level":"info","@message":"cluster leadership acquired","@module":"nomad","@timestamp":"2021-08-02T23:33:58.430361Z"}
2021-08-02 23:33:58.427	ip-10-49-13-249
{"@level":"info","@message":"election won","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.427525Z","tally":2}
2021-08-02 23:33:58.427	ip-10-49-11-101
{"@level":"info","@message":"entering follower state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.427727Z","follower":{},"leader":""}
2021-08-02 23:33:58.427	ip-10-49-13-249
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.427655Z","peer":"674dedad-4ea5-6e68-d59e-abac98e1c353"}
2021-08-02 23:33:58.427	ip-10-49-13-249
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.427631Z","peer":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916"}
2021-08-02 23:33:58.427	ip-10-49-13-249
{"@level":"info","@message":"entering leader state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.427573Z","leader":{}}
2021-08-02 23:33:58.413	ip-10-49-13-249
{"@level":"info","@message":"entering candidate state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.413532Z","node":{},"term":193}
2021-08-02 23:33:58.412	ip-10-49-13-249
{"@level":"warn","@message":"Election timeout reached, restarting election","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.412816Z"}
2021-08-02 23:33:58.377	ip-10-64-9-56
{"@level":"info","@message":"cluster leadership lost","@module":"nomad","@timestamp":"2021-08-02T23:33:58.377411Z"}
2021-08-02 23:33:58.371	ip-10-64-7-115
{"@level":"info","@message":"cluster leadership acquired","@module":"nomad","@timestamp":"2021-08-02T23:33:58.371798Z"}
2021-08-02 23:33:58.340	ip-10-64-7-115
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.340419Z","peer":"242db6b7-88b3-0f8e-effe-6fb5777b3b3e"}
2021-08-02 23:33:58.339	ip-10-64-7-115
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.339733Z","peer":"f0d873fc-ec51-2d9c-558c-f387ebbdafb4"}
2021-08-02 23:33:58.338	ip-10-64-7-115
{"@level":"info","@message":"entering leader state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.338915Z","leader":{}}
2021-08-02 23:33:58.338	ip-10-64-7-115
{"@level":"info","@message":"election won","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.338866Z","tally":2}
2021-08-02 23:33:58.326	ip-10-64-10-90
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-02T23:33:58.326320Z","error":"rpc error: eval broker disabled"}
2021-08-02 23:33:58.326	ip-10-64-10-90
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-02T23:33:58.326320Z","error":"rpc error: eval broker disabled"}
2021-08-02 23:33:58.316	ip-10-64-9-56
{"@level":"info","@message":"entering follower state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.316078Z","follower":{},"leader":""}
2021-08-02 23:33:58.316	ip-10-64-9-56
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-02T23:33:58.316792Z","error":"eval broker disabled"}
2021-08-02 23:33:58.316	ip-10-64-9-56
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.316237Z","peer":{"Suffrage":0,"ID":"f0d873fc-ec51-2d9c-558c-f387ebbdafb4","Address":"10.64.10.90:4647"}}
2021-08-02 23:33:58.315	ip-10-64-9-56
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.315248Z","peer":{"Suffrage":0,"ID":"3b4a35f8-e0d4-5375-1f33-a8d4ef09bacb","Address":"10.64.7.115:4647"}}
2021-08-02 23:33:58.315	ip-10-64-10-90
{"@level":"warn","@message":"rejecting vote request since we have a leader","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.315772Z","from":"10.64.7.115:4647","leader":"10.64.9.56:4647"}
2021-08-02 23:33:58.314	ip-10-64-9-56
{"@level":"error","@message":"peer has newer term, stopping replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.314168Z","peer":{"Suffrage":0,"ID":"3b4a35f8-e0d4-5375-1f33-a8d4ef09bacb","Address":"10.64.7.115:4647"}}
2021-08-02 23:33:58.305	ip-10-64-7-115
{"@level":"info","@message":"entering candidate state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.305108Z","node":{},"term":11}
2021-08-02 23:33:58.303	ip-10-64-7-115
{"@level":"warn","@message":"heartbeat timeout reached, starting election","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.303676Z","last-leader":"10.64.9.56:4647"}
2021-08-02 23:33:58.253	ip-10-64-9-56
{"@level":"warn","@message":"failed retrieving server health","@module":"nomad.stats_fetcher","@timestamp":"2021-08-02T23:33:58.253343Z","error":"context deadline exceeded","server":"ip-10-64-7-115.eu-west-1"}
2021-08-02 23:33:57.787	ip-10-64-9-56
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:57.787351Z","server-id":"3b4a35f8-e0d4-5375-1f33-a8d4ef09bacb","time":1477115952}
2021-08-02 23:33:57.623	ip-10-49-13-249
{"@level":"info","@message":"duplicate requestVote for same term","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:57.623105Z","term":192}
2021-08-02 23:33:57.615	ip-10-49-11-101
{"@level":"info","@message":"entering candidate state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:57.615636Z","node":{},"term":192}
2021-08-02 23:33:57.615	ip-10-49-11-101
{"@level":"warn","@message":"heartbeat timeout reached, starting election","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:57.615474Z","last-leader":"10.49.15.7:4647"}
2021-08-02 23:33:57.308	ip-10-64-9-56
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:57.308214Z","server-id":"3b4a35f8-e0d4-5375-1f33-a8d4ef09bacb","time":997992197}
2021-08-02 23:33:56.997	ip-10-49-11-101
{"@level":"warn","@message":"rejecting vote request since we have a leader","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:56.997567Z","from":"10.49.13.249:4647","leader":"10.49.15.7:4647"}
2021-08-02 23:33:56.990	ip-10-49-13-249
{"@level":"info","@message":"entering candidate state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:56.990410Z","node":{},"term":192}
2021-08-02 23:33:56.989	ip-10-49-13-249
{"@level":"warn","@message":"heartbeat timeout reached, starting election","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:56.989574Z","last-leader":"10.49.15.7:4647"}
2021-08-02 23:33:56.810	ip-10-64-9-56
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:56.810600Z","server-id":"3b4a35f8-e0d4-5375-1f33-a8d4ef09bacb","time":500140482}
@lgfa29
Contributor

lgfa29 commented Aug 19, 2021

Thank you for the report @madsholden.

Just to highlight one point that you mentioned:

When checking the Nomad server logs I noticed both times when it started a new job, the servers were in the middle of a server election.

This seems like an important clue and would explain why you see this intermittently.

@olanmills

olanmills commented Sep 3, 2021

I just want to say that this is happening at my company too. I ran into this issue with a job scheduled to run every half hour with prohibit_overlap set to true. The overlap prevention definitely works sometimes, but if one of the job instances happens to take multiple hours, then it's pretty much guaranteed that Nomad will fail to prevent the overlap on at least one of the recurring intervals. I had six instances of the job running at one point. When I looked into it further, I found that people at my company have been encountering this issue sporadically over the past three years. It appears the main reason it doesn't affect us more often is that the large majority of our jobs never come close to taking longer than the scheduled interval.

This seems like a high-priority flaw. prohibit_overlap is unreliable, so we have to build our own mechanism to handle the overlap.
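
To illustrate what I mean by building our own mechanism, here is a rough, hypothetical sketch (the command path is made up, and this is not part of the original report): the task wraps its real work in flock so that a second dispatch landing on the same node exits immediately if a previous run still holds the lock.

group "data-remover" {
  task "data-remover" {
    driver = "raw_exec"

    config {
      command = "/bin/sh"
      # -n makes flock fail fast instead of waiting, so an overlapping
      # dispatch simply exits; this only guards runs placed on the same node.
      args = [
        "-c",
        "exec flock -n /tmp/data-remover.lock /usr/local/bin/data-remover"
      ]
    }
  }
}

A cluster-wide guard would need a distributed lock (for example via Consul), but even this node-local check shows the kind of workaround we end up having to build.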

@olanmills

For reference, this was the schedule I was using:
cron: '*/30 * * * *'

Also, madsholden said they scheduled their job to run every half hour, but based on the cron expression they shared in their post, it looks like it is scheduled to run once per hour, at minute 30.
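
For comparison, the two schedules look roughly like this (hypothetical stanzas built from the expressions quoted above):

# Runs every 30 minutes (the schedule used in my case)
periodic {
  cron             = "*/30 * * * *"
  prohibit_overlap = true
}

# Runs once per hour, at minute 30 (the schedule in the original report)
periodic {
  cron             = "30 * * * *"
  prohibit_overlap = true
}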

@dnrce

dnrce commented Oct 8, 2021

Same symptom here (Nomad 1.0.4) -- a duplicate instance of a non-overlapping periodic job got started during a leadership election. Not on the cron schedule, either.

  • Cron is @daily
  • Duplicate job start time is 2021-10-08T08:51:27Z
  • Server logs from that time (3-server cluster):
    • 10.0.0.1:
      2021-10-08T08:51:26.568Z [WARN]  nomad.raft: failed to contact: server-id=10.0.0.3:4647 time=1.174670205s
      2021-10-08T08:51:26.569Z [WARN]  nomad.raft: failed to contact: server-id=10.0.0.2:4647 time=1.198394307s
      2021-10-08T08:51:26.569Z [WARN]  nomad.raft: failed to contact quorum of nodes, stepping down
      2021-10-08T08:51:26.570Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.0.0.1:4647 [Follower]" leader=
      2021-10-08T08:51:26.570Z [INFO]  nomad.raft: aborting pipeline replication: peer="{Voter 10.0.0.3:4647 10.0.0.3:4647}"
      2021-10-08T08:51:26.570Z [INFO]  nomad.raft: aborting pipeline replication: peer="{Voter 10.0.0.2:4647 10.0.0.2:4647}"
      2021-10-08T08:51:26.571Z [INFO]  nomad: cluster leadership lost
      2021-10-08T08:51:26.571Z [ERROR] worker: failed to dequeue evaluation: error="eval broker disabled"
      
    • 10.0.0.2:
      2021-10-08T08:51:26.565Z [ERROR] raft-net: failed to decode incoming command: error="read tcp 10.0.0.2:4647->10.0.0.1:55414: read:
      2021-10-08T08:51:27.671Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=10.0.0.1:4647
      2021-10-08T08:51:27.671Z [INFO]  nomad.raft: entering candidate state: node="Node at 10.0.0.2:4647 [Candidate]" term=203
      2021-10-08T08:51:27.680Z [INFO]  nomad.raft: election won: tally=2
      2021-10-08T08:51:27.680Z [INFO]  nomad.raft: entering leader state: leader="Node at 10.0.0.2:4647 [Leader]"
      2021-10-08T08:51:27.680Z [INFO]  nomad.raft: added peer, starting replication: peer=10.0.0.3:4647
      2021-10-08T08:51:27.680Z [INFO]  nomad.raft: added peer, starting replication: peer=10.0.0.1:4647
      2021-10-08T08:51:27.681Z [INFO]  nomad: cluster leadership acquired
      2021-10-08T08:51:27.681Z [INFO]  nomad.raft: pipelining replication: peer="{Voter 10.0.0.1:4647 10.0.0.1:4647}"
      2021-10-08T08:51:27.683Z [INFO]  nomad.raft: pipelining replication: peer="{Voter 10.0.0.3:4647 10.0.0.3:4647}"
      
    • 10.0.0.3:
      2021-10-08T08:51:26.565Z [ERROR] raft-net: failed to decode incoming command: error="read tcp 10.0.0.3:4647->10.0.0.1:39134: read:
      2021-10-08T08:51:27.673Z [WARN]  nomad.raft: rejecting vote request since we have a leader: from=10.0.0.2:4647 leader=10.0.0.1:464
      

tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Nov 9, 2021
@kpweiler

I'm seeing the same problem on Nomad v1.1.0: a duplicate periodic process was launched during a leader election.

@mikenomitch
Contributor

mikenomitch commented Feb 24, 2022

I've heard of a similar issue being resolved for some users by removing time_zone from the periodic stanza. If daylight saving time is a factor, this might be a hassle to manage, but if you are running into this issue currently and have time_zone set, I would try removing it as a workaround. (That said, I don't see it in the job file posted by the original reporter.)

If anybody tries this, please let me know if it alleviates the issue. This doesn't fix the core issue of course, but might be a serviceable workaround until we are able to fix the root cause.
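
For anyone unsure what this refers to, the workaround applies to a stanza along these lines (hypothetical example; the time zone value is made up). Dropping the time_zone line falls back to the default of UTC:

periodic {
  cron             = "30 * * * *"
  prohibit_overlap = true

  # Removing this line is the suggested workaround; without it,
  # the cron expression is evaluated in UTC (the default).
  time_zone        = "America/Los_Angeles"
}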

@cread

cread commented Jun 16, 2022

This bug is still causing us problems from time to time. Is anyone actually working on trying to fix it?

@mikenomitch
Contributor

Hey @cread, sorry about this one dragging out. This is on our radar, but unfortunately not something we can prioritize right now. Between 1.4 commitments and some other pressing bugs, we're more or less at capacity for at least several weeks.

I'll let you know if we pick it up. Or if anybody wants to take a crack at a fix, let me know and I can provide some rough guidance.

@kpweiler

kpweiler commented Jul 8, 2022

FWIW, we're also seeing this issue with time_zone unset (which in the periodic stanza defaults to UTC).

louievandyke added the hcc/cst Admin - internal label Jul 13, 2022
@olanmills

How is this not a higher priority issue? It seems like one of the main features of Nomad is to orchestrate jobs for you, yet it's not reliable and you have to orchestrate it yourself anyway. At the very least, you need to update your documentation so that people know that prohibit_overlap doesn't actually work. You're still advertising the feature in your documentation here:
https://www.nomadproject.io/docs/job-specification/periodic

hashicorp locked as too heated and limited conversation to collaborators Sep 21, 2022
@DerekStrickland
Contributor

Referencing #14505 as a possible duplicate issue.

mikenomitch added the theme/batch Issues related to batch jobs and scheduling label Dec 6, 2022
jrasell assigned jrasell and unassigned lgfa29 Mar 10, 2023
@jrasell
Member

jrasell commented Mar 21, 2023

Hi everyone, I am currently looking into this and, as a first step, have worked on a reliable local reproduction. I am using the Vagrantfile in the repository root for isolation and am building Nomad from the current main source at a633b79fb535679ce7776a0d65c88c788ba8ae92.

Start a Nomad cluster using the cluster.sh script, ensuring all processes run as root, via sudo ./dev/cluster/cluster.sh.

Register the jobspec below via nomad job run 11052.nomad.hcl and then trigger a forced periodic instance using the nomad job periodic force test command. Wait for the allocation to be successfully running before moving forward.

job "test" {
  type = "batch"
  periodic {
    cron             = "* * * * * *"
    prohibit_overlap = true
  }
  group "test" {
    task "test" {
      driver = "raw_exec"
      config {
        command = "sleep"
        args    = ["30000s"]
      }
    }
  }
}

Discover which Nomad server is the leader using nomad server members, then find the process ID for that agent by running sudo ps -aef | grep server<NUM>, where the number corresponds to the number within the leader's server name. Kill the leader process using sudo kill -9 <PID> and wait for leadership to transition.

Upon transition, there will now be two instances of the periodic job running, despite the job including the prohibit_overlap parameter.

ID                        Type            Priority  Status   Submit Date
test                      batch/periodic  50        running  2023-03-21T08:38:15Z
test/periodic-1679387898  batch           50        running  2023-03-21T08:38:18Z
test/periodic-1679388004  batch           50        running  2023-03-21T08:40:04Z

I will be working through this scenario to try to identify what behaviour is causing this, and will post any relevant findings in a follow-up comment. Thanks for everyone's patience.

Juanadelacuesta pushed a commit that referenced this issue Mar 21, 2023
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true

Fixes #11052
When restoring the periodic dispatcher, all periodic jobs are forced without checking for previous children.
Juanadelacuesta added a commit that referenced this issue Mar 21, 2023
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true

Fixes #11052
When restoring the periodic dispatcher, all periodic jobs are forced without checking for previous children.
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Mar 27, 2023
Juanadelacuesta added a commit that referenced this issue Mar 28, 2023
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true (#16583)

* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring the periodic dispatcher, all periodic jobs are forced without checking for previous children.

* style: refactor force run function

* fix: remove defer and inline unlock for speed optimization

* Update nomad/leader.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* style: refactor tests to use must

* fix: move back from defer to calling unlock before returning.

createEval can't be called with the lock on

* style: refactor test to use must

* added new entry to changelog and update comments

---------

Co-authored-by: James Rasell <jrasell@hashicorp.com>
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
jrasell added a commit that referenced this issue Mar 28, 2023
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true (#16583)

jrasell added a commit that referenced this issue Mar 28, 2023
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true (#16583)