
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true #11052

Closed
madsholden opened this issue Aug 13, 2021 · 12 comments · Fixed by #16681
@madsholden

Nomad version

Nomad v1.1.0 (2678c36)

Operating system and Environment details

Ubuntu 20.04.2 on EC2

Issue

I have a periodic job set to run every half hour, with prohibit_overlap set to true. It usually finishes much faster, but this time it has been running longer, which is expected. However, Nomad has started three instances of this job. Checking the Nomad server logs, I noticed that both times it started a new instance, the servers were in the middle of a leader election.

Reproduction steps

I can't reproduce it; a new instance usually isn't started while one is already running.

Expected Result

Nomad should never run two instances of the same periodic job at the same time.

Actual Result

Three instances of the job are running.

Job file (if appropriate)

Only relevant parts included. Please let me know if you want more of the job file.

job "data-remover" {
  type = "batch"

  periodic {
    cron = "30 * * * *"
    prohibit_overlap = true
  }

  group "data-remover" {
    task "data-remover" {
      driver = "docker"

Nomad Server logs (if appropriate)

From the Nomad UI:
Aug 03, '21 01:33:58 +0200 | Received | Task received by client
Aug 12, '21 03:11:02 +0200 | Received | Task received by client

Server logs around those times:

2021-08-12 01:11:02.673	ip-10-49-15-7
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.673964Z","peer":{"Suffrage":0,"ID":"674dedad-4ea5-6e68-d59e-abac98e1c353","Address":"10.49.11.101:4647"}}
2021-08-12 01:11:02.672	ip-10-49-15-7
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.672838Z","peer":{"Suffrage":0,"ID":"f720e37e-203b-0d55-af95-19783d1663b5","Address":"10.49.13.249:4647"}}
2021-08-12 01:11:02.666	ip-10-49-15-7
{"@level":"info","@message":"cluster leadership acquired","@module":"nomad","@timestamp":"2021-08-12T01:11:02.666016Z"}
2021-08-12 01:11:02.665	ip-10-49-15-7
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.665901Z","peer":"674dedad-4ea5-6e68-d59e-abac98e1c353"}
2021-08-12 01:11:02.660	ip-10-49-15-7
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.660677Z","peer":"f720e37e-203b-0d55-af95-19783d1663b5"}
2021-08-12 01:11:02.659	ip-10-49-15-7
{"@level":"info","@message":"entering leader state","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.659669Z","leader":{}}
2021-08-12 01:11:02.658	ip-10-49-15-7
{"@level":"info","@message":"election won","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.658584Z","tally":2}
2021-08-12 01:11:02.642	ip-10-49-11-101
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-12T01:11:02.642805Z","error":"rpc error: eval broker disabled"}
2021-08-12 01:11:02.642	ip-10-49-11-101
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-12T01:11:02.642743Z","error":"rpc error: eval broker disabled"}
2021-08-12 01:11:02.641	ip-10-49-13-249
{"@level":"info","@message":"cluster leadership lost","@module":"nomad","@timestamp":"2021-08-12T01:11:02.641358Z"}
2021-08-12 01:11:02.639	ip-10-49-13-249
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-12T01:11:02.639323Z","error":"eval broker disabled"}
2021-08-12 01:11:02.637	ip-10-49-13-249
{"@level":"info","@message":"entering follower state","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.637602Z","follower":{},"leader":""}
2021-08-12 01:11:02.637	ip-10-49-13-249
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.637694Z","peer":{"Suffrage":0,"ID":"674dedad-4ea5-6e68-d59e-abac98e1c353","Address":"10.49.11.101:4647"}}
2021-08-12 01:11:02.637	ip-10-49-11-101
{"@level":"warn","@message":"rejecting vote request since we have a leader","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.637706Z","from":"10.49.15.7:4647","leader":"10.49.13.249:4647"}
2021-08-12 01:11:02.636	ip-10-49-13-249
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.636849Z","peer":{"Suffrage":0,"ID":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916","Address":"10.49.15.7:4647"}}
2021-08-12 01:11:02.635	ip-10-49-13-249
{"@level":"error","@message":"peer has newer term, stopping replication","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.635523Z","peer":{"Suffrage":0,"ID":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916","Address":"10.49.15.7:4647"}}
2021-08-12 01:11:02.632	ip-10-49-15-7
{"@level":"info","@message":"entering candidate state","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.632505Z","node":{},"term":194}
2021-08-12 01:11:02.612	ip-10-49-15-7
{"@level":"warn","@message":"heartbeat timeout reached, starting election","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.612948Z","last-leader":"10.49.13.249:4647"}
2021-08-12 01:11:02.330	ip-10-49-13-249
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:02.330176Z","server-id":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916","time":1454797605}
2021-08-12 01:11:01.835	ip-10-49-13-249
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:01.835577Z","server-id":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916","time":960200871}
2021-08-12 01:11:01.375	ip-10-49-13-249
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2021-08-12T01:11:01.375827Z","server-id":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916","time":500448920}

...

2021-08-02 23:33:58.492	ip-10-64-7-115
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.492920Z","peer":{"Suffrage":0,"ID":"242db6b7-88b3-0f8e-effe-6fb5777b3b3e","Address":"10.64.9.56:4647"}}
2021-08-02 23:33:58.475	ip-10-64-7-115
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.475325Z","peer":{"Suffrage":0,"ID":"f0d873fc-ec51-2d9c-558c-f387ebbdafb4","Address":"10.64.10.90:4647"}}
2021-08-02 23:33:58.464	ip-10-49-15-7
{"@level":"info","@message":"cluster leadership lost","@module":"nomad","@timestamp":"2021-08-02T23:33:58.464405Z"}
2021-08-02 23:33:58.463	ip-10-49-13-249
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.463595Z","peer":{"Suffrage":0,"ID":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916","Address":"10.49.15.7:4647"}}
2021-08-02 23:33:58.461	ip-10-49-15-7
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-02T23:33:58.461114Z","error":"eval broker disabled"}
2021-08-02 23:33:58.458	ip-10-49-15-7
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.458961Z","peer":{"Suffrage":0,"ID":"674dedad-4ea5-6e68-d59e-abac98e1c353","Address":"10.49.11.101:4647"}}
2021-08-02 23:33:58.456	ip-10-49-15-7
{"@level":"info","@message":"entering follower state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.456486Z","follower":{},"leader":""}
2021-08-02 23:33:58.452	ip-10-49-15-7
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.452808Z","peer":{"Suffrage":0,"ID":"f720e37e-203b-0d55-af95-19783d1663b5","Address":"10.49.13.249:4647"}}
2021-08-02 23:33:58.448	ip-10-49-15-7
{"@level":"error","@message":"peer has newer term, stopping replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.448268Z","peer":{"Suffrage":0,"ID":"674dedad-4ea5-6e68-d59e-abac98e1c353","Address":"10.49.11.101:4647"}}
2021-08-02 23:33:58.447	ip-10-49-15-7
{"@level":"warn","@message":"failed to contact quorum of nodes, stepping down","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.447808Z"}
2021-08-02 23:33:58.443	ip-10-49-15-7
{"@level":"error","@message":"peer has newer term, stopping replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.443291Z","peer":{"Suffrage":0,"ID":"f720e37e-203b-0d55-af95-19783d1663b5","Address":"10.49.13.249:4647"}}
2021-08-02 23:33:58.430	ip-10-49-13-249
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.430479Z","peer":{"Suffrage":0,"ID":"674dedad-4ea5-6e68-d59e-abac98e1c353","Address":"10.49.11.101:4647"}}
2021-08-02 23:33:58.430	ip-10-49-13-249
{"@level":"info","@message":"cluster leadership acquired","@module":"nomad","@timestamp":"2021-08-02T23:33:58.430361Z"}
2021-08-02 23:33:58.427	ip-10-49-13-249
{"@level":"info","@message":"election won","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.427525Z","tally":2}
2021-08-02 23:33:58.427	ip-10-49-11-101
{"@level":"info","@message":"entering follower state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.427727Z","follower":{},"leader":""}
2021-08-02 23:33:58.427	ip-10-49-13-249
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.427655Z","peer":"674dedad-4ea5-6e68-d59e-abac98e1c353"}
2021-08-02 23:33:58.427	ip-10-49-13-249
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.427631Z","peer":"1cdf4bda-30fd-61fb-db47-0d8fa00e3916"}
2021-08-02 23:33:58.427	ip-10-49-13-249
{"@level":"info","@message":"entering leader state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.427573Z","leader":{}}
2021-08-02 23:33:58.413	ip-10-49-13-249
{"@level":"info","@message":"entering candidate state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.413532Z","node":{},"term":193}
2021-08-02 23:33:58.412	ip-10-49-13-249
{"@level":"warn","@message":"Election timeout reached, restarting election","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.412816Z"}
2021-08-02 23:33:58.377	ip-10-64-9-56
{"@level":"info","@message":"cluster leadership lost","@module":"nomad","@timestamp":"2021-08-02T23:33:58.377411Z"}
2021-08-02 23:33:58.371	ip-10-64-7-115
{"@level":"info","@message":"cluster leadership acquired","@module":"nomad","@timestamp":"2021-08-02T23:33:58.371798Z"}
2021-08-02 23:33:58.340	ip-10-64-7-115
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.340419Z","peer":"242db6b7-88b3-0f8e-effe-6fb5777b3b3e"}
2021-08-02 23:33:58.339	ip-10-64-7-115
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.339733Z","peer":"f0d873fc-ec51-2d9c-558c-f387ebbdafb4"}
2021-08-02 23:33:58.338	ip-10-64-7-115
{"@level":"info","@message":"entering leader state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.338915Z","leader":{}}
2021-08-02 23:33:58.338	ip-10-64-7-115
{"@level":"info","@message":"election won","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.338866Z","tally":2}
2021-08-02 23:33:58.326	ip-10-64-10-90
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-02T23:33:58.326320Z","error":"rpc error: eval broker disabled"}
2021-08-02 23:33:58.326	ip-10-64-10-90
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-02T23:33:58.326320Z","error":"rpc error: eval broker disabled"}
2021-08-02 23:33:58.316	ip-10-64-9-56
{"@level":"info","@message":"entering follower state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.316078Z","follower":{},"leader":""}
2021-08-02 23:33:58.316	ip-10-64-9-56
{"@level":"error","@message":"failed to dequeue evaluation","@module":"worker","@timestamp":"2021-08-02T23:33:58.316792Z","error":"eval broker disabled"}
2021-08-02 23:33:58.316	ip-10-64-9-56
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.316237Z","peer":{"Suffrage":0,"ID":"f0d873fc-ec51-2d9c-558c-f387ebbdafb4","Address":"10.64.10.90:4647"}}
2021-08-02 23:33:58.315	ip-10-64-9-56
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.315248Z","peer":{"Suffrage":0,"ID":"3b4a35f8-e0d4-5375-1f33-a8d4ef09bacb","Address":"10.64.7.115:4647"}}
2021-08-02 23:33:58.315	ip-10-64-10-90
{"@level":"warn","@message":"rejecting vote request since we have a leader","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.315772Z","from":"10.64.7.115:4647","leader":"10.64.9.56:4647"}
2021-08-02 23:33:58.314	ip-10-64-9-56
{"@level":"error","@message":"peer has newer term, stopping replication","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.314168Z","peer":{"Suffrage":0,"ID":"3b4a35f8-e0d4-5375-1f33-a8d4ef09bacb","Address":"10.64.7.115:4647"}}
2021-08-02 23:33:58.305	ip-10-64-7-115
{"@level":"info","@message":"entering candidate state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.305108Z","node":{},"term":11}
2021-08-02 23:33:58.303	ip-10-64-7-115
{"@level":"warn","@message":"heartbeat timeout reached, starting election","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:58.303676Z","last-leader":"10.64.9.56:4647"}
2021-08-02 23:33:58.253	ip-10-64-9-56
{"@level":"warn","@message":"failed retrieving server health","@module":"nomad.stats_fetcher","@timestamp":"2021-08-02T23:33:58.253343Z","error":"context deadline exceeded","server":"ip-10-64-7-115.eu-west-1"}
2021-08-02 23:33:57.787	ip-10-64-9-56
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:57.787351Z","server-id":"3b4a35f8-e0d4-5375-1f33-a8d4ef09bacb","time":1477115952}
2021-08-02 23:33:57.623	ip-10-49-13-249
{"@level":"info","@message":"duplicate requestVote for same term","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:57.623105Z","term":192}
2021-08-02 23:33:57.615	ip-10-49-11-101
{"@level":"info","@message":"entering candidate state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:57.615636Z","node":{},"term":192}
2021-08-02 23:33:57.615	ip-10-49-11-101
{"@level":"warn","@message":"heartbeat timeout reached, starting election","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:57.615474Z","last-leader":"10.49.15.7:4647"}
2021-08-02 23:33:57.308	ip-10-64-9-56
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:57.308214Z","server-id":"3b4a35f8-e0d4-5375-1f33-a8d4ef09bacb","time":997992197}
2021-08-02 23:33:56.997	ip-10-49-11-101
{"@level":"warn","@message":"rejecting vote request since we have a leader","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:56.997567Z","from":"10.49.13.249:4647","leader":"10.49.15.7:4647"}
2021-08-02 23:33:56.990	ip-10-49-13-249
{"@level":"info","@message":"entering candidate state","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:56.990410Z","node":{},"term":192}
2021-08-02 23:33:56.989	ip-10-49-13-249
{"@level":"warn","@message":"heartbeat timeout reached, starting election","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:56.989574Z","last-leader":"10.49.15.7:4647"}
2021-08-02 23:33:56.810	ip-10-64-9-56
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2021-08-02T23:33:56.810600Z","server-id":"3b4a35f8-e0d4-5375-1f33-a8d4ef09bacb","time":500140482}
@lgfa29
Contributor

lgfa29 commented Aug 19, 2021

Thank you for the report @madsholden.

Just to highlight one point that you mentioned:

When checking the Nomad server logs I noticed both times when it started a new job, the servers were in the middle of a server election.

This seems like an important clue and would explain why you see this intermittently.

@olanmills

olanmills commented Sep 3, 2021

I just want to say that this is happening at my company too. I ran into this issue with a job scheduled to run every half hour with prohibit_overlap set to true. The overlap prevention definitely works sometimes, but if one of the job instances happens to take multiple hours, then it's pretty much guaranteed that Nomad will fail to prevent the overlap on at least one of the recurring intervals. I had six instances of the job running at one point. When I looked into it further, I found that people at my company have been encountering this issue sporadically over the past three years. It appears the main reason it doesn't affect us more often is that the large majority of our jobs never come close to taking longer than the scheduled interval.

This seems like a high-priority flaw. prohibit_overlap is unreliable, so we have to build our own mechanism to handle the overlap.
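
To illustrate what I mean by building our own mechanism, here is a rough, hypothetical sketch (the command path is made up, and this is not part of the original report): the task wraps its real work in flock so that a second dispatch landing on the same node exits immediately if a previous run still holds the lock.

group "data-remover" {
  task "data-remover" {
    driver = "raw_exec"

    config {
      command = "/bin/sh"
      # -n makes flock fail fast instead of waiting, so an overlapping
      # dispatch simply exits; this only guards runs placed on the same node.
      args = [
        "-c",
        "exec flock -n /tmp/data-remover.lock /usr/local/bin/data-remover"
      ]
    }
  }
}

A cluster-wide guard would need a distributed lock (for example via Consul), but even this node-local check shows the kind of workaround we end up having to build.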

@olanmills

For reference, this was the schedule I was using:
cron: '*/30 * * * *'

Also, madsholden said they scheduled their job to run every half hour, but based on the cron expression they shared in their post, it looks like it is scheduled to run once per hour, at minute 30.
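
For comparison, the two schedules look roughly like this (hypothetical stanzas built from the expressions quoted above):

# Runs every 30 minutes (the schedule used in my case)
periodic {
  cron             = "*/30 * * * *"
  prohibit_overlap = true
}

# Runs once per hour, at minute 30 (the schedule in the original report)
periodic {
  cron             = "30 * * * *"
  prohibit_overlap = true
}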

@dnrce

dnrce commented Oct 8, 2021

Same symptom here (Nomad 1.0.4) -- a duplicate instance of a non-overlapping periodic job got started during a leadership election. Not on the cron schedule, either.

  • Cron is @daily
  • Duplicate job start time is 2021-10-08T08:51:27Z
  • Server logs from that time (3-server cluster):
    • 10.0.0.1:
      2021-10-08T08:51:26.568Z [WARN]  nomad.raft: failed to contact: server-id=10.0.0.3:4647 time=1.174670205s
      2021-10-08T08:51:26.569Z [WARN]  nomad.raft: failed to contact: server-id=10.0.0.2:4647 time=1.198394307s
      2021-10-08T08:51:26.569Z [WARN]  nomad.raft: failed to contact quorum of nodes, stepping down
      2021-10-08T08:51:26.570Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.0.0.1:4647 [Follower]" leader=
      2021-10-08T08:51:26.570Z [INFO]  nomad.raft: aborting pipeline replication: peer="{Voter 10.0.0.3:4647 10.0.0.3:4647}"
      2021-10-08T08:51:26.570Z [INFO]  nomad.raft: aborting pipeline replication: peer="{Voter 10.0.0.2:4647 10.0.0.2:4647}"
      2021-10-08T08:51:26.571Z [INFO]  nomad: cluster leadership lost
      2021-10-08T08:51:26.571Z [ERROR] worker: failed to dequeue evaluation: error="eval broker disabled"
      
    • 10.0.0.2:
      2021-10-08T08:51:26.565Z [ERROR] raft-net: failed to decode incoming command: error="read tcp 10.0.0.2:4647->10.0.0.1:55414: read:
      2021-10-08T08:51:27.671Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=10.0.0.1:4647
      2021-10-08T08:51:27.671Z [INFO]  nomad.raft: entering candidate state: node="Node at 10.0.0.2:4647 [Candidate]" term=203
      2021-10-08T08:51:27.680Z [INFO]  nomad.raft: election won: tally=2
      2021-10-08T08:51:27.680Z [INFO]  nomad.raft: entering leader state: leader="Node at 10.0.0.2:4647 [Leader]"
      2021-10-08T08:51:27.680Z [INFO]  nomad.raft: added peer, starting replication: peer=10.0.0.3:4647
      2021-10-08T08:51:27.680Z [INFO]  nomad.raft: added peer, starting replication: peer=10.0.0.1:4647
      2021-10-08T08:51:27.681Z [INFO]  nomad: cluster leadership acquired
      2021-10-08T08:51:27.681Z [INFO]  nomad.raft: pipelining replication: peer="{Voter 10.0.0.1:4647 10.0.0.1:4647}"
      2021-10-08T08:51:27.683Z [INFO]  nomad.raft: pipelining replication: peer="{Voter 10.0.0.3:4647 10.0.0.3:4647}"
      
    • 10.0.0.3:
      2021-10-08T08:51:26.565Z [ERROR] raft-net: failed to decode incoming command: error="read tcp 10.0.0.3:4647->10.0.0.1:39134: read:
      2021-10-08T08:51:27.673Z [WARN]  nomad.raft: rejecting vote request since we have a leader: from=10.0.0.2:4647 leader=10.0.0.1:464
      

tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Nov 9, 2021
@kpweiler

I'm seeing the same problem on Nomad v1.1.0: a duplicate periodic process was launched during a leader election.

@mikenomitch
Contributor

mikenomitch commented Feb 24, 2022

I've heard of a similar issue being resolved for some users by removing time_zone from the periodic stanza. If daylight saving time is a factor, this might be a hassle to manage, but if you are running into this issue currently and have time_zone set, I would try removing it as a workaround. (That said, I don't see it in the job file posted by the original reporter.)

If anybody tries this, please let me know if it alleviates the issue. This doesn't fix the core issue of course, but might be a serviceable workaround until we are able to fix the root cause.
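
For anyone unsure what this refers to, the workaround applies to a stanza along these lines (hypothetical example; the time zone value is made up). Dropping the time_zone line falls back to the default of UTC:

periodic {
  cron             = "30 * * * *"
  prohibit_overlap = true

  # Removing this line is the suggested workaround; without it,
  # the cron expression is evaluated in UTC (the default).
  time_zone        = "America/Los_Angeles"
}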

@cread

cread commented Jun 16, 2022

This bug is still causing us problems from time to time. Is anyone actually working on trying to fix it?

@mikenomitch
Contributor

Hey @cread, sorry about this one dragging out. This is on our radar, but unfortunately not something we can prioritize right now. Between 1.4 commitments and some other pressing bugs, we're more or less at capacity for at least several weeks.

I'll let you know if we pick it up. Or if anybody wants to take a crack at a fix, let me know and I can provide some rough guidance.

@kpweiler

kpweiler commented Jul 8, 2022

FWIW, we're also seeing this issue with time_zone unset (which in the periodic stanza defaults to UTC).

louievandyke added the hcc/cst Admin - internal label Jul 13, 2022
@olanmills

How is this not a higher priority issue? It seems like one of the main features of Nomad is to orchestrate jobs for you, yet it's not reliable and you have to orchestrate it yourself anyway. At the very least, you need to update your documentation so that people know that prohibit_overlap doesn't actually work. You're still advertising the feature in your documentation here:
https://www.nomadproject.io/docs/job-specification/periodic

hashicorp locked as too heated and limited conversation to collaborators Sep 21, 2022
@DerekStrickland
Contributor

Referencing #14505 as a possible duplicate issue.

mikenomitch added the theme/batch Issues related to batch jobs and scheduling label Dec 6, 2022
jrasell assigned jrasell and unassigned lgfa29 Mar 10, 2023
@jrasell
Member

jrasell commented Mar 21, 2023

Hi everyone, I am currently looking into this and, as a first step, have worked on a reliable local reproduction. I am using the Vagrantfile in the repository root for isolation and am building Nomad from the current main source at a633b79fb535679ce7776a0d65c88c788ba8ae92.

Start a Nomad cluster using the cluster.sh script, ensuring all processes run as root, via sudo ./dev/cluster/cluster.sh.

Register the jobspec below via nomad job run 11052.nomad.hcl and then trigger a forced periodic instance using the nomad job periodic force test command. Wait for the allocation to be successfully running before moving forward.

job "test" {
  type = "batch"
  periodic {
    cron             = "* * * * * *"
    prohibit_overlap = true
  }
  group "test" {
    task "test" {
      driver = "raw_exec"
      config {
        command = "sleep"
        args    = ["30000s"]
      }
    }
  }
}

Discover which Nomad server is the leader using nomad server members, then find the process ID for that agent by running sudo ps -aef | grep server<NUM>, where the number corresponds to the number within the leader's server name. Kill the leader process using sudo kill -9 <PID> and wait for leadership to transition.

Upon transition, there will now be two instances of the periodic job running, despite the job including the prohibit_overlap parameter.

ID                        Type            Priority  Status   Submit Date
test                      batch/periodic  50        running  2023-03-21T08:38:15Z
test/periodic-1679387898  batch           50        running  2023-03-21T08:38:18Z
test/periodic-1679388004  batch           50        running  2023-03-21T08:40:04Z

I will be working through this scenario to try to identify what behaviour is causing this, and will post any relevant findings in a follow-up comment. Thanks for everyone's patience.

Juanadelacuesta pushed a commit that referenced this issue Mar 21, 2023
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true

Fixes #11052
When restoring the periodic dispatcher, all periodic jobs are forced without checking for previous children.
Juanadelacuesta added a commit that referenced this issue Mar 21, 2023
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true

Fixes #11052
When restoring the periodic dispatcher, all periodic jobs are forced without checking for previous children.
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Mar 27, 2023
Juanadelacuesta added a commit that referenced this issue Mar 28, 2023
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true (#16583)

* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring the periodic dispatcher, all periodic jobs are forced without checking for previous children.

* style: refactor force run function

* fix: remove defer and inline unlock for speed optimization

* Update nomad/leader.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <jrasell@users.noreply.github.com>

* style: refactor tests to use must

* fix: move back from defer to calling unlock before returning.

createEval can't be called with the lock on

* style: refactor test to use must

* added new entry to changelog and update comments

---------

Co-authored-by: James Rasell <jrasell@hashicorp.com>
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
jrasell added a commit that referenced this issue Mar 28, 2023
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true (#16583)

jrasell added a commit that referenced this issue Mar 28, 2023
Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true (#16583)