Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Periodic job with timezone wrongly runned after leader election. #2721

Closed
cyrilgdn opened this issue Jun 20, 2017 · 2 comments
Closed

Periodic job with timezone wrongly runned after leader election. #2721

cyrilgdn opened this issue Jun 20, 2017 · 2 comments

Comments

@cyrilgdn
Copy link

Nomad version

Nomad v0.5.6

Issue

We have some connectivity problems between our nomad servers and sometimes the leader changed because current leader can't contact other nodes.
After this election, some periodic job are force run whereas they were already launched at the good time.

The particularity is that our nomad servers and clients are in UTC and our nomad periodic job specify a timezone.

I'm not sure but I think in this call: https://github.com/hashicorp/nomad/blob/master/nomad/leader.go#L297
The function is: https://github.com/hashicorp/nomad/blob/master/nomad/structs/structs.go#L1866

func (p *PeriodicConfig) Next(fromTime time.Time) time.Time {
    switch p.SpecType {
    case PeriodicSpecCron:
        if e, err := cronexpr.Parse(p.Spec); err == nil {
            return e.Next(fromTime)
        }    
        [...]
    }
}

The cron parser does not take care of the configured timezone. So with the job above, the last launch time was in UTC at 2017-06-19 17:00 so the next launch parsed in "0 19 * * *" without timezone will be 2017-06-19 19:00 UTC

When leader changed the next day at 2017-06-20 09:21, computed next launch time is in the past so Nomad runs the job.

I see in other call to Next function that fromTime is convert to the periodic timezone
(e.g.: https://github.com/hashicorp/nomad/blob/master/nomad/periodic.go#L218)

I'm just not sure where the launch.Launch come from in restorePeriodicDispatcher
(if it's the timestamp parsed for the last periodic sub-job name, it's in UTC, but if all evaluations were gc, I don't know...)

Job file

job "my-batch-job" {
    [...]
    type = "batch"

    periodic {
        time_zone = "Europe/Zurich"
        cron = "0 19 * * * *"
        prohibit_overlap = true
    }
    [...]
}

Nomad logs

Multiple servers logs:

i-079a327fc5a9688c1    Jun 20 09:21:15 nomad[5955]: 2017/06/20 09:21:15 [WARN] raft: Failed to contact 172.31.61.159:4647 in 500.187758ms
i-079a327fc5a9688c1    Jun 20 09:21:16 nomad[5955]: 2017/06/20 09:21:16 [WARN] raft: Failed to contact 172.31.61.159:4647 in 988.181057ms
i-079a327fc5a9688c1    Jun 20 09:21:16 nomad[5955]: 2017/06/20 09:21:16 [WARN] raft: Failed to contact 172.31.61.159:4647 in 1.421098307s
i-079a327fc5a9688c1    Jun 20 09:21:17 nomad[5955]: 2017/06/20 09:21:17 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 1.889183718s
i-079a327fc5a9688c1    Jun 20 09:21:17 nomad[5955]: 2017/06/20 09:21:17 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 2.366315841s
i-0a9d68ba8a888deff    Jun 20 09:21:17 nomad[4819]: 2017/06/20 09:21:17 [WARN] raft: Heartbeat timeout from "172.31.17.209:4647" reached, starting election
i-0a9d68ba8a888deff    Jun 20 09:21:17 nomad[4819]: 2017/06/20 09:21:17 [INFO] raft: Node at 172.31.61.159:4647 [Candidate] entering Candidate state in term 477 
i-079a327fc5a9688c1    Jun 20 09:21:17 nomad[5955]: 2017/06/20 09:21:17 [WARN] raft: Rejecting vote request from 172.31.61.159:4647 since we have a leader: 172.31.17.209:4647
i-0a9d68ba8a888deff    Jun 20 09:21:17 nomad[4819]: 2017/06/20 09:21:17 [DEBUG] raft: Votes needed: 2
i-0a9d68ba8a888deff    Jun 20 09:21:17 nomad[4819]: 2017/06/20 09:21:17 [DEBUG] raft: Vote granted from 172.31.61.159:4647 in term 477. Tally: 1
i-079a327fc5a9688c1    Jun 20 09:21:18 nomad[5955]: 2017/06/20 09:21:18 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 2.82117036s
i-079a327fc5a9688c1    Jun 20 09:21:18 nomad[5955]: 2017/06/20 09:21:18 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 3.304393447s
i-079a327fc5a9688c1    Jun 20 09:21:19 nomad[5955]: 2017/06/20 09:21:19 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 3.801673851s
i-079a327fc5a9688c1    Jun 20 09:21:19 nomad[5955]: 2017/06/20 09:21:19 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 4.293306809s
i-0a9d68ba8a888deff    Jun 20 09:21:19 nomad[4819]: 2017/06/20 09:21:19 [INFO] raft: Node at 172.31.61.159:4647 [Candidate] entering Candidate state in term 478 
i-0a9d68ba8a888deff    Jun 20 09:21:19 nomad[4819]: 2017/06/20 09:21:19 [WARN] raft: Election timeout reached, restarting election
i-079a327fc5a9688c1    Jun 20 09:21:19 nomad[5955]: 2017/06/20 09:21:19 [WARN] raft: Rejecting vote request from 172.31.61.159:4647 since we have a leader: 172.31.17.209:4647
i-0a9d68ba8a888deff    Jun 20 09:21:19 nomad[4819]: 2017/06/20 09:21:19 [DEBUG] raft: Vote granted from 172.31.61.159:4647 in term 478. Tally: 1
i-0a9d68ba8a888deff    Jun 20 09:21:19 nomad[4819]: 2017/06/20 09:21:19 [DEBUG] raft: Votes needed: 2
i-079a327fc5a9688c1    Jun 20 09:21:20 nomad[5955]: 2017/06/20 09:21:20 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 4.763900551s
i-079a327fc5a9688c1    Jun 20 09:21:20 nomad[5955]: 2017/06/20 09:21:20 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 5.253341459s
i-079a327fc5a9688c1    Jun 20 09:21:21 nomad[5955]: 2017/06/20 09:21:21 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 5.729087854s
i-0a9d68ba8a888deff    Jun 20 09:21:21 nomad[4819]: 2017/06/20 09:21:21 [WARN] raft: Election timeout reached, restarting election
i-0a9d68ba8a888deff    Jun 20 09:21:21 nomad[4819]: 2017/06/20 09:21:21 [INFO] raft: Node at 172.31.61.159:4647 [Candidate] entering Candidate state in term 479 
i-0a9d68ba8a888deff    Jun 20 09:21:21 nomad[4819]: 2017/06/20 09:21:21 [DEBUG] raft: Votes needed: 2
i-0a9d68ba8a888deff    Jun 20 09:21:21 nomad[4819]: 2017/06/20 09:21:21 [DEBUG] raft: Vote granted from 172.31.61.159:4647 in term 479. Tally: 1
i-079a327fc5a9688c1    Jun 20 09:21:21 nomad[5955]: 2017/06/20 09:21:21 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 6.179051114s
i-079a327fc5a9688c1    Jun 20 09:21:21 nomad[5955]: 2017/06/20 09:21:21 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 6.589734023s
i-0a9d68ba8a888deff    Jun 20 09:21:22 nomad[4819]: 2017/06/20 09:21:22 [WARN] raft: Election timeout reached, restarting election
i-0a9d68ba8a888deff    Jun 20 09:21:22 nomad[4819]: 2017/06/20 09:21:22 [INFO] raft: Node at 172.31.61.159:4647 [Candidate] entering Candidate state in term 480 
i-0a9d68ba8a888deff    Jun 20 09:21:22 nomad[4819]: 2017/06/20 09:21:22 [DEBUG] raft: Vote granted from 172.31.61.159:4647 in term 480. Tally: 1
i-0a9d68ba8a888deff    Jun 20 09:21:22 nomad[4819]: 2017/06/20 09:21:22 [DEBUG] raft: Votes needed: 2
i-079a327fc5a9688c1    Jun 20 09:21:22 nomad[5955]: 2017/06/20 09:21:22 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 7.047892322s
i-079a327fc5a9688c1    Jun 20 09:21:22 nomad[5955]: 2017/06/20 09:21:22 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 7.511820449s
i-079a327fc5a9688c1    Jun 20 09:21:23 nomad[5955]: 2017/06/20 09:21:23 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 7.982202505s
i-079a327fc5a9688c1    Jun 20 09:21:23 nomad[5955]: 2017/06/20 09:21:23 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 8.452748352s
i-079a327fc5a9688c1    Jun 20 09:21:24 nomad[5955]: 2017/06/20 09:21:24 [DEBUG] memberlist: Failed UDP ping: i-0a9d68ba8a888deff.europe (timeout reached)
i-079a327fc5a9688c1    Jun 20 09:21:24 nomad[5955]: 2017/06/20 09:21:24 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 8.95072332s
i-079a327fc5a9688c1    Jun 20 09:21:24 nomad[5955]: 2017/06/20 09:21:24 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 9.446581994s
i-079a327fc5a9688c1    Jun 20 09:21:25 nomad[5955]: 2017/06/20 09:21:25 [DEBUG] raft: Failed to contact 172.31.61.159:4647 in 9.915657351s
i-079a327fc5a9688c1    Jun 20 09:21:25 nomad[5955]: 2017/06/20 09:21:25 [WARN] raft: Rejecting vote request from 172.31.61.159:4647 since we have a leader: 172.31.17.209:4647
i-079a327fc5a9688c1    Jun 20 09:21:25 nomad[5955]: 2017/06/20 09:21:25 [ERR] raft: peer {Voter 172.31.61.159:4647 172.31.61.159:4647} has newer term, stopping replication
i-079a327fc5a9688c1    Jun 20 09:21:25 nomad[5955]: 2017/06/20 09:21:25 [INFO] raft: Node at 172.31.17.209:4647 [Follower] entering Follower state (Leader: "") 
i-079a327fc5a9688c1    Jun 20 09:21:25 nomad[5955]: 2017/06/20 09:21:25 [INFO] raft: aborting pipeline replication to peer {Voter 172.31.61.159:4647 172.31.61.159:4647}
i-079a327fc5a9688c1    Jun 20 09:21:25 nomad[5955]: 2017/06/20 09:21:25.411908 [INFO] nomad: cluster leadership lost
i-079a327fc5a9688c1    Jun 20 09:21:25 nomad[5955]: 2017/06/20 09:21:25 [INFO] raft: aborting pipeline replication to peer {Voter 172.31.25.212:4647 172.31.25.212:4647}
i-079a327fc5a9688c1    Jun 20 09:21:25 nomad[5955]: 2017/06/20 09:21:25.412201 [ERR] worker: failed to dequeue evaluation: eval broker disabled
i-079a327fc5a9688c1    Jun 20 09:21:25 nomad[5955]: 2017/06/20 09:21:25 [DEBUG] raft-net: 172.31.17.209:4647 accepted connection from: 172.31.61.159:13841
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25 [ERR] raft-net: Failed to flush response: write tcp 172.31.61.159:4647->172.31.17.209:57671: write: connection reset by peer
i-079a327fc5a9688c1    Jun 20 09:21:25 nomad[5955]: 2017/06/20 09:21:25 [DEBUG] raft-net: 172.31.17.209:4647 accepted connection from: 172.31.61.159:13848
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25 [INFO] raft: Node at 172.31.61.159:4647 [Leader] entering Leader state
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25 [INFO] raft: Election won. Tally: 2
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25 [DEBUG] raft: Vote granted from 172.31.17.209:4647 in term 482. Tally: 2
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25 [INFO] raft: Added peer 172.31.17.209:4647, starting replication
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25 [INFO] raft: Added peer 172.31.25.212:4647, starting replication
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25.781784 [INFO] nomad: cluster leadership acquired
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25 [INFO] raft: pipelining replication to peer {Voter 172.31.25.212:4647 172.31.25.212:4647}
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25 [INFO] raft: pipelining replication to peer {Voter 172.31.17.209:4647 172.31.17.209:4647}
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25.872845 [DEBUG] nomad.periodic: registered periodic job "my-batch-job"
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25.896155 [DEBUG] nomad.periodic: periodic job "my-batch-job" force run during leadership establishment
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25.896186 [DEBUG] leader: reconciling job summaries at index: 311453
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25.899410 [DEBUG] sched: <Eval 'edb3b6ca-64a7-24fd-d5d3-83ba52cf2d1d' JobID: 'my-batch-job/periodic-1497950485'>: allocs: (place 1) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25.899085 [DEBUG] worker: dequeued evaluation edb3b6ca-64a7-24fd-d5d3-83ba52cf2d1d
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25 [DEBUG] sched: <Eval '23b05878-882c-c702-0a4d-0ed2d89856cf' JobID: 'my-batch-job'>: allocs: (place 1) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25 [DEBUG] sched: <Eval '2e42d76d-f26c-6629-b98b-55fbb383299a' JobID: 'my-batch-job/periodic-1497950485'>: allocs: (place 1) (update 0) (migrate 0) (stop 0) (ignore 0) (lost 0)
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25.928703 [DEBUG] worker: submitted plan for evaluation edb3b6ca-64a7-24fd-d5d3-83ba52cf2d1d
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25.936161 [DEBUG] worker: ack for evaluation edb3b6ca-64a7-24fd-d5d3-83ba52cf2d1d
i-0a9d68ba8a888deff    Jun 20 09:21:25 nomad[4819]: 2017/06/20 09:21:25.936133 [DEBUG] worker: updated evaluation <Eval 'edb3b6ca-64a7-24fd-d5d3-83ba52cf2d1d' JobID: 'my-batch-job/periodic-1497950485'>
@dadgar
Copy link
Contributor

dadgar commented Jun 20, 2017

@cyrilgdn Thanks for the issue and the debugging! I think you are spot on about the restore not taking the timezone into consideration! Will hopefully get this fixed in 0.6.0

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 12, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants