[BUG] Periodic Scheduler stuck on March 8th 2AM EST, while servers clock was set to UTC #7289

Closed
burdandrei opened this issue Mar 8, 2020 · 12 comments · Fixed by #7894
Labels
theme/batch Issues related to batch jobs and scheduling theme/scheduling type/bug

Comments

@burdandrei
Contributor

burdandrei commented Mar 8, 2020

Nomad version

Nomad v0.10.3 (65af1b9)

Operating system and Environment details

# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.4 LTS"

# cat /etc/timezone 
Etc/UTC

Issue

Periodic jobs stopped firing exactly at the DST change.

Reproduction steps

Wait for the daylight saving time change 🙄

Here is the 24-hour log pattern from the Nomad server leader. Note the obvious spike at 2 AM UTC (8 AM local browser time) and the decrease after the Nomad leader was restarted and migrated to another server.

image

Will post logs after sanitizing

@burdandrei
Contributor Author

Found the exact message nomad was shouting:
skipping launch of periodic job because job prohibits
image

Please let me know what other logs or info would be helpful.

@jippi
Contributor

jippi commented Mar 8, 2020

Note: This has been a multi-year issue hitting us every year - see #5410 and #3392

It looks like the upstream project (which was archived a long time ago) has had a fix for it since 2016 that was never merged (gorhill/cronexpr#17).

@burdandrei
Contributor Author

Following @jippi's hypothesis, the scheduler goes haywire even if just one periodic job is not in the UTC time zone.
I checked, and in the affected cluster a couple of jobs indeed had the America/New_York time zone configured.
Other clusters, which only have crons in the UTC time zone, survived the night fine.
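
For anyone wondering why a job in America/New_York is special on this date: on 2020-03-08 the local clock jumps from 01:59 straight to 03:00, so the 02:xx wall-clock hour never occurs. The sketch below only illustrates that gap. It is not the cronexpr or Nomad code path (the thread does not show it), the function names are made up for illustration, and the offsets are hardcoded for this one transition instead of coming from a time zone database.

```python
from datetime import datetime, timedelta, timezone

# US daylight saving time began on 2020-03-08 at 02:00 local time in
# America/New_York, which is 07:00 UTC. Hardcoded for this one transition only.
DST_START_UTC = datetime(2020, 3, 8, 7, 0, tzinfo=timezone.utc)
EST = timezone(timedelta(hours=-5), "EST")
EDT = timezone(timedelta(hours=-4), "EDT")


def to_new_york(utc_dt):
    """Convert a UTC instant to New York wall-clock time around this transition."""
    zone = EST if utc_dt < DST_START_UTC else EDT
    return utc_dt.astimezone(zone)


def local_hours_seen(start_utc, end_utc, step=timedelta(minutes=1)):
    """Return the sorted local wall-clock hours that occur in [start_utc, end_utc)."""
    hours = set()
    t = start_utc
    while t < end_utc:
        hours.add(to_new_york(t).hour)
        t += step
    return sorted(hours)


# Local midnight on March 8 is 05:00 UTC (still EST); the local day lasts 23 hours.
day_start = datetime(2020, 3, 8, 5, 0, tzinfo=timezone.utc)
hours = local_hours_seen(day_start, day_start + timedelta(hours=23))

print(2 in hours)  # False: no instant on this day maps to 02:xx local time
print(to_new_york(DST_START_UTC - timedelta(minutes=1)).strftime("%H:%M %Z"))  # 01:59 EST
print(to_new_york(DST_START_UTC).strftime("%H:%M %Z"))  # 03:00 EDT
```

Any "next run" computation that searches for a local time inside that missing hour has to handle the gap explicitly, which is presumably what the unmerged upstream fix is about.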

@the-maldridge

I just got paged into a "fun" outage where a single task running in a localized timezone caused hundreds of other batch tasks to not be dispatched. What can be done to make sure this bug doesn't go the way of the others referenced above?

@burdandrei
Contributor Author

Similar to us @the-maldridge =)
We added a forced check of the periodic jobs' time zone for now.
But obviously, when you're running a multi-DC environment with a distributed team of developers, being able to set a time zone is very handy from the developer's perspective.
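
A check like this can be as simple as the sketch below. It is not the actual check used here. It assumes job definitions are available as parsed dictionaries with `ID`, `Periodic` and `TimeZone` keys, modeled on Nomad's JSON job representation, and that a missing or empty `TimeZone` means UTC. The function name and the sample jobs are hypothetical.

```python
def non_utc_periodic_jobs(jobs):
    """Return the IDs of periodic jobs whose time zone is not UTC."""
    flagged = []
    for job in jobs:
        periodic = job.get("Periodic")
        if not periodic:
            continue  # not a periodic job, nothing to check
        # Assumption: a missing or empty TimeZone falls back to UTC.
        tz = periodic.get("TimeZone") or "UTC"
        # Assumption: "Etc/UTC" is accepted as an alias for UTC.
        if tz not in ("UTC", "Etc/UTC"):
            flagged.append(job["ID"])
    return flagged


# Hypothetical job definitions for illustration only.
jobs = [
    {"ID": "nightly-report", "Periodic": {"Spec": "0 2 * * *", "TimeZone": "America/New_York"}},
    {"ID": "cleanup", "Periodic": {"Spec": "*/5 * * * *", "TimeZone": "UTC"}},
    {"ID": "legacy-sync", "Periodic": {"Spec": "0 * * * *"}},
    {"ID": "web", "Periodic": None},
]

print(non_utc_periodic_jobs(jobs))  # ['nightly-report']
```

Running something like this in CI or as a pre-submit hook rejects non-UTC schedules before they reach the cluster, at the cost of the developer convenience mentioned above.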

@jrasell
Member

jrasell commented Mar 9, 2020

Hi @burdandrei, @jippi and @the-maldridge. Thanks a lot for the detail in this issue, and apologies that this has both caused impact and been around for a while. The team started discussing yesterday how best to resolve this, and we will talk about it again today. I'll likely close this issue as a duplicate of the already linked #5410; however, I think it's worth leaving this open for at least today so that anyone else encountering this problem can quickly and easily find the conversation.

@jrasell jrasell added theme/batch Issues related to batch jobs and scheduling type/bug theme/scheduling labels Mar 9, 2020
@burdandrei
Contributor Author

Thanks for the update @jrasell

@Dirrk

Dirrk commented Mar 9, 2020

We also had this happen in our dev/prod clusters running 0.9.6 on Ubuntu 16.04. Unfortunately, fluentd dropped the logs that would have ended up in Kibana, and we shut down the node once it alerted for 0% disk space, which got it replaced in the autoscaling group. So I don't have much debugging info to add, but I do know that it used up a ton of memory and disk space on the box. Hopefully this will at least help others next year.

image

image

@jdebbink

jdebbink commented Mar 9, 2020

We got hit by this issue as well. What did you do to get things back into a working state?

@the-maldridge

@jdebbink We had great luck with removing anything that wasn't running in the UTC time zone. After that we did a stop/start on all jobs in batch/periodic mode and ran a monitoring query to figure out what needed an on-demand launch.

@jippi
Contributor

jippi commented Mar 9, 2020

Also, just restarting the Nomad leader made everything work without any job changes :)

@github-actions

github-actions bot commented Nov 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022