
Uncontrolled evaluation of periodic batch jobs during DST change #5410

Closed
boardwalk opened this issue Mar 12, 2019 · 6 comments · Fixed by #7894

Comments

@boardwalk

Nomad version

Nomad v0.8.7 (21a2d93+CHANGES)

Operating system and Environment details

Red Hat Enterprise Linux Server release 7.6 (Maipo)

Issue

I had a periodic job scheduled during the DST rollover this past weekend which was repeatedly and rapidly evaluated at 6AM UTC when it was scheduled for 2AM local/ET. For reference, ET was UTC-5 before the change and UTC-4 after. The job ran many, many thousands of times before it was caught, and I believe there were many times that many allocations that never ended up getting placed (I posted about that in #4532). The flood eventually crippled the cluster: Nomad was tracking so much state that the OOM killer came out in force, and even without that, Nomad was mostly unresponsive as far as I can tell. I was able to stop the stage-a-restart-services job, but I also seemed to have to set the eligibility of the nodes to false to drain the allocations from thousands per node down to the dozen or so that is normal.

Reproduction steps

Running a job scheduled the same way across the DST change should reproduce this, but I honestly don't have time to do things like bring up another Nomad instance in a VM where I have the privs to set the date.

Job file (if appropriate)

job "stage-a-restart-services" {
  type = "batch"
  periodic {
    cron = "0 0 2 * * * *"
    time_zone = "Local"
  }
  datacenters = ["a"]
  group "restart-services" {
    task "restart-services" {
      leader = true
      driver = "raw_exec"
      env {
        TIER = "stage"
        SITE = "a"
      }
      config {
        command = "/home/fds/dsotm/FDSdsotm_misc/bin/restart_services"
      }
    }
    task "store-logs" {
      driver = "raw_exec"
      config {
        command = "/home/fds/dsotm/FDSdsotm_misc/bin/store_logs"
        args = ["/home/fds/dsotm/log/${NOMAD_JOB_NAME}"]
      }
    }
  }
}

/home/fds/dsotm/FDSdsotm_misc/bin/restart_services is a short Python script that thankfully didn't actually do anything this time around.


c2nes commented Mar 15, 2019

We experienced the same issue with a periodic job configured with a 30 * * * * schedule and timezone "America/New_York" on Nomad 0.8.6.

The first problematic allocation was started at 2019-03-10T06:30:11.736Z. Once that allocation completed, a new allocation for the same job instance was created. This cycle continued until this morning, when we manually stopped the parent job and the child job instance. Ultimately, over 4000 allocations were created; virtually all completed successfully.

Attempting to manually stop just the child job (while leaving the parent periodic job registered) was unsuccessful. Nomad would accept the DELETE request, and the child job would sometimes briefly be marked as dead, but would almost immediately return to a running state with all of the old allocations still in place.

After completely stopping the job (both parent and child) we were able to successfully re-register the job. However, it is now not being scheduled for execution at all. Looking at other periodic jobs, the ones that aren't stuck in continuous execution loops appear to not have run at all since the EST/EDT switch over.
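
For reference, a minimal sketch of the periodic stanza described in this comment (the cron expression and time zone are taken from the comment itself; the rest of the job spec is omitted and assumed unremarkable):

periodic {
  cron      = "30 * * * *"
  time_zone = "America/New_York"
}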


stale bot commented Jun 13, 2019

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!


stale bot commented Jul 13, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍


notnoop commented Oct 24, 2019

Thank you so much for reporting this, and sorry for taking so long to respond. We plan to investigate and remedy this soon.

The issue here is that our cron library doesn't handle daylight saving transitions well. We have two complications: the library we use is deprecated and unmaintained [1], and its daylight saving handling is a known unresolved issue [2]. We'll investigate our options and address it soon.

Meanwhile, we recommend using the UTC time zone for periodic jobs, either in general or at least around DST transitions, if possible.

[1] https://github.com/gorhill/cronexpr
[2] gorhill/cronexpr#17
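
As a concrete sketch of that workaround applied to the job from the original report (only the periodic stanza changes; note that the schedule shifts from 2AM Eastern to 02:00 UTC):

periodic {
  # Runs at 02:00 UTC every day; adjust the hour if a particular
  # local wall-clock time is needed.
  cron      = "0 0 2 * * * *"
  time_zone = "UTC"
}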


notnoop commented May 4, 2020

Providing an update here with my notes.

We have two options:

We can fix the gorhill/cronexpr library to handle DST properly. Sadly, the DST PR gorhill/cronexpr#17 fails some of our tests, as it gets into infinite recursion and causes a stack overflow in some cases.

Alternatively, we can migrate to another maintained library. https://github.com/robfig/cron is a very reasonable choice: its handling of DST passed our tests, and the library is well maintained and commonly used.

The downside of switching libraries is that cronexpr supports some cron expression extensions not supported by any other library I looked at, so we risk introducing subtle compatibility changes (see the sketch after this comment):

  • Years: this is a simple thing to add
  • L (last day), W (week day), # (further constraints on days) - these are trickier to implement while ensuring that we adhere to gorhill/cronexpr semantics properly.

My current inclination is to check whether robfig/cron would welcome contributions for these extensions; their SpecSchedule struct would need to change significantly. If not, I would suggest fixing cronexpr as-is.
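
To make the compatibility concern above concrete, here is an illustrative (hypothetical) periodic stanza that leans on one of the cronexpr extensions listed in that comment; an expression like this parses under gorhill/cronexpr but has no direct equivalent in a standard five-field parser such as robfig/cron:

periodic {
  # "L" in the day-of-month field means "last day of the month",
  # a gorhill/cronexpr extension (7-field form:
  # second minute hour day-of-month month day-of-week year).
  cron      = "0 0 2 L * * *"
  time_zone = "UTC"
}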


github-actions bot commented Nov 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022