[feature] Timeout for batch jobs #1782

Open
sheldonkwok opened this issue Oct 3, 2016 · 28 comments · May be fixed by #18456
Comments

@sheldonkwok
Contributor

I'm currently running a handful of periodic batch jobs. It's great that Nomad doesn't schedule another one if the current one is still running. However, I think it would be helpful if Nomad could stop a batch job once it has been running beyond a set time. Maybe a script could be run on timeout, or the task itself would just have to handle the signal.

@dadgar
Contributor

dadgar commented Oct 4, 2016

Hey Sheldon,

You could accomplish this yourself by wrapping what you actually want to run in a small script that waits until either the task finishes or the timeout elapses, and then exits 1.
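A minimal sketch of such a wrapper (the task path, the TIMEOUT_SECS variable, and the poll interval are all illustrative, nothing Nomad-specific):

```bash
#!/usr/bin/env bash
# wrapper.sh -- run the real task, but give up after TIMEOUT_SECS seconds.
# /path/to/real-task and TIMEOUT_SECS are placeholders for illustration.
set -euo pipefail

TIMEOUT_SECS="${TIMEOUT_SECS:-3600}"

/path/to/real-task "$@" &           # start the real work in the background
task_pid=$!

elapsed=0
while kill -0 "$task_pid" 2>/dev/null; do   # loop while the task is alive
  if [ "$elapsed" -ge "$TIMEOUT_SECS" ]; then
    echo "task exceeded ${TIMEOUT_SECS}s, killing it" >&2
    kill "$task_pid"
    wait "$task_pid" || true
    exit 1                          # non-zero exit so Nomad records a failure
  fi
  sleep 5
  elapsed=$((elapsed + 5))
done

wait "$task_pid"                    # propagate the task's own exit code
```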

@sheldonkwok
Contributor Author

That's how I'm handling it right now but I was thinking it would be cool if Nomad could do it. I understand if it seems like bloat though :)

@OferE

OferE commented Feb 23, 2017

+1 - important feature for batch runs. It's not so clean to handle this ourselves.

@dadgar dadgar added this to the near-term milestone Feb 25, 2017
@schmichael schmichael removed this from the near-term milestone Jul 31, 2017
@alxark

alxark commented Sep 1, 2017

I think this function should be available not only for batch jobs but also for regular services; it would let us implement a "chaos monkey" function right inside Nomad. That would increase system stability, because every service would have to be ready for downtime.

@jippi
Contributor

jippi commented Sep 1, 2017

As mentioned in Gitter chat, the timeout binary in coreutils can do this inside the container if you need a fix right now.

```bash
timeout 5 /path/to/slow/command with options
```

@alxark

alxark commented Sep 1, 2017

I think it would be better to add "max_lifetime", with the ability to specify it as a range or a concrete value. For example, 10h-20h would mean the daemon might be killed after 11h or after 19h, but the maximum time will be 20h. Implementing chaos monkey this way would be a great feature in my opinion, and you wouldn't need any 3rd-party apps =)

@shantanugadgil
Contributor

If a timeout function is implemented, it could be used to mimic classic HPC schedulers like PBS, TORQUE, SGE, etc.

Having it as a first-class feature would indeed be useful for many folks, including me!
Hope this does get implemented.

Thanks and Regards,
Shantanu

@mlehner616

Just adding a use case here. Let's say I have an app that implements a timeout internally. Now assume there's a bug in this app that causes it to hang occasionally under certain conditions, so it never reaches its hard-coded "timeout" because it has essentially stopped responding. We should have a way, at the infrastructure level, to enforce a relatively simple drop-dead deadline after which the scheduler kills a task that has become unresponsive.

Nomad is better equipped to provide that failsafe at the infrastructure level than timeout code rewritten in every app, simply because it doesn't rely on the app's runtime to perform the kill.

@shantanugadgil
Contributor

I agree, but as a temporary workaround, how about the timeout command with its kill timeout parameter?

http://man7.org/linux/man-pages/man1/timeout.1.html
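For example, a sketch using coreutils timeout with its --kill-after option (the durations here are illustrative):

```bash
# Send SIGTERM after 1 hour; if the process is still running 30 seconds later,
# follow up with SIGKILL so a wedged process cannot ignore the timeout.
timeout --kill-after=30s 1h /path/to/slow/command with options
```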

@Miserlou

Miserlou commented Aug 17, 2018

+1 for this, it's basic functionality for a job scheduler. Amazing this doesn't exist. @mlehner616 is obviously correct about why having the timeout checker inside the container itself is a boneheaded recommendation. We got bit by 3 hung jobs out of 100,000 that prevented our elastic infrastructure from scaling back down, costing a nice chunk of change.

@AndrewSav

@Miserlou as mentioned earlier in this thread, a workaround would be to wrap your app in a timeout script. There is an example of how you can do it above. That might save your bacon in the scenario you described.

@onlyjob
Contributor

onlyjob commented Aug 29, 2018

Timeout for batch jobs is an important safeguard. We can't rely on jobs' good behaviour... A job without a time limit is effectively a service, hence a timeout is crucial to constrain buggy tasks that might run for too long...

@wiedenmeier

I'd also very much like to see nomad implement this, for the use case where nomad's parameterized jobs are used as a bulk task processing system, similar to the workflow described here: https://www.hashicorp.com/blog/replacing-queues-with-nomad-dispatch

There are major advantages for us in using this workflow, as it takes advantage of infrastructure already in place to handle autoscaling, rather than having to set up a new system using Celery or similar task queues. The lack of a built-in timeout mechanism for batch jobs makes the infrastructure required for this fairly common (afaik) use case quite a bit more complex.

Handling the timeout in the tasks themselves is not a safe approach, for the reasons mentioned above, and would also increase the complexity of individual tasks, which is not ideal. Therefore the dispatcher must manage the timeout itself and kill batch jobs once the limit is reached. This makes it inconvenient to manage jobs that need different timeouts from a single bulk task management system, as the timeout configuration has to be stored centrally, separate from the actual job specification.

There are workarounds for this, but it would be very nice to see nomad itself handle timeouts, both for safety and to simplify using nomad.
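For reference, a rough sketch of what a dispatcher-managed timeout could look like using the standard Nomad CLI (the job name bulk-task and the 30-minute deadline are illustrative):

```bash
#!/usr/bin/env bash
# Dispatch a parameterized batch job and stop it if it runs past DEADLINE_SECS.
# "bulk-task" and DEADLINE_SECS are illustrative; adapt to the real job spec.
set -euo pipefail

DEADLINE_SECS=1800

# -detach prints the dispatched job ID instead of waiting for it to finish.
dispatch_id=$(nomad job dispatch -detach bulk-task | awk '/Dispatched Job ID/ {print $NF}')
echo "dispatched ${dispatch_id}"

elapsed=0
while true; do
  status=$(nomad job status -short "${dispatch_id}" | awk '/^Status/ {print $NF}')
  if [ "$status" != "running" ] && [ "$status" != "pending" ]; then
    break                                   # job finished on its own
  fi
  if [ "$elapsed" -ge "$DEADLINE_SECS" ]; then
    echo "deadline exceeded, stopping ${dispatch_id}" >&2
    nomad job stop "${dispatch_id}"
    exit 1
  fi
  sleep 30
  elapsed=$((elapsed + 30))
done
```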

@Miserlou

@magical-chicken - I strongly, strongly recommend you avoid using Nomad for that purpose. There are claims made in that blog post which are simply untrue, and many people are being duped by it.

See more here:
#4323 (comment)

@wiedenmeier

@Miserlou Thanks for the heads up, that is a pretty serious bug in nomad, and is pretty concerning since we have a large amount of infrastructure managed by it. The volume of dispatches we are handling currently isn't too high, so I'm hoping nomad will be ok to use here in the short term, but long term I will definitely consider switching this system over to a dedicated task queue.

Nomad will crash with out of memory
Hopefully HashiCorp intends to fix this; maybe they could add a configuration option for servers to use a memory-mapped file to store state rather than risking an OOM kill, or even have servers start rejecting additional job registrations when they're running out of memory. There's really no case where it's acceptable for servers to crash completely, or for secondary servers to fail to elect a new leader after the leader is lost.

@jxgriffiths

+1 Are there any plans to include this feature any time soon? It seems pretty important. Wrapping tasks in a timeout script is a bit hacky.

@epetrovich

+1

@grainnemcknight

grainnemcknight commented May 26, 2019

+1

@sabbene

sabbene commented Nov 7, 2019

A job run limit is an essential feature of a batch scheduler. All major batch schedulers (PBS, Slurm, LSF, etc.) have this capability. I've seen growing interest in a tool like Nomad, something that combines many of the features of a traditional batch scheduler with Kubernetes. But without a run time limit feature, integration into a traditional batch environment would be next to impossible. Is there any timeline on adding this feature to Nomad?

@karlem

karlem commented Jan 31, 2020

+1

@shantanugadgil
Contributor

@karlem you should add a +1 reaction to the first post rather than a separate message.
That's how they track demand for a feature.

If you know more folks who might be interested in this, you should encourage them to do so as well! 😉

@BirkhoffLee

The absence of this feature just killed my cluster. A periodic curl job piled up to 600+ pending allocations and tens of running ones. This caused very high disk I/O from Nomad and effectively rendered the affected nodes totally unresponsive. Then Consul stopped working as well, because of I/O timeouts from other nodes.

Of course you could argue that curl has built-in timeout options, but the point is that if a task scheduler does not provide this feature, there is no simple and unified way to keep all jobs organised and safe when each one decides on its own how long it wants to run.
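(For completeness, the per-task workaround in the curl case would be something like the following, using curl's own --fail and --max-time flags; the URL and the 30-second cap are illustrative:)

```bash
# Cap the whole transfer at 30 seconds and return a non-zero exit on HTTP
# errors, so a hung endpoint cannot wedge the allocation indefinitely.
curl --fail --max-time 30 https://example.com/endpoint
```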

@smaeda-ks

smaeda-ks commented Apr 8, 2022

GitHub Actions self-hosted runners with autoscaling are another good example. It's very much possible to run runners on Nomad as batch jobs and autoscale them using parameterized batch jobs, so tasks can be dispatched easily and triggered by GitHub webhooks upon receiving the queued events. Having max-lifetime support would be a great safeguard for this kind of dynamic job scheduling integrated with third-party systems.

@mikenomitch mikenomitch added the theme/batch Issues related to batch jobs and scheduling label Aug 11, 2022
@danielnegri

+1

@schmichael schmichael mentioned this issue Oct 22, 2022
@NickJLange

While a safety catch (timeout) is definitely a gap in the product, I don't think it captures the use case I'm looking for in #15011. I am looking for a stanza to run my type=service job M-F from 08:00-20:00, with a user-defined stop command when the driver supports it.

@shantanugadgil
Contributor

Today I hit this for Docker jobs. Our system is full of Docker cron jobs, and one job was stuck for 20 (twenty) days. 🥴

Without the timeout parameter, could there be some other "systematic" way to detect stuck jobs (or jobs running for too long)?

@jxgriffiths

jxgriffiths commented Oct 27, 2022 via email

@shantanugadgil
Contributor

@jxgriffiths thanks for the idea ...

Since your post we have been putting together a standalone Nomad checker job which goes through all batch jobs and figures out "stuck" allocations.

The allocation-based search was easy enough using a combination of curl, jq, date, and bash (we wanted to avoid Python as much as possible).

We also ended up putting together a jobs endpoint query for figuring out pending jobs too, but I think that is easily discoverable via metrics.

The subsequent question was how to individually tune the timeout for each job.

What we have done for this is to add a job-level meta parameter, which the checker job uses as the configuration value to eventually kill that particular job.

In case one has multiple groups/tasks in a batch job, one could also move the meta down into the groups or tasks as per requirement.
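For anyone curious, a simplified sketch of the idea, assuming a job-level meta key named kill_after_secs (the key name is ours, not a Nomad built-in) and direct access to the Nomad HTTP API:

```bash
#!/usr/bin/env bash
# Sketch of an external "stuck job" checker: for every running allocation,
# compare its age against the owning job's kill_after_secs meta value and
# stop batch jobs that have been running too long.
# kill_after_secs is an illustrative key name, not a Nomad built-in.
set -euo pipefail

NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
now_ns=$(( $(date +%s) * 1000000000 ))      # alloc CreateTime is in nanoseconds

curl -s "${NOMAD_ADDR}/v1/allocations" |
  jq -r '.[] | select(.ClientStatus == "running") | "\(.JobID) \(.CreateTime)"' |
  while read -r job_id create_ns; do
    job=$(curl -s "${NOMAD_ADDR}/v1/job/${job_id}")
    [ "$(echo "$job" | jq -r '.Type')" = "batch" ] || continue

    limit=$(echo "$job" | jq -r '.Meta.kill_after_secs // empty')
    [ -n "$limit" ] || continue             # no limit configured for this job

    age_secs=$(( (now_ns - create_ns) / 1000000000 ))
    if [ "$age_secs" -gt "$limit" ]; then
      echo "stopping ${job_id}: running for ${age_secs}s > ${limit}s" >&2
      curl -s -X DELETE "${NOMAD_ADDR}/v1/job/${job_id}" > /dev/null
    fi
  done
```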

Projects
Status: 1.9 & 1.10 Shortlist (uncommitted)