Manually started jobs with Run All (Catch-up) mode enabled are not automatically restarted after a server crash/restart. #757
Comments
I'm extremely sorry for this oversight. It will be fixed in the very next release, coming out in a few minutes.
Fixed in v0.9.50. Thank you so much for this detailed issue report.
No worries at all! Thanks for creating/maintaining this excellent OS tool, and for your exceptionally quick fix!
Hi @jhuckaby, Unfortunately this fix doesn't seem to have worked; following the same steps as above, the manually started job still doesn't get automatically restarted. I have updated to v0.9.51. Looking at your changes and the logs, I can see the "Reset event" log messages for Service1 and Service2, but Service1 still doesn't get restarted. Any ideas? Log extract attached. Many thanks,
Ah, so here is the problem. "Run All Mode" is really only for scheduled jobs, because all it really does is "rewind" the event cursor to a point in history, and then when the master server comes back up it "ticks" all the missing minutes, running any missed jobs along the way. But those jobs have to be actually scheduled to run on those missed minutes for it to trigger a job launch.
Since your job wasn't actually scheduled to run on 2024/05/15 15:24:00, it didn't fire off a new one. Adding "Retries" to the event won't work here either, because retries don't kick in for "aborted" jobs. When a server shuts down, the job is aborted. Hmmm.... This is a design flaw. I'll re-open this issue and keep thinking about ways to solve this. But please understand, Cronicle v0 is in maintenance mode, and I'm hard at work on a huge ground-up rewrite for the big v2 (coming out later this year, with any luck). I may not have time to solve this in Cronicle v0, as it looks like a core design oversight -- a truly missing feature that was never implemented properly. I'll put a huge warning in the docs that explains this issue.
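For illustration only, here is a rough sketch of the catch-up behavior described above. This is not Cronicle's actual code; `catchUpEvent`, `eventMatchesMinute`, and `launchJob` are hypothetical placeholders:

```js
// Illustrative sketch only (not Cronicle's real implementation).
// "Run All" catch-up replays every missed minute between the saved
// event cursor and "now", launching only minutes the event is scheduled for.
function catchUpEvent(event, cursorEpoch, nowEpoch, eventMatchesMinute, launchJob) {
  for (let epoch = cursorEpoch + 60; epoch <= nowEpoch; epoch += 60) {
    if (eventMatchesMinute(event, epoch)) {
      // Only minutes matching the event's schedule trigger a launch.
      launchJob(event, epoch);
    }
    // A job that was started manually at an off-schedule minute never
    // matches here, which is why it is not re-launched after a crash/restart.
  }
  return nowEpoch; // new cursor position
}
```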
Hey @jhuckaby - I've been looking at this off and on over the last few weeks; would something like the below work in

```js
if (job.catch_up && job.source) {
    // If manually started and catch-up enabled, attempt to relaunch or queue
    this.launchOrQueueJob(job, CALLBACK);
}
else {
    // otherwise, just rewind cursor instead
    this.rewindJob(job);
}
```
Oh hey, cool idea! This might just work, and is a very small code change. I need to consider all the ramifications and do a bunch of testing, however. I'll dive into this as soon as I can.
Ah yes, so, as I suspected, it's not quite as simple as your suggested change (but it's a start!). There are a number of cases that may result in an aborted job due to an unexpected server loss. Another one is that the server may have been rebooted, or Cronicle was restarted, in which case it detects the leftover job log on disk and "finishes" (aborts) the job on startup. That case also has to be handled, as it should trigger a rerun if the job has catch-up enabled and was manually started. Working through things...
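For illustration only, a minimal sketch of that startup-recovery case, assuming hypothetical helpers `loadJobFromLog` and `finishJob` (this is not Cronicle's real code):

```js
// Hypothetical sketch: on daemon startup, leftover job logs on disk indicate
// jobs that were still running when the previous process died.
const fs = require('fs');
const path = require('path');

function recoverLeftoverJobs(logDir, loadJobFromLog, finishJob) {
  for (const file of fs.readdirSync(logDir)) {
    if (!file.endsWith('.log')) continue;
    const job = loadJobFromLog(path.join(logDir, file));
    // Mark the job so the finalize step knows this was an unexpected abort,
    // and can decide whether to re-run it (catch-up enabled + manually started).
    job.unexpected_abort = true;
    finishJob(job, { code: 1, description: 'Job aborted due to server restart' });
  }
}
```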
So, there are actually a bunch of different cases that have to be handled:
So, in all of these different cases, what I need to do is add a custom flag to the job object, and then detect that flag when the job is finalized (completed and cleaned up). If the flag is set, and the job has catch-up mode enabled, and was manually started, then and only then should Cronicle re-run the job. But I also have to write a custom re-run function to facilitate this, because you can't really just shove the finished job object back in as-is.

Anyway, I'm working through all the cases and trying to test as much as I can. It has turned out to be a can of worms. I would normally not do this in Cronicle v0, because it's in "maintenance mode" (no new features), and I'm focusing all my efforts on the big v2 rewrite, but I'm going to make an exception for this issue, because this really is an unimplemented feature that should have existed from the start. It may take me a while to finish the code changes and test all the cases, but I'm working on it...
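For illustration only, a minimal sketch of the flag-and-rerun idea described above, using a hypothetical `unexpected_abort` flag and `rerunJob`/`cleanupJob` helpers (not Cronicle's actual API):

```js
// Hypothetical sketch of the finalize-time check described above.
function finalizeJob(job, rerunJob, cleanupJob) {
  if (job.unexpected_abort && job.catch_up && job.source) {
    // Catch-up enabled AND manually started AND lost to a server failure:
    // re-run via a dedicated helper, because the finished job object can't
    // simply be fed back into the normal launch path.
    rerunJob(job);
  }
  cleanupJob(job);
}
```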
Should be fixed in v0.9.52. I was unable to test all the cases, but I tested some of them. This is the best I can do for v0, I'm afraid (maintenance mode).
Many thanks, this seems to have done the trick.
Summary
Manually started jobs with Run All (Catch-up) mode enabled are not automatically restarted after a server crash or restart, even on worker hosts. This seems like an oversight to me, given that Run All mode is documented as being able to re-run jobs where the "Server running the job crashed" or where the "Server running the job was shut down". No mention is made, that I can find, of the jobs needing to have been started on a schedule (i.e. not manually) for this to happen.

Note that event retries (where just the job crashes, and not the server) do work for manually started jobs.
Recovering from server crashes where the jobs were running on a Primary Server
I understand per the docs that jobs running on a primary server are not and cannot be brought back up when the server crashes. However, my understanding is that this should happen on workers when they crash, but I don't see this happening either in the case of manually started jobs. For simplicity, the reproduction steps below explain how to recreate the issue I'm describing using a single-server setup, and by restarting (rather than crashing) the Cronicle daemon.
Steps to reproduce the problem
Using a Single Server Cronicle setup:
systemctl restart cronicle.service
Your Setup
Operating system and version?
WSL, Ubuntu 22.04
Node.js version?
v20.12.1
Cronicle software version?
0.9.48
Are you using a multi-server setup, or just a single server?
Single Server
Are you using the filesystem as back-end storage, or S3/Couchbase?
Local FS.
Can you reproduce the crash consistently?
Yes
Log Excerpts