Scheduling Queue #335

jbee · 2022-09-15T10:59:46Z

jbee
Sep 15, 2022
Collaborator

Background / Problem

At the moment jobs are

- executed ad-hoc
- scheduled to be executed at an interval defined by a CRON expression
- scheduled to be executed once at a specific time in the future

Most maintenance tasks are run by jobs using a CRON expression.

The problem is that maintenance tasks might depend on each other so one would want them to be executed as a sequence to be sure that a certain update has already been completed before a job relying on this update runs.
The only option available currently is to space such jobs more than enough time apart.
Still this is no guarantee and the gaps might not be enough with growing data volumes.
Also spacing jobs apart soon reaches the point where the maintenance window for a day is completely filled.

Solution / Proposal

Edit: Everything below might be outdated based on a new idea to link/add jobs to a sequence by having them refer to the job that runs before them by ID. In short, the first member of a queue would have the queue name (new property) and be triggered by whatever trigger it has (cron, time, ...). All other jobs would then refer to their predecessor by ID (new property). Jobs with a reference are essentially reference triggered. Their other trigger configuration (cron, time, ...) is kept internally but not used (and hidden in the UI); it becomes active again should the user remove them from the queue by clearing the reference.
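To make the reference-based idea a bit more concrete, here is a minimal sketch of how a sequence could be resolved from such references. The record `Job` and the names `queueName` and `runsAfterId` are hypothetical stand-ins for the two new properties, and the sketch assumes each job has at most one successor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PredecessorChain {
    /**
     * Hypothetical minimal view of a job configuration: only the first member
     * of a queue carries queueName, all others carry runsAfterId.
     */
    record Job(String id, String queueName, String runsAfterId) {}

    /** Returns the job IDs of the named queue in execution order. */
    static List<String> sequenceOf(String queueName, List<Job> jobs) {
        Map<String, Job> successorOf = new HashMap<>();
        Job head = null;
        for (Job job : jobs) {
            if (job.runsAfterId() != null) {
                // reference triggered: runs after its predecessor
                successorOf.put(job.runsAfterId(), job);
            } else if (queueName.equals(job.queueName())) {
                // first member: triggered by its own cron/time trigger
                head = job;
            }
        }
        List<String> order = new ArrayList<>();
        // follow the references from the head; the size check is only a safety guard
        for (Job job = head; job != null && order.size() < jobs.size(); job = successorOf.get(job.id())) {
            order.add(job.id());
        }
        return order;
    }
}
```

Clearing `runsAfterId` on a job would drop it (and anything referring to it) from the resolved sequence, which matches the idea that such a job falls back to its own trigger.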

A solution to this problem that only requires smaller changes in API, UI and scheduling implementation is to introduce a single common queue. Jobs (configurations) could get a flag to switch between executing right away when it is their time or entering the queue when it is their time. A job that enters the queue is executed directly if the queue is empty. If there are items in the queue, all items already in the queue need to complete before the added item is started.

For example, currently jobs A, B, C might be dependent on each other and are therefore spaced apart to ensure proper sequence.
A runs midnight every day, B runs at 2am, C at 4am. The 2h gap was chosen to be safe, not because the jobs require 2h each.

In the new setup all configurations of A, B and C get flagged to enter the queue.
Also B's time is changed to 0:20am and C's time is changed to 0:21am.
As before A runs at 0:00. Should A still be running when B enters the queue at 0:20, B's execution is delayed until A is done. Otherwise B starts right away at 0:20. With C entering shortly after B at 0:21 it is unlikely that B is done, so C will wait for B to complete before it is executed.
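The timing in this example can be worked through with a small simulation. This is only an illustration of the queue rule (start when entering, or when the job ahead finishes, whichever is later); the class, the record names and the job durations are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class CommonQueueTimeline {
    /** A job entering the queue at a minute of the day, needing some minutes to run (hypothetical type). */
    record Entry(String name, int entersAtMinute, int durationMinutes) {}

    /** When a job actually started and finished. */
    record Run(String name, int startMinute, int endMinute) {}

    /**
     * Entries must be ordered by the time they enter the queue. Each job starts
     * when it enters or when the job ahead of it ends, whichever is later.
     */
    static List<Run> simulate(List<Entry> entries) {
        List<Run> runs = new ArrayList<>();
        int queueFreeAt = 0;
        for (Entry entry : entries) {
            int start = Math.max(entry.entersAtMinute(), queueFreeAt);
            int end = start + entry.durationMinutes();
            runs.add(new Run(entry.name(), start, end));
            queueFreeAt = end;
        }
        return runs;
    }
}
```

Assuming A takes 35 minutes, B 10 and C 5, `simulate` with entries at minutes 0, 20 and 21 gives start minutes 0, 35 and 45: B and C each wait for the job ahead of them. If A only took 10 minutes, B would start right away at minute 20.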

~~Using short or large gaps between jobs users can still distribute jobs in the time-frame but make sure at the same time they will always run to completion in a certain sequence.~~

Implementation Notes

JobConfiguration needs a new boolean enterQueue = false (default) to allow configuring jobs to use the queue system.
The UI needs a new checkbox when creating/editing job configurations.

The JobStatus enum gets a new status QUEUING which is used when a job wants to execute, but has to wait in the queue.
As soon as it runs its status changes to RUNNING as before.

The queue itself is a new scheduling mechanism that implements the Function<Runnable, Future<?>> accepted by the following method in DefaultSchedulingManager (https://github.com/dhis2/dhis2-core/blob/master/dhis-2/dhis-services/dhis-service-core/src/main/java/org/hisp/dhis/scheduling/DefaultSchedulingManager.java#L122):

void scheduleTask( JobConfiguration configuration, Function<Runnable, Future<?>> scheduler )

The implementation of this would still use scheduleCronBased or scheduleFixedDelayBased but the Runnable task given to these would not be executing the job but entering it into the queue passing along the original task.
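As a rough illustration of that indirection: the scheduler function is wrapped so that what gets scheduled is "add the original task to the queue" rather than the task itself. `viaQueue` is a hypothetical helper name, and a plain `java.util.Queue` stands in for the actual queue here.

```java
import java.util.Queue;
import java.util.concurrent.Future;
import java.util.function.Function;

final class EnqueueInsteadOfRun {
    /**
     * Wraps a scheduler (e.g. a cron based one) so that, when the trigger fires,
     * the original task is entered into the queue instead of being executed.
     */
    static Function<Runnable, Future<?>> viaQueue(
            Function<Runnable, Future<?>> scheduler, Queue<Runnable> queue) {
        return originalTask -> scheduler.apply(() -> queue.add(originalTask));
    }
}
```

The returned function has the same shape as the one passed to scheduleTask, so from the caller's point of view nothing changes except that firing the trigger only enqueues.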

Once a job enters the queue or a queue job finishes the queue will check if it can run the next in line.

Jobs waiting in the queue would still be considered as "scheduled" and be subject to cancellation as usual.
The queue is merely one level of indirection potentially delaying running the original task of the job.

The queue itself would be a ConcurrentLinkedQueue (a non-blocking thread-safe queue) holding the items that are currently queuing as well as the one currently running through the queue.
Execution of the queue task is run in the same way an ad-hoc job runs, using the AsyncTaskExecutor.
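A minimal sketch of such a queue could look like the following. It is not the proposed implementation: the class and method names are hypothetical, a plain `java.util.concurrent.Executor` stands in for the AsyncTaskExecutor, and unlike the description above the running item is tracked with a flag instead of staying in the queue, which keeps the sketch free of races.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

final class SequentialJobQueue {
    private final Queue<Runnable> waiting = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean busy = new AtomicBoolean(false);
    private final Executor executor;

    SequentialJobQueue(Executor executor) {
        this.executor = executor;
    }

    /** Called when it is a job's time: the task waits until all earlier tasks are done. */
    void enter(Runnable task) {
        waiting.add(task);
        runNextIfIdle();
    }

    /** Checked both when a job enters and when a queue job finishes. */
    private void runNextIfIdle() {
        if (!busy.compareAndSet(false, true)) {
            return; // another task is running; it triggers the next one when done
        }
        Runnable next = waiting.poll();
        if (next == null) {
            busy.set(false);
            // a task may have entered between poll() and set(false), so check again
            if (!waiting.isEmpty()) {
                runNextIfIdle();
            }
            return;
        }
        executor.execute(() -> {
            try {
                next.run();
            } finally {
                busy.set(false);
                runNextIfIdle();
            }
        });
    }
}
```

With `new SequentialJobQueue(Runnable::run)` the tasks run inline, which is handy for trying out the ordering; a task entered while another one is running only starts after that one has finished.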

The core idea of integrating a queue this way is to avoid having to change the more complex layers around actual task execution.
The queue just modifies how a task finally arrives at the essential execution method DefaultSchedulingManager#runIfPossible. Apart from a new JobStatus#QUEUING that needs to be maintained as a job goes through the queue, all updates from the point of running a task remain the same.

Limitations

A single common queue allows modelling sequences of jobs. But all jobs entering the queue are now interdependent even if the work they do is not. It is up to the user to have them enter the queue in a proper sequence to prevent unrelated work from blocking or delaying a multi-step target. In practice this only means users have to plan out when during the maintenance window jobs enter the queue, and which jobs should participate in queuing at all rather than running when their time has come.

In theory this can prevent parallelism of independent jobs, but in practice it should almost always be a good idea to limit the workload of the system to a single maintenance task at a time. Using the time of entering, users can model their preferred overall sequence, and by entering jobs in short order they can ensure that all jobs that go through the queue are completed in the shortest time possible (without utilizing parallel execution of jobs).

Variants

The only real downside of the single common queue is that it might prevent independent jobs from running in parallel, as they now all become part of the same sequence.

An alternative is to replace boolean enterQueue with String queue in the job configuration.
This way a user can have any number of named queues to organize sequences of jobs that stay independent to the sequences modelled by other queues. All jobs that should share a queue simply use the same queue name.
The only downside of this is that multiple queues are slightly more complex to implement and much more complex to visualise in a UI.
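To give a feel for the named-queue variant, here is a small sketch. Note that it swaps in a different mechanism than the one described under Implementation Notes: one single-threaded ExecutorService per queue name, simply because that is the shortest way to get "in order within a queue, independent across queues". The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class NamedQueues {
    // one worker per queue name: tasks of the same queue run one after another
    private final Map<String, ExecutorService> queues = new ConcurrentHashMap<>();

    /** Enters a task into the queue with the given name, creating the queue on first use. */
    Future<?> enter(String queueName, Runnable task) {
        return queues
            .computeIfAbsent(queueName, name -> Executors.newSingleThreadExecutor())
            .submit(task);
    }

    /** Lets already entered tasks finish and stops accepting new ones. */
    void shutdown() {
        queues.values().forEach(ExecutorService::shutdown);
    }
}
```

Jobs entered under the same name keep their order, while jobs in differently named queues may run in parallel.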

It might also be less invasive not to add a new JobStatus#QUEUING but instead to add a new boolean queuing field in JobConfiguration and still consider a queuing job as being in status JobStatus#SCHEDULED. This might save us from unintentionally filtering out jobs, because what was SCHEDULED would otherwise be either SCHEDULED or QUEUING. The additional flag is easier to treat as an addition when transitioning the UI and other dependent parts.
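The filtering concern can be shown with a tiny example. The enum and record below are simplified, hypothetical stand-ins for JobStatus and JobConfiguration, and `scheduledNames` represents any existing code that selects jobs by status SCHEDULED.

```java
import java.util.List;

final class QueuingFlagVariant {
    /** Simplified stand-in for JobStatus, including the proposed QUEUING value. */
    enum Status { SCHEDULED, QUEUING, RUNNING }

    /** Simplified stand-in for a job configuration with the alternative boolean flag. */
    record Job(String name, Status status, boolean queuing) {}

    /** Existing-style filter that only knows about SCHEDULED. */
    static List<String> scheduledNames(List<Job> jobs) {
        return jobs.stream()
            .filter(job -> job.status() == Status.SCHEDULED)
            .map(Job::name)
            .toList();
    }
}
```

With a separate status, a waiting job `new Job("B", Status.QUEUING, false)` silently drops out of `scheduledNames`. With the flag variant, `new Job("B", Status.SCHEDULED, true)` is still included, and only code that cares about queuing needs to look at the flag.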

ismay · 2022-09-15T13:24:00Z

ismay
Sep 15, 2022

Thanks for the detailed writeup. I'm thinking it might be helpful to see if there's anything out there that could serve as an example of what we're implementing. Is there anything out there that serves a similar purpose and does it well in your opinion? I assume we're not the first ones to run into this so it'd be good to see how others have solved it.

Since this has an effect on the ui it'd probably be good to involve Joe as well.

5 replies

ismay Sep 15, 2022

A quick search yielded this: https://github.com/OptimalBits/bull, and this frontend for it: https://docs.taskforce.sh/connections/queues. The last link has some screenshots of their queueing ui. It's a little different in its usecase, but might provide some inspiration.

jbee Sep 15, 2022
Collaborator Author

Looking at queue processing solutions is good for inspiration but also dangerous. Generic solutions have wide feature sets and make generic assumptions because they cannot know how a user might want to use the tool. The idea I describe sits at the opposite end of that axis. It is mainly designed to fit our needs while still playing nicely with what we already have. Not adding more complexity is pretty important because the scheduler already has a difficult task to solve. For example, a generic queue system will offer to pause/resume tasks. This is nice, but we don't really need it, nor would we have a reasonable way of implementing it.

What we can take from these is how they might visualize the tasks in a queue. Or we might ask if certain features would be reasonable for us. Would we want to be able to manually add items to the queue? Similar to ad-hoc execution but instead of running the task directly we add it directly to a queue.

I hope my suggestion is not misunderstood as the ambition to build a full blown queue-scheduling solution with all its glory.
What I describe is still a fairly small change in the backend, maybe on the order of 100 LOC. To me this is important with a design that it does not fight what exists but goes along with it saving us from future headaches. Getting too excited about features tends to push in the other direction and overwhelm with complexity that is added without a real need.

ismay Sep 15, 2022

> What we can take from these is how they might visualize the tasks in a queue. Or we might ask if certain features would be reasonable for us. Would we want to be able to manually add items to the queue? Similar to ad-hoc execution but instead of running the task directly we add it directly to a queue.

Exactly, I meant it more as a way of getting inspiration for what others have done to solve similar problems.

> I hope my suggestion is not misunderstood as the ambition to build a full blown queue-scheduling solution with all its glory.
> What I describe is still a fairly small change in the backend, maybe on the order of 100 LOC. To me this is important with a design that it does not fight what exists but goes along with it saving us from future headaches. Getting too excited about features tends to push in the other direction and overwhelm with complexity that is added without a real need.

No not at all. I understand you want to keep the change small, and that's also what I'd personally prefer. It's just that your proposal is phrased in a very technical manner, and I'm approaching it a little more from the design/frontend perspective. I'm trying to see what would make sense for the design/frontend and see whether your proposal would fit that. For that initial big picture for me it helps to see what's out there in the wild. Plus involving someone from design serves that same purpose.

jbee Sep 15, 2022
Collaborator Author

💯 I am happy you get involved.

Just want to keep this down to earth and low impact driving it from what is easy to do while reaching the goal rather than nice to have.
This is why I started thinking about the technical side but I am glad about any input and collaboration especially related to UI/UX.

The main reason I suggested the single common queue is to keep the UI simple and allow an easy transition. With just one queue we could just add a bit more info to the list entries we already have, and a user could infer the "queue" from that. Not the nicest, but low effort, and it might do the job just fine. It would be cool to actually show the queue in a more visual form, but unless there is a clear push from some direction to take it that far I would not think it is strictly needed.

ismay Sep 27, 2022

Sorry for the late response, I was off for a bit. Yeah no I understand what you mean. I think my point is something I'm also raising because I've noticed that we've developed a couple of scheduler features, and it feels like both design and frontend are involved fairly late in the process. That makes it a little harder from my perspective to give useful feedback, as it feels like things have already consolidated a fair bit. Not meant in a personal way btw., I appreciate that you're giving us a heads up. Just meant to share in what way I personally feel our collaboration would be most productive.

cjmamo · 2022-09-16T07:59:45Z

cjmamo
Sep 16, 2022
Collaborator

> Still this is no guarantee and the gaps might not be enough with growing data volumes.
> Also spacing jobs apart soon reaches the point where the maintenance window for a day is completely filled.

IMO it makes sense, but are we side-stepping the real problem, at least long-term? I mean, however unlikely it becomes, if the data volumes keep on growing, can the jobs not still run over the maintenance window even with this solution? If that's the case, another question is whether we should bite the bullet and rethink the processing model used for some of the data (e.g., batch processing program indicators) rather than introducing band-aids. I'm commenting from my ivory tower so feel free to ignore.

1 reply

jbee Sep 26, 2022
Collaborator Author

Increasing execution times as a consequence of increasing data volume is an independent problem. If the actual work of a job is made faster, the scheduling benefits. The other way around, scheduling cannot solve this problem no matter how it is done, since not running certain things daily is not an option and neither is running them in parallel. We have more or less built the system on the assumption that we can run the jobs at least daily in a certain sequence.
