-
Thanks for the detailed writeup. I'm thinking it might be helpful to see if there's anything out there that could serve as an example of what we're implementing. Is there anything that serves a similar purpose and does it well in your opinion? I assume we're not the first ones to run into this so it'd be good to see how others have solved it. Since this has an effect on the UI it'd probably be good to involve Joe as well.
-
IMO it makes sense, but are we side-stepping the real problem, at least long-term? I mean, however unlikely it becomes, if the data volumes keep on growing then the jobs can still run past the maintenance window even with this solution? If that's the case, another question is whether we should bite the bullet and rethink the processing model used for some of the data (e.g., batch processing program indicators) rather than introducing band-aids. I'm commenting from my ivory tower so feel free to ignore.
-
Background / Problem
At the moment jobs are
Most maintenance tasks are run by jobs using a CRON expression.
The problem is that maintenance tasks might depend on each other so one would want them to be executed as a sequence to be sure that a certain update has already been completed before a job relying on this update runs.
The only option available currently is to space such jobs more than enough time apart.
Still this is no guarantee and the gaps might not be enough with growing data volumes.
Also spacing jobs apart soon reaches the point where the maintenance window for a day is completely filled.
Solution / Proposal
Edit: Everything below might be outdated based on a new idea to link/add jobs to a sequence by having them refer to the job that runs before them by ID. In short, the first member of a queue would have the queue name (new property) and be triggered by whatever trigger it has (cron, time, ...). All other jobs would then refer to their predecessor by ID (new property). Jobs with a reference are essentially reference-triggered. Their other trigger configuration (cron, time, ...) is kept internally but is not used (and is hidden in the UI); it becomes active again should the user remove them from the queue by clearing the reference.
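To make the reference idea concrete, here is a minimal sketch of how the execution order of a sequence could be derived from predecessor references. The `Job` record and its field names (`queueName`, `predecessorId`) are hypothetical stand-ins for the two new properties, and the sketch assumes every job has at most one successor.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not DHIS2 code. */
final class JobSequences {
    /** queueName is set on the first member only; predecessorId on all other members. */
    record Job(String id, String queueName, String predecessorId) {}

    /** Returns the IDs of the jobs in the named sequence, in execution order. */
    static List<String> order(String queueName, Collection<Job> jobs) {
        Map<String, Job> successorOf = new HashMap<>();
        Job current = null;
        for (Job job : jobs) {
            if (job.predecessorId() != null) {
                // assumption: at most one job refers to any given predecessor
                successorOf.put(job.predecessorId(), job);
            } else if (queueName.equals(job.queueName())) {
                current = job; // head of the sequence, started by its own trigger
            }
        }
        List<String> ordered = new ArrayList<>();
        while (current != null) {
            ordered.add(current.id());
            current = successorOf.get(current.id());
        }
        return ordered;
    }
}
```

Jobs that carry neither a queue name nor a reference are simply ignored by `order`, which matches the idea that they keep running on their own trigger.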
A solution to this problem that only requires smaller changes in API, UI and scheduling implementation is to introduce a single common queue. Jobs (configurations) could get a flag to switch between executing right away when it is their time or entering the queue when it is their time. A job that enters the queue is executed directly if the queue is empty. If there are items in the queue, all items already in the queue need to complete before the added item is started.
For example, currently jobs A, B, C might be dependent on each other and are therefore spaced apart to ensure proper sequence.
A runs at midnight every day, B runs at 2am, C at 4am. The 2h gap was chosen to be safe, not because the jobs require 2h each.
In the new setup all configurations of A, B and C get flagged to enter the queue.
Also B's time is changed to 0:20am and C's time is changed to 0:21am.
As before, A runs at 0:00. Should A still be running when B enters the queue at 0:20, B's execution is delayed until A is done. Otherwise B will start right away at 0:20. With C entering shortly after B at 0:21, it is unlikely that B is done, so C will wait for B to complete before it is executed.
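The A, B, C example can be checked with a small simulation: each job starts at the later of its entry time and the moment the queue becomes free. This is only a sketch; the class name and the durations are made up for illustration, and times are minutes after midnight.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that computes when queued jobs would start. */
final class QueueTimeline {
    record Entry(String name, int entersAt, int duration) {}
    record Slot(String name, int startsAt, int endsAt) {}

    /** Entries must be ordered by the time they enter the queue. */
    static List<Slot> simulate(List<Entry> entries) {
        List<Slot> slots = new ArrayList<>();
        int queueFreeAt = 0;
        for (Entry e : entries) {
            // start right away if the queue is empty, otherwise wait for the job ahead
            int start = Math.max(e.entersAt(), queueFreeAt);
            int end = start + e.duration();
            slots.add(new Slot(e.name(), start, end));
            queueFreeAt = end;
        }
        return slots;
    }

    public static void main(String[] args) {
        // A enters at 0:00, B at 0:20, C at 0:21; the durations are assumptions
        simulate(List.of(
                new Entry("A", 0, 30),
                new Entry("B", 20, 15),
                new Entry("C", 21, 10)))
            .forEach(System.out::println);
    }
}
```

With these assumed durations A is still running at 0:20, so B starts at 0:30 when A ends, and C starts at 0:45 when B ends. If A took only 10 minutes, B would start at its entry time of 0:20.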
By using short or large gaps between jobs, users can still distribute jobs within the time frame while making sure they always run to completion in a certain sequence.
Implementation Notes
JobConfiguration needs a new boolean enterQueue = false (default) to allow configuring jobs to use the queue system.
The UI needs a new checkbox when creating/editing job configurations.
The JobStatus enum gets a new status QUEUING which is used when a job wants to execute, but has to wait in the queue.
As soon as it runs its status changes to RUNNING as before.
The queue itself is a new scheduling mechanism that implements the Function<Runnable, Future<?>> in DefaultSchedulingManager (https://github.com/dhis2/dhis2-core/blob/master/dhis-2/dhis-services/dhis-service-core/src/main/java/org/hisp/dhis/scheduling/DefaultSchedulingManager.java#L122).
The implementation of this would still use scheduleCronBased or scheduleFixedDelayBased but the Runnable task given to these would not be executing the job but entering it into the queue, passing along the original task.
Once a job enters the queue or a queued job finishes, the queue will check if it can run the next in line.
Jobs waiting in the queue would still be considered as "scheduled" and be subject to cancellation as usual.
The queue is merely one level of indirection potentially delaying running the original task of the job.
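The indirection described above could look roughly like the following sketch. It is not the DHIS2 implementation: the class and method names are hypothetical, a plain `java.util.concurrent.Executor` stands in for the task executor, and, as a simplification, the running item is removed from the queue when it starts rather than kept in it.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of a single common queue that runs one task at a time. */
final class SequentialJobQueue {
    private final Queue<Runnable> waiting = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean busy = new AtomicBoolean(false);
    private final Executor executor; // assumption: stands in for the real task executor

    SequentialJobQueue(Executor executor) {
        this.executor = executor;
    }

    /** Wraps the original task; the scheduler runs the wrapper, which only enters the queue. */
    Runnable queued(Runnable originalTask) {
        return () -> enter(originalTask);
    }

    void enter(Runnable task) {
        waiting.add(task);
        runNextIfIdle();
    }

    /** Called whenever a task enters the queue or a queued task finishes. */
    private void runNextIfIdle() {
        if (!busy.compareAndSet(false, true)) {
            return; // another task is running; it triggers the next one when done
        }
        Runnable next = waiting.poll();
        if (next == null) {
            busy.set(false);
            // a task may have entered between poll() and set(false), so check again
            if (!waiting.isEmpty()) {
                runNextIfIdle();
            }
            return;
        }
        executor.execute(() -> {
            try {
                next.run();
            } finally {
                busy.set(false);
                runNextIfIdle();
            }
        });
    }
}
```

A cron-scheduled job would then be registered with `queue.queued(originalTask)` instead of `originalTask`, which keeps the scheduling layer itself unchanged. Status updates (such as marking a job as queuing) and cancellation are left out of the sketch.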
The queue itself would be a ConcurrentLinkedQueue (a non-blocking thread-safe queue) holding the items that are currently queuing as well as the one currently running through the queue.
Execution of the queue task is run in the same way an ad-hoc job runs, using the AsyncTaskExecutor.
The core idea of this way of integrating a queue is to not have to change the more complex layers around actual task execution.
The queue just modifies how a task finally arrives at the essential execution method DefaultSchedulingManager#runIfPossible. Apart from having a new JobStatus#QUEUING that needs to be maintained as a job goes through the queue, all updates from the point of running a task remain the same.
Limitations
A single common queue allows modelling sequences of jobs. But all jobs entering the queue are now interdependent even if the work they do is not. It is up to the user to have them enter the queue in a proper sequence to prevent unrelated work from blocking or delaying a multi-step target. In practice this only means users have to plan out when during the maintenance window jobs enter the queue, and which jobs should participate in queuing at all rather than running when their time has come.
In theory this can prevent parallelism of independent jobs, but in practice it should almost always be a good idea to limit the workload of the system to a single maintenance task at a time. Using the time of entering, users can model their preferred overall sequence, and by having jobs enter in short order they can ensure that all jobs going through the queue are completed in the shortest time possible (without utilizing parallel execution of jobs).
Variants
The only real downside of the single common queue is that it might prevent independent jobs from running in parallel as they now all become part of the same sequence.
An alternative is to replace boolean enterQueue with String queue in the job configuration.
This way a user can have any number of named queues to organize sequences of jobs that stay independent of the sequences modelled by other queues. All jobs that should share a queue simply use the same queue name.
The only downside is that the multi-queue variant is slightly more complex to implement and much more complex to visualise in a UI.
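As a rough illustration of the named-queue variant, the sketch below keeps one chain of tasks per queue name, so jobs in the same queue run in order while different queues stay independent. The class is hypothetical, and chaining `CompletableFuture`s is just one possible way to do this, not what the proposal prescribes.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

/** Hypothetical sketch: one independent sequence of tasks per queue name. */
final class NamedJobQueues {
    // the last task entered into each queue; the next task is chained after it
    private final Map<String, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();
    private final Executor executor; // assumption: stands in for the real task executor

    NamedJobQueues(Executor executor) {
        this.executor = executor;
    }

    /** Enters the task into the named queue; it runs once all earlier tasks of that queue are done. */
    CompletableFuture<Void> enter(String queue, Runnable task) {
        return tails.compute(queue, (name, tail) -> {
            CompletableFuture<Void> previous =
                tail == null ? CompletableFuture.completedFuture(null) : tail;
            // handle(...) swallows a failure so that one failed job does not block the rest
            return previous.handle((result, error) -> null).thenRunAsync(task, executor);
        });
    }
}
```

Whether a failed job should let its successors run, as in this sketch, or stop the sequence is a design decision the proposal leaves open.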
It might also be less invasive to not add a new JobStatus#QUEUING but to instead add a new boolean queuing field in JobConfiguration and to still consider a queuing job as being in status JobStatus#SCHEDULED. This might save us from unintentionally filtering out jobs, because what was SCHEDULED would now be either SCHEDULED or QUEUING. The additional flag is easier to see as an addition when transitioning the UI and other dependent parts.