Is your feature request related to a problem?
The job scheduler uses a consistent hash function to build a hash ring that determines which node each job is assigned to. On cluster events such as a node being added or removed, or a shard relocation (routing table update), all data nodes are notified and refresh the hash ring, scheduling or descheduling jobs on the local node accordingly. During this refresh there is a window in which a job has been descheduled on the old node but not yet rescheduled on the new one: the job scheduler on the new node has to sweep the shard before it reschedules jobs for their next execution. If a job was due to execute in the gap between the deschedule and the reschedule, that execution is skipped. Because rescheduling requires sweeping the entire shard again, the more jobs there are on the shard, the longer the gap and the greater the chance of a missed execution.
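To make the reassignment concrete, here is a minimal sketch of a consistent hash ring in Python. It is not the job scheduler's actual implementation: the `HashRing` class, the MD5-based `_hash` helper, the virtual-node count, and the node and job names are all assumptions chosen for illustration. It shows which jobs change owner when a node joins, which are the jobs exposed to the deschedule/reschedule gap described above.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring (illustrative choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random positions.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [position for position, _ in self._points]

    def owner(self, job_id: str) -> str:
        # A job belongs to the first ring point clockwise from its own hash.
        idx = bisect.bisect_right(self._keys, _hash(job_id)) % len(self._keys)
        return self._points[idx][1]


def moved_jobs(job_ids, old_nodes, new_nodes):
    """Return the jobs whose owning node differs between the two rings."""
    old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
    return [j for j in job_ids if old_ring.owner(j) != new_ring.owner(j)]


jobs = [f"job-{i}" for i in range(1000)]
old_nodes = ["node-a", "node-b"]
new_nodes = ["node-a", "node-b", "node-c"]
moved = moved_jobs(jobs, old_nodes, new_nodes)
print(f"{len(moved)} of {len(jobs)} jobs change owner when node-c joins")
```

Only the jobs returned by `moved_jobs` are descheduled on one node and rescheduled on another; any of them that is due to fire before the new owner finishes its sweep is a candidate for a skipped execution.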
What solution would you like?
Update the documentation to describe, for future developers, the ways a job can skip an execution.
Look into providing a stricter guarantee that a job will not “miss” an execution, i.e. it might be delayed, but it will still execute to make up for the missed one as long as it has not yet overlapped with the next scheduled execution.
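One way to read the stricter guarantee is as a catch-up rule applied when a job is rescheduled. The sketch below is an assumption, not an existing scheduler API: it supposes the last execution time is persisted with the job, and the function name `reschedule_decision` and its three outcomes are hypothetical.

```python
from datetime import datetime, timedelta


def reschedule_decision(last_run: datetime, interval: timedelta, now: datetime) -> str:
    """Decide what to do with a job at reschedule time (hypothetical rule).

    Returns one of:
      "on_schedule" - the next execution is still in the future, nothing was missed
      "catch_up"    - one execution was missed and the next is not yet due,
                      so run a delayed make-up execution now
      "skip"        - the missed execution already overlaps the next one,
                      so resume the regular schedule without a make-up run
    """
    expected = last_run + interval
    if now < expected:
        return "on_schedule"
    if now < expected + interval:
        return "catch_up"
    return "skip"


last_run = datetime(2024, 1, 1, 12, 0)
every_five = timedelta(minutes=5)
for minute in (3, 6, 11):
    now = datetime(2024, 1, 1, 12, minute)
    print(now.time(), reschedule_decision(last_run, every_five, now))
```

Under this rule a job is delayed by at most one interval instead of being silently dropped, at the cost of storing and reading the last execution time during the sweep.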
What alternatives have you considered?
This isn't a new issue, and is very rare for most use cases. It may be that it isn't necessary to guarantee no missed executions.