[FEATURE] Guarantee that jobs will not miss execution when moving shards #173

downsrob · 2022-04-21T23:07:45Z

Is your feature request related to a problem?
The job scheduler uses a consistent hash function to build a hash ring that determines which job is assigned to which node. On cluster events such as a node being added or removed, or a shard relocating (routing table update), all data nodes are notified and refresh the hash ring to schedule or deschedule jobs on the local node accordingly. During this refresh there is a window in which the jobs have been descheduled on the old node, but the job scheduler on the new node has not yet swept the shard and rescheduled them for their next execution. If a job was supposed to execute in that gap between the deschedule and the reschedule, the execution is skipped. Because jobs are rescheduled by sweeping the entire shard again, the more jobs there are on the shard, the longer the gap and the greater the chance of a missed execution.
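
To make the ownership change concrete, here is a minimal sketch of a consistent hash ring, not the plugin's actual implementation. The node names, the MD5-based hash, the virtual-node count, and the `HashRing` class are all assumptions made for illustration. It shows that when a node joins, some jobs change owner, and those are the jobs that would pass through the deschedule/reschedule gap described above.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Illustrative hash only; the real scheduler may use a different function.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def owner(self, job_id: str) -> str:
        # A job belongs to the first ring point clockwise from its own hash.
        idx = bisect.bisect_right(self._hashes, _hash(job_id)) % len(self._points)
        return self._points[idx][1]


jobs = [f"job-{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b"])
after = HashRing(["node-a", "node-b", "node-c"])  # node-c joins the cluster

# Jobs whose owner changed must be descheduled on the old node and
# rescheduled on the new one; this is where an execution can be skipped.
moved = [j for j in jobs if before.owner(j) != after.owner(j)]
print(f"{len(moved)} of {len(jobs)} jobs changed owner")
```

With consistent hashing only a fraction of the jobs move, and they all move to the newly added node, but every one of them is exposed to the gap until the new owner finishes its sweep.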

What solution would you like?

Update the documentation to describe the ways a job can skip an execution, for future developers.
Look into providing a stricter guarantee that a job will not “miss” an execution, i.e. it might be delayed, but it will still execute to make up for the missed run as long as that run has not yet overlapped with the next scheduled one.
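
One way the "delayed but not missed" guarantee could look is sketched below. It assumes the last execution time of each job is persisted somewhere the new node can read, which this issue does not confirm; `Decision` and `plan_after_reschedule` are hypothetical names, not part of the job scheduler API. When the new owner picks up a job, it checks whether a run fell due during the gap. If so, it runs once immediately as a make-up execution and resumes the regular cadence. Any older runs that have already been overlapped by a later one are dropped rather than replayed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Decision:
    run_now: bool       # True if a make-up execution should fire immediately
    next_run: datetime  # next regularly scheduled execution
    dropped: int        # older runs that were overlapped and are not replayed


def plan_after_reschedule(last_run: datetime, interval: timedelta, now: datetime) -> Decision:
    """Hypothetical catch-up rule applied when a job is rescheduled on a new node."""
    # Number of scheduled runs that fell due since the last recorded execution.
    # Clamped at zero in case `now` is behind `last_run` (clock skew).
    due = max(0, (now - last_run) // interval)
    if due == 0:
        # Nothing was missed during the gap; schedule as usual.
        return Decision(False, last_run + interval, 0)
    # The most recent due run has not been overlapped by its successor yet,
    # so run it late; earlier ones (due - 1 of them) are dropped.
    return Decision(True, last_run + (due + 1) * interval, due - 1)


last = datetime(2022, 4, 21, 12, 0)
every = timedelta(minutes=5)
print(plan_after_reschedule(last, every, datetime(2022, 4, 21, 12, 7)))
```

In the example the 12:05 run fell inside the gap, so the job would run late at 12:07 and then continue at 12:10. Keeping the original cadence, rather than restarting the interval from the make-up run, is a design choice of this sketch.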

What alternatives have you considered?
This isn't a new issue, and it is very rare for most use cases. It may not be necessary to guarantee that no executions are missed.


downsrob added the enhancement (New feature or request) label Apr 21, 2022

downsrob mentioned this issue May 12, 2022

Flaky tests opensearch-project/index-management#90

Open

peterzhuamazon added this to Engineering Effectiveness Board Jul 11, 2024

github-project-automation bot moved this to 🆕 New in Engineering Effectiveness Board Jul 11, 2024

getsaurabh02 moved this from 🆕 New to Backlog in Engineering Effectiveness Board Jul 18, 2024

peterzhuamazon moved this to 📦 Backlog in Engineering Effectiveness Board Dec 4, 2024
