Rewrite the handling of process stalled jobs to be atomic so it doesn… #358

Closed

Conversation

bradvogel
Contributor

…'t double-process jobs. See #356 for more on how this can happen.

The change here is to use a Lua script to atomically iterate the active queue AND acquire the lock. This ensures that the job is actually in 'active' when it's being processed, which fixes #356. If a stalled job is found, the handler calls itself recursively until all stalled jobs have been processed. This strategy is also slightly more efficient than before, because the jobs are iterated entirely within a single script, reducing the back-and-forth with the Redis server.
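
A minimal sketch of how such a script might look, assuming an illustrative key layout and lock-key format (these are not Bull's actual internals):

```lua
-- Atomically scan the 'active' list and try to take the lock for any
-- job whose lock is missing, i.e. a stalled job. Because this all runs
-- inside one Redis script, the job cannot leave 'active' between the
-- membership check and the lock acquisition.
-- KEYS[1] = active list key
-- ARGV[1] = lock token, ARGV[2] = lock TTL in ms
local activeJobs = redis.call("LRANGE", KEYS[1], 0, -1)
for _, jobId in ipairs(activeJobs) do
  local lockKey = "lock:" .. jobId  -- hypothetical lock key format
  -- SET NX fails if another worker still holds the lock, so live jobs
  -- are skipped and only genuinely stalled jobs are claimed.
  if redis.call("SET", lockKey, ARGV[1], "PX", ARGV[2], "NX") then
    return jobId  -- this stalled job is now ours to process
  end
end
return nil  -- no stalled jobs found
```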

Additionally, this addresses a long-standing issue (#258) where a job could be considered "stalled" even though it had just been moved to 'active' and the worker that moved it had not yet obtained the lock. The fix is to wait a grace period (defaulting to LOCK_RENEW_TIME) before considering a job possibly stalled, which gives the (real) worker time to acquire its lock after moving the job to 'active'.
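
The grace-period test could be folded into the same script; here is a sketch of that check in isolation, assuming a hypothetical activatedAt field stamped on the job hash when it is moved to 'active':

```lua
-- Returns true if the job has been in 'active' longer than the grace
-- period (default LOCK_RENEW_TIME) and may therefore be stalled.
local function pastGracePeriod(jobKey, nowMs, graceMs)
  local activatedAt = tonumber(redis.call("HGET", jobKey, "activatedAt"))
  -- No timestamp recorded: fall back to treating the job as old enough.
  if not activatedAt then return true end
  return (nowMs - activatedAt) >= graceMs
end
```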

bradvogel added a commit to mixmaxhq/bull that referenced this pull request Oct 15, 2016
@bradvogel
Contributor Author

Also, per #311 (comment), should a stalled job be moved back to 'wait' to be picked up again, rather than processed immediately as this PR does? That would speed up processing these jobs immensely.

@bradvogel
Contributor Author

Closing this in favor of #359, which puts jobs back into the 'wait' queue. It's much faster.
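
For reference, the requeue approach boils down to an atomic move between two lists; a rough sketch (key layout illustrative, not the actual #359 implementation):

```lua
-- KEYS[1] = active list, KEYS[2] = wait list, ARGV[1] = stalled job id
-- Doing both steps in one script keeps the move atomic: no other client
-- can observe the job in both lists or in neither.
redis.call("LREM", KEYS[1], 0, ARGV[1])  -- drop the job from 'active'
redis.call("RPUSH", KEYS[2], ARGV[1])    -- requeue it for any idle worker
```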

@bradvogel bradvogel closed this Oct 15, 2016
@wearhere wearhere deleted the better-process-stalled-jobs branch October 16, 2016 07:10