Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graceful shutdown for stalled jobs ( lock renewal ) #484

Open
DevBrent opened this issue Apr 2, 2017 · 4 comments
Open

Graceful shutdown for stalled jobs ( lock renewal ) #484

DevBrent opened this issue Apr 2, 2017 · 4 comments

Comments

@DevBrent
Copy link

DevBrent commented Apr 2, 2017

First of all sorry for muddying up the issue tracker, I'll be sure to help close two tickets to make up for this.

Active & Unlocked -> Queued
I see the code for stalled jobs and time-renewed locks. I don't see the process for checking the queue for active and unlocked jobs to migrate back into inactive/queued for another job consumer to take up.

Do you know where this code is located so I can review the process exactly?

Disabling auto-lock renewal
If we wanted to manually trigger the lock renewal, would that just involve setting the built-in lock renewal timer to a higher value, or would that also influence the stalled job monitor in a negative way? I'd ideally like each job's own event loop to control the renewal so if one job were to slow enormously it could lose the lock and another consumer could start over.

Lock renewal failure detection/notification
There is a TODO in the lock renewal code for notifying the consumer. This one is pretty important to us because running two jobs at once could trigger API rate limiting. Is there a best practice currently to detect the lock renewal failure? Promise cancellation comes into play here, but that's outside of the scope of the question and I'm already imagining the mess of code for handling that.

@manast
Copy link
Member

manast commented Apr 2, 2017

here comes the answers:
Active & Unlocked -> Queued
https://github.com/OptimalBits/bull/blob/master/lib/queue.js#L559
And the timer is started here: _this.startMoveUnlockedJobsToWait();
Disabling auto-lock renewal
There is no "public" way to do this, but you can hack around it by setting the LOCK_RENEW_TIME to infinite, and calling moveUnlockedJobsToWait manually.

Lock renewal failure detection/notification
Not sure I understand which notification you mean. There is an event 'stalled' that is emitted when a job has been detected as stalled, but I guess that is not the one you mean.

@DevBrent
Copy link
Author

DevBrent commented Apr 4, 2017

@manast my issue specifically regards the stalled job that is still running (just slowly). #308 has a similar issue, but theirs was resolved by simply not stalling meanwhile based on our legacy code quality I expect some of our jobs to stall, and I'd like to notice that and cancel execution within the job processing code.

I guess I could setup a listener on every worker monitoring for stalled jobs, but still how would I know which of the two jobs currently running is the stalled job?

Specifically either of these two locations are where I would expect my job processor to be able to know immediately when the job was unable to renew the lock so it could shut itself down.

// TODO: if we failed to re-acquire the lock while trying to renew, should we let the job

console.error('Error renewing lock ' + err);

I don't see any "stalled" or "lock" parameters inside the Job object, but ideally I would need to know as soon as possible if I'm stalled and another worker has started up to try again. It looks like within the job processor I could call Job.takeLock, and check if the result value was false, null, or a caught error then exit processing of the job.

@DevBrent DevBrent changed the title [Question] Lock renewals Graceful shutdown for stalled jobs ( lock renewal ) Apr 4, 2017
@DevBrent
Copy link
Author

DevBrent commented Apr 6, 2017

@manast #488 is enticing, but I could effectively stop my own processing with a simple notice or ability to check job.hasLock() intermittently to confirm I'm still ok to keep processing.

The reason would be I might want to gracefully shutdown e.g. delete temporary files and close database connections.

I don't necessarily need the memory segmentation and overhead of using IPC either. I like the option of running jobs as child processes though, in cases where I might be running unpredictable low quality code it would certainly give that code a better chance and isolate it's impacts.

@shaunwarman
Copy link

Bump, would also love to see a graceful shutdown in the PATTERNS section.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants