
Worker can get caught in infinite loop when redis connection is closed unexpectedly #389

Closed
airhorns opened this issue Feb 4, 2021 · 4 comments

airhorns commented Feb 4, 2021

Because of https://github.com/taskforcesh/bullmq/blob/master/src/classes/worker.ts#L192-L195, I am seeing workers get caught in an infinite loop: they try to get the next job, error, and then immediately try to get the next job again.

The loop is this (a simplified sketch follows the list):

  • Redis connection is closed somehow (I assume my own code is doing this, but I don't have a smoking gun yet)
  • The worker is not closed explicitly, so this.closing inside the worker is undefined
  • The worker run loop calls getNextJob asynchronously, which calls waitForJob, which calls BRPOPLPUSH
  • The Redis client throws synchronously, which interrupts execution of waitForJob; the error is caught in getNextJob and swallowed
  • The getNextJob call wins the Promise.race in the worker run loop
  • The getNextJob call returns nothing, so the worker doesn't work a job
  • The worker run loop repeats the process
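
To make the cycle concrete, here is a minimal TypeScript sketch of it. The names mirror the Worker internals mentioned above (run, getNextJob, waitForJob), but the bodies are paraphrased illustrations, not the real implementation, and the BlockingClient type is just what the sketch needs:

```ts
// Hypothetical, heavily simplified paraphrase of the cycle described above.
type BlockingClient = {
  brpoplpush(source: string, destination: string, timeout: number): Promise<string | null>;
};

async function waitForJob(client: BlockingClient) {
  // The blocking call; on a closed connection the client fails here
  // with "Connection is closed."
  return client.brpoplpush('wait', 'active', 5);
}

async function getNextJob(client: BlockingClient): Promise<string | null> {
  try {
    return await waitForJob(client);
  } catch (err) {
    // The swallow described above: the error never propagates,
    // so the caller just sees "no job".
    return null;
  }
}

async function run(client: BlockingClient, closing?: Promise<void>) {
  while (!closing) {
    // Settles almost immediately with null when the connection is closed.
    const job = await getNextJob(client);
    if (job) {
      // ...process the job
    }
    // Nothing blocks or backs off, so the loop spins as fast as the
    // microtask queue allows.
  }
}
```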

I am not sure why that error is being swallowed, or why the "Connection is closed." message is treated specially. If I had to guess, special care has to be taken on blocking calls like that to handle connection closes that we do expect. In this case though, the worker has not been explicitly closed, and it's being asked to use a closed Redis connection, which I think should be an error that at least gets emitted and maybe takes down the process if unhandled.
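
To illustrate the semantics I'm suggesting (only a sketch of the suggestion, not whatever fix ends up landing): swallow the connection error only while the worker is actually closing, and emit/rethrow otherwise. This reuses the hypothetical waitForJob and BlockingClient from the sketch above:

```ts
import { EventEmitter } from 'events';

// Hypothetical error handling for getNextJob, sketching the
// "emit unless we are closing" semantics suggested above.
class SketchWorker extends EventEmitter {
  closing?: Promise<void>;

  async getNextJob(client: BlockingClient): Promise<string | null> {
    try {
      return await waitForJob(client);
    } catch (err) {
      if (this.closing) {
        // Expected: we asked the connection to go away, so a
        // closed-connection error is fine to ignore.
        return null;
      }
      // Unexpected: surface it so an 'error' listener (or the process,
      // if unhandled) can react instead of looping silently.
      this.emit('error', err);
      throw err;
    }
  }
}
```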

Happy to do up a PR if someone can tell me what the semantics should be!

airhorns commented Feb 4, 2021

Also, I think this is actually the root cause of #359, not pausing! The event-loop starvation shows up when my test timeouts (or whatever else is on a timer) never fire, because this async-but-infinite loop runs at a higher priority in Node's tick order, I think (see the sketch below).
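
Here's a standalone Node illustration of that starvation (not BullMQ code): awaiting an already-settled promise only queues a microtask, and the microtask queue is drained before the event loop ever reaches the timers phase, so a loop like this keeps setTimeout callbacks from firing:

```ts
// The timeout below never fires: the async loop keeps the microtask
// queue non-empty, and timers only run once that queue is drained.
async function spin() {
  while (true) {
    await Promise.resolve(); // re-queues a microtask on every iteration
  }
}

setTimeout(() => console.log('this never runs'), 100);
void spin();
```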

manast commented Feb 4, 2021

I am working on a fix for this. This issue exists in bull v3 and I expect to have a fix by tomorrow.

manast closed this as completed in d05566e on Feb 7, 2021
github-actions bot pushed a commit that referenced this issue Feb 7, 2021
## [1.14.3](v1.14.2...v1.14.3) (2021-02-07)

### Bug Fixes

* **worker:** avoid possible infinite loop fixes [#389](#389) ([d05566e](d05566e))
github-actions bot commented Feb 7, 2021

🎉 This issue has been resolved in version 1.14.3 🎉

The release is available on:

Your semantic-release bot 📦🚀

airhorns commented Feb 7, 2021

Thanks @manast !
