Fix deadlock in consumer group handleError #1581
@matthewloring these changes look reasonable. Are you able to sign the CLA?
I already have. I didn't have permission to re-run the CI after signing.
@matthewloring ah OK, it should re-run checks if you click on the
That started the Travis CI but not the CLA check.
Anything else I can do to re-run the CLA bot?
I've re-run the CLA check.
PR looks good to me. I’m away for the rest of the week, please feel free to merge at will. 🙏
The deadlock can happen if the heartbeat request runs out of retries while the consumer group session is being released. `consumerGroup.Consume` will acquire the consumer group lock and then call `consumerGroupSession.release`, which will block waiting for the channel `hbDead` to be closed. If the `heartbeatLoop` then runs out of retries and calls `consumerGroup.handleError`, the `handleError` call will block on the lock, preventing `heartbeatLoop` from closing `hbDead`.

It is safe to remove the lock acquisition in `handleError` because the channel operations in the critical section are already thread safe.

I also moved the lock/unlock from `Close` into `leave` to more accurately reflect the resources being protected.

Example stack traces for the deadlock: