Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock in consumer group handleError #1581

Merged
merged 1 commit into from
Jan 22, 2020

Conversation

matthewloring
Copy link
Contributor

The deadlock can happen if the heartbeat request runs out of retries while the consumer group session is being released. consumerGroup.Consume will acquire the consumer group lock and then call consumerGroupSession.release which will block waiting for the channel hbDead to be closed. If the heartbeatLoop then runs out of retries and calls consumerGroup.handleError the handleError call will block on the lock preventing heartbeatLoop from closing hbDead.

It is safe to remove the lock acquisition in handleError because the channel operations in the critical section are already thread safe.

I also moved the lock/unlock from Close into leave to more accurately reflect the resources being protected.

Example stack traces for the deadlock:

goroutine 1 [semacquire, 2 minutes]:
sync.runtime_SemacquireMutex(0xc00131aa74, 0xc0010b5500, 0x1)
        /usr/local/Cellar/go/1.13.3/libexec/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00131aa70)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/mutex.go:81
github.com/Shopify/sarama.(*consumerGroup).Close.func1()
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:123 +0x1f0
sync.(*Once).doSlow(0xc00131aa80, 0xc0045fb7c8)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/once.go:66 +0xe3
sync.(*Once).Do(...)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/once.go:57
github.com/Shopify/sarama.(*consumerGroup).Close(0xc00131aa20, 0x0, 0x0)
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:120 +0x7d
...

goroutine 468 [chan receive, 10 minutes]:
github.com/Shopify/sarama.(*consumerGroupSession).release.func1()
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:719 +0x97
sync.(*Once).doSlow(0xc0012f2664, 0xc0018f6d38)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/once.go:66 +0xe3
sync.(*Once).Do(...)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/once.go:57
github.com/Shopify/sarama.(*consumerGroupSession).release(0xc0012f2600, 0x1, 0x0, 0x0)
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:706 +0xb0
github.com/Shopify/sarama.(*consumerGroup).Consume(0xc00131aa20, 0x662ef60, 0xc000225580, 0xc0013c6f10, 0x1, 0x1, 0x66231a0, 0xc0007e37e0, 0x0, 0x0)
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:184 +0x27d
...

goroutine 956 [semacquire, 10 minutes]:
sync.runtime_SemacquireMutex(0xc00131aa74, 0xc000149f00, 0x1)
        /usr/local/Cellar/go/1.13.3/libexec/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00131aa70)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/mutex.go:81
github.com/Shopify/sarama.(*consumerGroup).handleError(0xc00131aa20, 0x65fa680, 0xc0072ff860, 0xc0012b9bc0, 0xc, 0xc000000000)
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:433 +0x175
github.com/Shopify/sarama.newConsumerGroupClaim.func1(0x6639e20, 0xc001114dc0, 0xc0012f2600, 0xc0012b9bc0, 0xc, 0x0)
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:846 +0x96
created by github.com/Shopify/sarama.newConsumerGroupClaim
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:844 +0x10d

@ghost ghost added the cla-needed label Jan 16, 2020
@dnwe
Copy link
Collaborator

dnwe commented Jan 17, 2020

@matthewloring these changes look reasonable — are you able to sign the cla?

@matthewloring
Copy link
Contributor Author

I already have. I didn't have permission to re-run the ci after signing.

@dnwe
Copy link
Collaborator

dnwe commented Jan 17, 2020

@matthewloring ah OK — it should re-run checks if you click on the Close pull request button and then immediately click on the Re-open pull request button straight after

@matthewloring
Copy link
Contributor Author

That started the travis CI but not the CLA check

@matthewloring
Copy link
Contributor Author

Anything else I can do to re-run the CLA bot?

@dnwe
Copy link
Collaborator

dnwe commented Jan 22, 2020

The quickest solution is probably just to close this PR and re-open with a new PR and just paste the description across.

Otherwise you'd have to chase one of the Shopify guys (@eapache / @d1egoaz), but its probably quicker to just try re-spinning a new PR

@ghost ghost removed the cla-needed label Jan 22, 2020
@d1egoaz
Copy link
Contributor

d1egoaz commented Jan 22, 2020

I've re-run the CLA check

@d1egoaz
Copy link
Contributor

d1egoaz commented Jan 22, 2020

ping @bai @varun06 👀

@d1egoaz d1egoaz requested review from bai and varun06 January 22, 2020 14:54
@bai
Copy link
Contributor

bai commented Jan 22, 2020

PR looks good to me. I’m away for the rest of the week, please feel free to merge at will. 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants