Fix deadlock in consumer group handleError #1581

matthewloring · 2020-01-16T21:22:21Z

The deadlock can happen if the heartbeat request runs out of retries while the consumer group session is being released. consumerGroup.Consume acquires the consumer group lock and then calls consumerGroupSession.release, which blocks waiting for the hbDead channel to be closed. If heartbeatLoop then runs out of retries and calls consumerGroup.handleError, that call blocks on the same lock, which prevents heartbeatLoop from ever closing hbDead.

It is safe to remove the lock acquisition in handleError because the channel operations in its critical section are already safe for concurrent use.

I also moved the lock/unlock from Close into leave to more accurately reflect the resources being protected.

Example stack traces for the deadlock:

goroutine 1 [semacquire, 2 minutes]:
sync.runtime_SemacquireMutex(0xc00131aa74, 0xc0010b5500, 0x1)
        /usr/local/Cellar/go/1.13.3/libexec/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00131aa70)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/mutex.go:81
github.com/Shopify/sarama.(*consumerGroup).Close.func1()
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:123 +0x1f0
sync.(*Once).doSlow(0xc00131aa80, 0xc0045fb7c8)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/once.go:66 +0xe3
sync.(*Once).Do(...)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/once.go:57
github.com/Shopify/sarama.(*consumerGroup).Close(0xc00131aa20, 0x0, 0x0)
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:120 +0x7d
...

goroutine 468 [chan receive, 10 minutes]:
github.com/Shopify/sarama.(*consumerGroupSession).release.func1()
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:719 +0x97
sync.(*Once).doSlow(0xc0012f2664, 0xc0018f6d38)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/once.go:66 +0xe3
sync.(*Once).Do(...)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/once.go:57
github.com/Shopify/sarama.(*consumerGroupSession).release(0xc0012f2600, 0x1, 0x0, 0x0)
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:706 +0xb0
github.com/Shopify/sarama.(*consumerGroup).Consume(0xc00131aa20, 0x662ef60, 0xc000225580, 0xc0013c6f10, 0x1, 0x1, 0x66231a0, 0xc0007e37e0, 0x0, 0x0)
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:184 +0x27d
...

goroutine 956 [semacquire, 10 minutes]:
sync.runtime_SemacquireMutex(0xc00131aa74, 0xc000149f00, 0x1)
        /usr/local/Cellar/go/1.13.3/libexec/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00131aa70)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
        /usr/local/Cellar/go/1.13.3/libexec/src/sync/mutex.go:81
github.com/Shopify/sarama.(*consumerGroup).handleError(0xc00131aa20, 0x65fa680, 0xc0072ff860, 0xc0012b9bc0, 0xc, 0xc000000000)
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:433 +0x175
github.com/Shopify/sarama.newConsumerGroupClaim.func1(0x6639e20, 0xc001114dc0, 0xc0012f2600, 0xc0012b9bc0, 0xc, 0x0)
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:846 +0x96
created by github.com/Shopify/sarama.newConsumerGroupClaim
        /Users/<redacted>/go/pkg/mod/github.com/!shopify/sarama@v1.25.0/consumer_group.go:844 +0x10d

dnwe · 2020-01-17T16:00:39Z

@matthewloring these changes look reasonable. Are you able to sign the CLA?

matthewloring · 2020-01-17T16:02:56Z

I already have. I didn't have permission to re-run the CI after signing.

dnwe · 2020-01-17T16:04:43Z

@matthewloring ah OK. It should re-run the checks if you click the Close pull request button and then immediately click the Re-open pull request button.

matthewloring · 2020-01-17T17:57:13Z

That started Travis CI but not the CLA check.

matthewloring · 2020-01-21T23:06:14Z

Anything else I can do to re-run the CLA bot?

dnwe · 2020-01-22T09:26:05Z

The quickest solution is probably to close this PR, open a new one, and paste the description across.

Otherwise you'd have to chase one of the Shopify guys (@eapache / @d1egoaz), but it's probably quicker to just open a new PR.

d1egoaz · 2020-01-22T14:51:24Z

I've re-run the CLA check

d1egoaz · 2020-01-22T14:54:24Z

ping @bai @varun06 👀

bai · 2020-01-22T15:15:00Z

PR looks good to me. I’m away for the rest of the week, please feel free to merge at will. 🙏

Fix deadlock in consumer group Close

7f85050

ghost added the cla-needed label Jan 16, 2020

matthewloring closed this Jan 17, 2020

matthewloring reopened this Jan 17, 2020

ghost removed the cla-needed label Jan 22, 2020

d1egoaz requested review from bai and varun06 January 22, 2020 14:54

bai approved these changes Jan 22, 2020

d1egoaz merged commit 33aa349 into IBM:master Jan 22, 2020

sysadmind mentioned this pull request Jan 22, 2020

consumerGroup will deadlock when handling errors #1554

Closed

napallday mentioned this pull request Sep 9, 2022

fix: race condition(may panic) when closing consumer group #2331

Merged
