Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential deadlock on startup #29

Merged
merged 3 commits into from
Apr 4, 2014
Merged

Conversation

mreiferson
Copy link
Member

This is the same issue as brought up by @rtecco in https://gist.github.com/rtecco/9945119

Here's a full dump of of blocked state in real life: https://gist.github.com/kjk/33ec3bdf079e43869d14

The issue is that: ConnectionMaxInFlight() takes RLock() and then calls MaxInFlight() which also takes RLock(). That is not a problem unless some other place takes (or even attempts to) Lock(), in which case the second RLock() will block and deadlock happens. In the dump below the place that is waiting in Lock() is ConnectToLookupd().

This RLock() behavior is un-intuitive enough that I had to write a test program to test that this is how it works: https://gist.github.com/shurcooL/e0639ec703e3282200e9

Note: there might be other places where RLock() is called recursively and therefore susceptible to this problem. An easy fix is to change them all to Lock(), that will immediately deadlock in such cases.

A full program that shows the problem (non-deterministically): https://gist.github.com/kjk/41cd49704c635b35bfd1

You just need to set correct nsqlookupdHosts, topic and channel variables.

@jehiah jehiah added the bug label Apr 3, 2014
@mreiferson
Copy link
Member

@kfk thanks for writing this up!

Interestingly there is currently an ongoing discussion on golang-dev regarding recursive use of sync.RWMutex - https://groups.google.com/forum/#!topic/golang-dev/k3XKTsYS_-w

In this case the recursion is unintentional and unnecessary, I'll clean it up.

@mreiferson
Copy link
Member

@kjk take this PR for a spin and see if you can reproduce, I think I got them all...

@mreiferson
Copy link
Member

I verified with your posted test program locally, cannot reproduce deadlock with these changes.

RFR @jehiah

@kjk
Copy link
Author

kjk commented Apr 3, 2014

I verified I can no longer reproduce the dead-lock so it seem to be working.

@mreiferson mreiferson mentioned this pull request Apr 3, 2014
backoffChan chan bool
rdyChan chan *nsqConn
needRDYRedistributed int32
backoffCounter int32

mifMutex sync.RWMutex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this abbreviation here really necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea yea, I'll be verbose.. but only because I ❤️ @jphines

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might make this new max-in-flight lock a plain mutex, instead of a RW lock. The critical sections are so short and trivial, and the overhead of the additional complexity of a RW lock are probably not worth it.

@jphines
Copy link
Contributor

jphines commented Apr 3, 2014

I like the style of naming locks here, especially as we add more. Why don't you name the lock defined at https://github.com/mreiferson/go-nsq/blob/recursive_locks_29/reader.go#L90

addr, err.Error())
continue
}
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to break here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's the "I succeeded to connect" state... but this will no longer be in the PR after I revert that commit... (and note that this hasn't actually changed)

@ploxiln
Copy link
Member

ploxiln commented Apr 3, 2014

LGTM. but I don't have that much familiarity with this codebase :)

for _, c := range q.nsqConnections {
if i != choice {
if i == idx {
i++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it appears to me that, if i != idx, i will never increment, and choice will never be set?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, the only edge case where this would be possible would be if # of connections is zero... I'll add a check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still confused about this, sorry.

var i int
idx := rand.Intn(len(q.nsqConnections))

So, first, i is set to 0 (default initialization value), and idx could be 0, or if there's at least 2 connections, it could be 1, etc. So let's say it happens to be that there are 2 connections and idx is randomly set to 1 in this example.

for _, c := range q.nsqConnections {
    if i == idx {
        i++
        choice = c
        break
    }
}

Again, let's say there's two connections, so in this loop c is 0 in the first iteration, and 1 in the second. In each iteration, if i == idx is false, because i is 0 and idx is 1. And because that condition is false, nothing happens (i does not increment!), and the loop goes on to the next iteration, and the condition is the same - always false.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so c isn't 0 or 1, it's the connection struct, but that doesn't matter to the fact that the flow control does not do anything if idx != 0.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad, I didn't realize what line you were actually commenting on - good catch, the increment should be outside the conditional...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change to check for "lack of choice" (no conns, choice == nil) is still necessary

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. I think the code is correct now.

... but for style points, you could let the range operator set and update i for you, instead of putting an underscore there 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a map, it would be the key not an enumeration

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see, I don't know this go stuff

@mreiferson
Copy link
Member

ok, I think I got everything (I don't want to touch the other mutex @jphines and I don't want to just use a regular Lock because concurrent goroutines should be able to read max in flight @ploxiln )

@ploxiln
Copy link
Member

ploxiln commented Apr 4, 2014

Looks like you've addressed everything, perhaps it's time to tastefully squash?

* use a more granular lock on q.maxInFlight to avoid recursion
* use a more granular lock around connections
* don't hold RLock over q.ConnectionMaxInFlight
avoid holding RLock over q.updateRDY
@mreiferson
Copy link
Member

squashed

jphines pushed a commit that referenced this pull request Apr 4, 2014
@jphines jphines merged commit cd846c1 into nsqio:master Apr 4, 2014
@mreiferson mreiferson deleted the recursive_locks_29 branch April 4, 2014 03:36
@jehiah
Copy link
Member

jehiah commented Apr 4, 2014

Thanks @mreiferson for the quick fix and @ploxiln @jphines and @kjk for reviewing and validating this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants