-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Potential deadlock on startup #29
Conversation
@kfk thanks for writing this up! Interestingly there is currently an ongoing discussion on In this case the recursion is unintentional and unnecessary, I'll clean it up. |
@kjk take this PR for a spin and see if you can reproduce, I think I got them all... |
I verified with your posted test program locally, cannot reproduce deadlock with these changes. RFR @jehiah |
I verified I can no longer reproduce the dead-lock so it seem to be working. |
backoffChan chan bool | ||
rdyChan chan *nsqConn | ||
needRDYRedistributed int32 | ||
backoffCounter int32 | ||
|
||
mifMutex sync.RWMutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this abbreviation here really necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea yea, I'll be verbose.. but only because I ❤️ @jphines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might make this new max-in-flight lock a plain mutex, instead of a RW lock. The critical sections are so short and trivial, and the overhead of the additional complexity of a RW lock are probably not worth it.
I like the style of naming locks here, especially as we add more. Why don't you name the lock defined at |
addr, err.Error()) | ||
continue | ||
} | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we want to break here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's the "I succeeded to connect" state... but this will no longer be in the PR after I revert that commit... (and note that this hasn't actually changed)
LGTM. but I don't have that much familiarity with this codebase :) |
for _, c := range q.nsqConnections { | ||
if i != choice { | ||
if i == idx { | ||
i++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it appears to me that, if i != idx, i will never increment, and choice will never be set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, the only edge case where this would be possible would be if # of connections is zero... I'll add a check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still confused about this, sorry.
var i int
idx := rand.Intn(len(q.nsqConnections))
So, first, i
is set to 0
(default initialization value), and idx
could be 0
, or if there's at least 2 connections, it could be 1
, etc. So let's say it happens to be that there are 2
connections and idx
is randomly set to 1
in this example.
for _, c := range q.nsqConnections {
if i == idx {
i++
choice = c
break
}
}
Again, let's say there's two connections, so in this loop c
is 0
in the first iteration, and 1
in the second. In each iteration, if i == idx
is false, because i
is 0
and idx
is 1
. And because that condition is false, nothing happens (i does not increment!), and the loop goes on to the next iteration, and the condition is the same - always false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so c
isn't 0
or 1
, it's the connection struct, but that doesn't matter to the fact that the flow control does not do anything if idx != 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my bad, I didn't realize what line you were actually commenting on - good catch, the increment should be outside the conditional...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the change to check for "lack of choice" (no conns, choice == nil
) is still necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. I think the code is correct now.
... but for style points, you could let the range operator set and update i
for you, instead of putting an underscore there 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a map, it would be the key not an enumeration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see, I don't know this go stuff
Looks like you've addressed everything, perhaps it's time to tastefully squash? |
* use a more granular lock on q.maxInFlight to avoid recursion * use a more granular lock around connections * don't hold RLock over q.ConnectionMaxInFlight
avoid holding RLock over q.updateRDY
squashed |
Thanks @mreiferson for the quick fix and @ploxiln @jphines and @kjk for reviewing and validating this |
This is the same issue as brought up by @rtecco in https://gist.github.com/rtecco/9945119
Here's a full dump of of blocked state in real life: https://gist.github.com/kjk/33ec3bdf079e43869d14
The issue is that: ConnectionMaxInFlight() takes RLock() and then calls MaxInFlight() which also takes RLock(). That is not a problem unless some other place takes (or even attempts to) Lock(), in which case the second RLock() will block and deadlock happens. In the dump below the place that is waiting in Lock() is ConnectToLookupd().
This RLock() behavior is un-intuitive enough that I had to write a test program to test that this is how it works: https://gist.github.com/shurcooL/e0639ec703e3282200e9
Note: there might be other places where RLock() is called recursively and therefore susceptible to this problem. An easy fix is to change them all to Lock(), that will immediately deadlock in such cases.
A full program that shows the problem (non-deterministically): https://gist.github.com/kjk/41cd49704c635b35bfd1
You just need to set correct nsqlookupdHosts, topic and channel variables.