Fix `Drain()` infinite loop and add test for concurrent `Next()` calls #1525
Conversation
Fixes a possible deadlock where `Next()` is waiting while holding the lock and `cleanup()` is waiting for the lock to unsubscribe.
Really good work here, and again, thank you for the detailed issue and for the PR! There are some problems with your approach though, in my opinion, as explained in the comment below - please let me know what you think.
jetstream/pull.go (Outdated)

@@ -537,69 +535,79 @@ var (
)

func (s *pullSubscription) Next() (Msg, error) {
	s.Lock()
I don't think removing the lock from the entire `Next()` execution is a good idea. For example, it can cause problems with the `StopAfter` option, where executing `Next()` concurrently may lead to delivering more messages to the caller than specified in `StopAfter`.

Holding the lock for the duration of `Next()` is challenging, and you're absolutely right, we need to have a way to unlock it for `cleanup()` - therefore, I believe catching the closure of `s.done` is necessary. Based on your branch I came up with a different solution (it's a bit crude right now as I just wanted to give an example, this would have to be cleaned up a bit): https://github.com/nats-io/nats.go/blob/fix-drain-in-messages/jetstream/pull.go#L537

Here's the gist of it:

- When `s.done` is closed, we unlock the mutex so that the subscription can be cleaned up properly and `s.msgs` can be closed. We need a way to conditionally unlock the mutex in `defer`, thus the `done` bool (I really don't like that...).
- If we detect we are draining, we set `done` to true and continue. Subsequent iterations of the loop check the state of `done` and, if it is set, go to a select statement which does not listen on `s.done`. Those two select statements are identical except for whether or not we have `case <-s.done`.

I extracted `handleIncomingMessage()` and `handleError()` methods to make it a bit more readable, but right now they're just copy-pasted from the `select`.
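For illustration, here is a simplified, self-contained sketch of that pattern (the `done` bool plus two near-identical `select` statements). The types and field names below are placeholders, not the actual code from the linked branch or from `jetstream/pull.go`:

```go
package pullsketch

import (
	"errors"
	"sync"
)

// Msg is a placeholder for the real message type.
type Msg struct{ Data []byte }

// sub is a simplified stand-in for pullSubscription.
type sub struct {
	sync.Mutex
	msgs chan *Msg     // incoming messages
	errs chan error    // asynchronous errors
	done chan struct{} // closed by Stop()/Drain()
}

var errMsgIteratorClosed = errors.New("messages iterator closed")

// Next blocks until a message or error arrives. Once s.done is observed
// closed, it unlocks the mutex (so cleanup can run and close s.msgs) and
// keeps draining buffered messages with a select that no longer listens
// on s.done.
func (s *sub) Next() (*Msg, error) {
	s.Lock()
	done := false // set once s.done is observed closed (we are draining)
	defer func() {
		// Conditionally unlock: if we already unlocked after s.done was
		// closed, unlocking again would be a fatal error.
		if !done {
			s.Unlock()
		}
	}()

	for {
		if done {
			// Draining: identical select, but without the s.done case.
			select {
			case msg, ok := <-s.msgs:
				if !ok {
					return nil, errMsgIteratorClosed
				}
				return msg, nil
			case err := <-s.errs:
				return nil, err
			}
		}
		select {
		case msg, ok := <-s.msgs:
			if !ok {
				return nil, errMsgIteratorClosed
			}
			return msg, nil
		case err := <-s.errs:
			return nil, err
		case <-s.done:
			// Unlock so cleanup() can acquire the lock, unsubscribe and
			// close s.msgs; then keep looping to drain what is left.
			done = true
			s.Unlock()
		}
	}
}
```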
Alternatively, as you mentioned, we could use a separate lock just to make sure `Next()` cannot be executed concurrently.
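As a rough illustration of this alternative, a dedicated mutex (here called `nextMu`, a made-up name) could serialize `Next()` calls while the state lock stays free for short critical sections elsewhere; this is only a sketch, not the library's actual structure:

```go
package pullsketch

import (
	"errors"
	"sync"
)

var errMsgIteratorClosed = errors.New("messages iterator closed")

// messages is a simplified stand-in for the iterator returned by Messages().
type messages struct {
	mu     sync.Mutex  // protects subscription state and counters
	nextMu sync.Mutex  // only serializes Next() calls
	msgs   chan []byte // placeholder for the delivered messages
}

// Next cannot run concurrently with itself thanks to nextMu, while mu can
// still be taken and released in short critical sections (and by cleanup).
func (m *messages) Next() ([]byte, error) {
	m.nextMu.Lock()
	defer m.nextMu.Unlock()

	msg, ok := <-m.msgs
	if !ok {
		return nil, errMsgIteratorClosed
	}
	return msg, nil
}
```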
Sorry about that, I overlooked the `StopAfter` option. I think we should add a test to verify this behavior before we move on; the current test calls `Next()` sequentially. After we add this test, we should be able to refactor the code without breaking things.
There are also more elegant solutions:

- Using a dedicated lock for the `subscription` (used in `cleanup()`) and another lock for the counter fields, or maybe the rest of the fields
- Using atomic values for the counter fields

Which solution do you prefer?
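A rough, hypothetical sketch of such a test follows, assuming a locally running nats-server with JetStream enabled on the default URL; the stream, subject and consumer names are made up, and the test eventually added in this PR may be structured differently:

```go
package jetstreamtest

import (
	"context"
	"errors"
	"sync"
	"sync/atomic"
	"testing"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func TestConcurrentNextWithStopAfter(t *testing.T) {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		t.Fatal(err)
	}
	defer nc.Close()

	js, err := jetstream.New(nc)
	if err != nil {
		t.Fatal(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	stream, err := js.CreateStream(ctx, jetstream.StreamConfig{
		Name:     "TEST_STOP_AFTER",
		Subjects: []string{"test.stopafter"},
	})
	if err != nil {
		t.Fatal(err)
	}
	cons, err := stream.CreateOrUpdateConsumer(ctx, jetstream.ConsumerConfig{
		Durable:   "cons",
		AckPolicy: jetstream.AckExplicitPolicy,
	})
	if err != nil {
		t.Fatal(err)
	}

	// Publish more messages than the StopAfter limit.
	const total, stopAfter = 50, 10
	for i := 0; i < total; i++ {
		if _, err := js.Publish(ctx, "test.stopafter", []byte("msg")); err != nil {
			t.Fatal(err)
		}
	}

	it, err := cons.Messages(jetstream.StopAfter(stopAfter))
	if err != nil {
		t.Fatal(err)
	}
	defer it.Stop()

	// Call Next() from several goroutines at once and count deliveries.
	var received atomic.Int32
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				msg, err := it.Next()
				if err != nil {
					// Next returns ErrMsgIteratorClosed once the
					// iterator is stopped after StopAfter messages.
					if !errors.Is(err, jetstream.ErrMsgIteratorClosed) {
						t.Error(err)
					}
					return
				}
				_ = msg.Ack()
				received.Add(1)
			}
		}()
	}
	wg.Wait()

	// Concurrent Next() calls must not deliver more than StopAfter messages.
	if got := received.Load(); got != stopAfter {
		t.Fatalf("expected %d messages, got %d", stopAfter, got)
	}
}
```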
I think using a separate lock for accessing `subscription` would be the preferable solution, since I would be hesitant about allowing concurrent `Next()` calls - for concurrency, the suggested approach would be to create a whole new `MessagesContext()` for the same consumer. A separate lock sounds like it could actually simplify some things though, so that's nice.

Do you have time and would you like to tackle this? Or should I take over?
@piotrpio yes, you're right, this is a better solution.

I started working on adding a test for concurrent `Next()` calls so that any changes afterwards won't break this behavior.

I'll see what I can do, and you too feel free to do whatever is best. I'll keep you updated with what I come up with.
I think that in `cleanup()` there's no risk of race conditions, so it might be ok without holding the lock. It only reads the `subscription` field, which is set by the methods that create the `pullSubscription` struct, like `pullConsumer.Consume`, `pullConsumer.Messages` and `pullConsumer.fetch`, and the actual `Subscription` has its own mutex.

But I don't know if this is acceptable, I mean for future changes to the code that might introduce race conditions.
If it's ok right now (and it looks like it is) and it does not produce a race, we can try to go without the lock, I think - if we need locking mechanisms in the future, we can always add them. Just please (if you're working on it) add an appropriate comment on why the lock is not needed.

Thank you again for your contribution, it's extremely valuable.
OK, I will add the comment right now.

Added a comment to the cleanup function explaining why it doesn't need to hold the lock.
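For reference, here is a simplified sketch of what such a comment on `cleanup()` might look like; this is not the actual diff from the PR, and the struct is reduced to the fields relevant to this discussion:

```go
package pullsketch

import "github.com/nats-io/nats.go"

// pullSubscription is a simplified stand-in for the real struct in
// jetstream/pull.go; only the fields relevant to cleanup() are shown.
type pullSubscription struct {
	subscription *nats.Subscription
	msgs         chan *nats.Msg
}

// cleanup unsubscribes the underlying subscription and closes the msgs channel.
//
// Note: no lock is held here. cleanup() only reads s.subscription, which is
// set by the methods that create the pullSubscription (Consume, Messages and
// fetch) and is not reassigned afterwards, and the *nats.Subscription guards
// its own internal state with its own mutex.
func (s *pullSubscription) cleanup() {
	if s.subscription == nil || !s.subscription.IsValid() {
		return
	}
	_ = s.subscription.Unsubscribe()
	close(s.msgs)
}
```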
Changed the title: Fix `Drain()` infinite loop and minimize lock scope in `Next()` → Fix `Drain()` infinite loop and add test for concurrent `Next()` calls
That looks great, LGTM!
This is a possible fix for #1524.

Previously, when calling `Stop()`, we were checking for the `done` channel to be closed to exit the loop in `Next()`, but as stated in #1524 this didn't work for `Drain()`, as it also closes the `done` channel. Now we check whether the `msgs` channel is closed to exit the loop. This required some changes to how locking is done to prevent a deadlock: `Next()` was holding the lock until it returned (which might take a long time), and that prevented `cleanup()` from executing (it was waiting for the lock) to unsubscribe (where the `msgs` channel is closed).

Changes in this pull request:

- Fix the `Drain()` infinite loop
- Add a test for concurrent `Next()` calls with the `StopAfter` option (auto unsubscribe)
- Add a comment to the `cleanup()` function
- Remove the `drained` channel from `pullSubscription`

Maybe a better solution would be to use multiple locks, for example one lock dedicated only to the subscription.
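To make the described change concrete, here is a minimal sketch (not the actual PR diff) of `Next()` exiting on a closed `msgs` channel and holding the lock only for short critical sections; the field names are simplified placeholders:

```go
package pullsketch

import (
	"errors"
	"sync"
)

var errMsgIteratorClosed = errors.New("messages iterator closed")

// pullSubscription is again a simplified placeholder; the real struct tracks
// more state (subscription, done channel, pending counters, ...).
type pullSubscription struct {
	sync.Mutex
	msgs        chan []byte
	pendingMsgs int
}

// Next no longer watches the done channel to decide when to exit (Drain()
// also closes done, which caused the infinite loop). Instead it exits when
// the msgs channel is closed, and it only holds the lock for short critical
// sections rather than for the whole call, so it no longer blocks cleanup()
// from unsubscribing and closing msgs.
func (s *pullSubscription) Next() ([]byte, error) {
	msg, ok := <-s.msgs // no lock held while blocked here
	if !ok {
		// Closed by cleanup() after Stop() or Drain().
		return nil, errMsgIteratorClosed
	}

	// Lock only to update shared counters, then release immediately.
	s.Lock()
	s.pendingMsgs--
	s.Unlock()

	return msg, nil
}
```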