grpc: hold ac.mu while calling resetTransport to prevent concurrent connection attempts #7390
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:
@@            Coverage Diff             @@
##           master    #7390      +/-   ##
==========================================
- Coverage   81.51%   81.35%   -0.17%
==========================================
  Files         348      348
  Lines       26744    26741       -3
==========================================
- Hits        21801    21754      -47
- Misses       3764     3793      +29
- Partials     1179     1194      +15
clientconn.go (Outdated)
@@ -918,6 +918,9 @@ func (ac *addrConn) connect() error {
 		ac.mu.Unlock()
 		return nil
 	}
+	// Update the state to ensure no concurrent requests start once the lock
+	// is released.
+	ac.updateConnectivityState(connectivity.Connecting, nil)
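For context, here is a minimal, self-contained sketch of why moving the Connecting transition under the lock closes the race: a second connect() racing with the first now observes a non-Idle state and returns early. The addrConn, connState, and helpers below are simplified stand-ins, not the actual grpc-go code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Simplified stand-ins for grpc-go's connectivity states and addrConn.
type connState int

const (
	idle connState = iota
	connecting
	shutdown
)

type addrConn struct {
	mu    sync.Mutex
	state connState
}

func (ac *addrConn) updateConnectivityState(s connState) { ac.state = s }

// connect mirrors the shape of the diff above: the Shutdown/Idle checks and
// the transition to Connecting all happen under ac.mu, so a concurrent
// connect() sees Connecting and returns nil instead of starting a second
// connection attempt.
func (ac *addrConn) connect() error {
	ac.mu.Lock()
	if ac.state == shutdown {
		ac.mu.Unlock()
		return errors.New("connection is closing")
	}
	if ac.state != idle {
		ac.mu.Unlock()
		return nil
	}
	// Update the state to ensure no concurrent requests start once the lock
	// is released.
	ac.updateConnectivityState(connecting)
	ac.mu.Unlock()

	// The (elided) transport creation happens outside the critical section.
	return nil
}

func main() {
	ac := &addrConn{}
	fmt.Println(ac.connect(), ac.state == connecting) // <nil> true
	fmt.Println(ac.connect())                         // <nil>; the second call is a no-op
}
```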
We should probably file a bug for this, otherwise it will be a behavior change?
Can you look at #7365 (comment) for the details of this bug?
What change in behaviour are you concerned about?
Oh, I was just asking with respect to the release notes, because currently the bug points to a test flake. Anyway, I just checked and it doesn't matter, because release notes refer to the fix PR and not the issue. Although in the release notes, we should prefix the package name:
balancer: Fix race condition that could lead to multiple transports being created in parallel
Also, it looks like without your fix, there is a case where resetTransport can error out and return without updating the connectivity state:
Line 1237 in bdd707e
if acCtx.Err() != nil {
Maybe we can run resetTransport() in the same critical section instead of releasing the lock and acquiring it again in resetTransport()?
I don't see any benefit in adding the acCtx.Err check in connect, because even resetTransport sets the state to Connecting and releases the lock:
Lines 1262 to 1263 in bdd707e
ac.updateConnectivityState(connectivity.Connecting, nil)
ac.mu.Unlock()
This means that the context can be cancelled (and subsequently the addrConn shut down) after the channel is in the Connecting state even without the change.
IIUC we just need to ensure that we don't set the Connecting state after the channel enters Shutdown. The check for the Shutdown state on top should be enough protection to ensure the Shutdown state comes only after we enter Connecting.
Maybe we can run resetTransport() in the same critical section instead of releasing the lock and acquiring it again in resetTransport()?
We could rename resetTransport to resetTransportLocked and expect the callers to hold the lock while calling this method. However, resetTransport releases the lock temporarily. Add to this that ac.updateAddrs calls resetTransport in a new goroutine, so it can't hold the lock till resetTransport completes. It feels a little risky to make that change. I don't know for sure, but I feel we could end up in a situation where the lock is not released correctly, resulting in a deadlock.
I don't want to make that change as the first option.
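As a rough sketch of the constraint being described (conn, updateAddrs, and resetAndUnlock below are illustrative names, not the real grpc-go symbols): because the reset runs on a new goroutine, the caller cannot hold the lock until it completes; instead, ownership of the held mutex is handed to the goroutine, which is then responsible for releasing it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// conn is a toy stand-in for addrConn.
type conn struct {
	mu    sync.Mutex
	addrs []string
}

// resetAndUnlock expects c.mu to be held by the caller and releases it before
// any long-running work starts.
func (c *conn) resetAndUnlock() {
	addrs := c.addrs
	c.mu.Unlock()
	// Expensive work (e.g. dialing) happens outside the critical section.
	fmt.Println("reconnecting to", addrs)
}

// updateAddrs launches the reset on a new goroutine while still holding c.mu,
// so the goroutine, not updateAddrs, unlocks the mutex. This is why a plain
// "Locked" suffix (caller locks and unlocks) does not quite fit here.
func (c *conn) updateAddrs(addrs []string) {
	c.mu.Lock()
	c.addrs = addrs
	go c.resetAndUnlock()
}

func main() {
	c := &conn{}
	c.updateAddrs([]string{"10.0.0.1:443"})
	time.Sleep(100 * time.Millisecond) // give the goroutine time to run
}
```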
I don't see any benefit in adding the acCtx.Err check in connect, because even resetTransport sets the state to Connecting and releases the lock

From the code it looks like in the case of acCtx.Err, resetTransport() doesn't update the state and returns, so the state will still be Idle; but after your fix, in the case of acCtx.Err the state will be updated to Connecting. Am I missing something?
From my understanding, ac.ctx is used to control the creation of remote connections, while ac.state is used to synchronize all the state transitions for the addrConn. ac.connect() doesn't deal with creating remote connections, so it doesn't need to check ac.ctx.Err(). It needs to ensure the transition to Connecting is valid, which it does by locking the mutex and verifying that ac.state != Shutdown.
ac.ctx is used to avoid doing throw-away work which takes significant time (creating a remote conn).
Please let me know if your understanding is different.
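A small illustrative sketch of that division of labor, assuming nothing about grpc-go beyond what is described above (dialIfAlive is a made-up helper): the context gates the expensive dial so cancelled connections do no throw-away work, while state transitions would be synchronized separately under the mutex.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// dialIfAlive skips the expensive dial if the connection has already been
// torn down; this is an illustrative helper, not grpc-go's transport code.
func dialIfAlive(ctx context.Context, addr string) (net.Conn, error) {
	if err := ctx.Err(); err != nil {
		return nil, fmt.Errorf("skipping dial of %s: %w", addr, err)
	}
	d := net.Dialer{Timeout: 2 * time.Second}
	return d.DialContext(ctx, "tcp", addr)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the addrConn being shut down before the dial starts
	_, err := dialIfAlive(ctx, "localhost:50051")
	fmt.Println("dial attempted:", !errors.Is(err, context.Canceled)) // false
}
```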
Discussed offline: it doesn't matter if resetTransport() returns an error after the state has been updated to Connecting.
Release notes need to be prefixed with "package name:" in this case.
It's a flake, there's an existing issue for this and I've commented on the issue about this failure: #6914
lgtm
Force-pushed from 375bdd7 to 6214c9d.
What is the user-visible symptom here? A memory leak? Or just an extra connection that was attempted, but that will quickly go away on its own anyway? (Or will the extra connection stick around until the channel is closed?)
@@ -1231,8 +1228,7 @@ func (ac *addrConn) adjustParams(r transport.GoAwayReason) {
 	}
 }
 
-func (ac *addrConn) resetTransport() {
-	ac.mu.Lock()
+func (ac *addrConn) resetTransportAndUnlock() {
Please add a short comment here:
// resetTransportAndUnlock unconditionally connects the addrConn.
//
// ac.mu must be held by the caller, and this function will guarantee it is released.
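As a rough illustration only (the addrConn here is a toy stand-in, not the real type), a function shaped the way this suggested comment describes releases ac.mu on every path, including early returns:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// addrConn is a toy stand-in with just the fields needed for the sketch.
type addrConn struct {
	mu  sync.Mutex
	ctx context.Context
}

// resetTransportAndUnlock unconditionally connects the addrConn.
//
// ac.mu must be held by the caller, and this function will guarantee it is
// released.
func (ac *addrConn) resetTransportAndUnlock() {
	if ac.ctx.Err() != nil {
		// Even the early-return path releases the lock.
		ac.mu.Unlock()
		return
	}
	ac.mu.Unlock()
	fmt.Println("creating the transport outside the critical section")
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	ac := &addrConn{ctx: ctx}
	ac.mu.Lock()
	ac.resetTransportAndUnlock() // the lock is released inside
}
```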
ac.mu must be held by the caller, and this function will guarantee it is released
Should we have a code check for this as well?
Add the doc comment.
Should we have a code check for this as well?

We don't verify if the mutex is locked in other functions in the code which assume the caller has the lock. These functions have the suffix Locked in their name.
- There is a TryLock() method on sync.Mutex, but its use is discouraged. resetTransportAndUnlock is a private method and we run tests with the race detector, so we can catch incorrect usage.
- When the resetTransportAndUnlock method tries to unlock a mutex that isn't locked, it will panic.
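For concreteness, a tiny standalone sketch of both points: sync.Mutex gained TryLock in Go 1.18 (its documentation notes that correct uses are rare), and unlocking a mutex that is not locked aborts the program at runtime.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.Mutex

	mu.Lock()
	fmt.Println("TryLock while held:", mu.TryLock()) // false: the lock is already held
	mu.Unlock()

	fmt.Println("TryLock while free:", mu.TryLock()) // true: we now hold the lock
	mu.Unlock()

	// Unlocking an unlocked mutex is an unrecoverable runtime error
	// ("fatal error: sync: unlock of unlocked mutex"), so it is left
	// commented out here.
	// mu.Unlock()
}
```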
I meant that just having a doc comment doesn't enforce that the mutex is locked. Unlocking a mutex that is not locked in Go will result in a runtime panic.
Can you suggest how to enforce the locking?
How is the caller expected to handle the error if resetTransportAndUnlock doesn't panic?
Since it's a private method, doesn't a panic make the failure more visible and ensure incorrect usages are caught by tests?
Yeah, it will probably require implementing a custom mutex with some boolean field, but since we have precedents of methods unlocking the mutex (e.g. https://github.com/grpc/grpc-go/blob/master/clientconn.go#L728), devs will be aware of these.
I think the name of the function and the comment should be sufficient for this.
What I'm seeing in the test is that only one transport is closed when the subConn is updated (…)
This change ensures that the caller of resetTransport() keeps holding the mutex instead of releasing it and having resetTransport() re-acquire it. This ensures that no concurrent requests are able to start once the caller of resetTransport does some validation and calls resetTransport.
See #7365 (comment) for more details.

Tested
Verified that Test/AuthorityRevive no longer flakes for 100000 attempts with the change.

Fixes: #7365

RELEASE NOTES: