[core] Fixing some lock-order-inversion and data race problems (TSAN reports) #1824
// [[using locked(m_GlobControlLock)]]
void srt::CUDTSocket::breakSocket_LOCKED()
{
    // This function is intended to be called from GC,
    // under a lock of m_GlobControlLock.
    m_UDT.m_bBroken = true;
    m_UDT.m_iBrokenCounter = 0;
    HLOGC(smlog.Debug, log << "@" << m_SocketID << " CLOSING AS SOCKET");
    m_UDT.closeInternal();
    setClosed();
}
One minor suggestion, the rest can be approved.
Co-authored-by: Maxim Sharabayko <maxlovic@gmail.com>
See Epic #1813 for some details about the TSAN reports. Fixes include:
- Unlocked `m_GlobControlLock` before locking `m_ConnectionLock` in the call to `CUDT::closeInternal`, because `m_GlobControlLock` orders after `m_ConnectionLock`. The lock is not needed for calling `closeInternal` (although it is still applied when the call is made from GC): the function is called after the socket has been removed from all containers, including `m_ClosedSockets`, so no other thread should be able to dispatch to it.
- Applied a lock on `m_RcvBufferLock` in `CUDT::readReady` and `CUDTUnited::checkBrokenSockets`. These functions read shared fields of the receiver buffer without holding the lock, which constitutes a data race.
- The `CSndBuffer::m_iCount` field was made atomic. Although most operations access this field under a mutex, the method returning its value takes no lock. Fortunately the value is self-standing, so making it atomic is enough to make the read thread-safe; modifications should still happen under the lock to stay consistent with other fields. The value is read only to check whether the buffer is empty, and it is not a problem if the buffer is modified just after the value has been read as 0.
- Removed locking on `m_GlobControlLock` for the group synchronization activities in `CUDT::acceptAndRespond`. This lock was applied only to keep the group from being deleted while it is accessed. A counter-lock mechanism exists for this purpose; it still requires locking `m_GlobControlLock`, but only for the time needed to modify the counter, so this can be moved outside the section locked by `CUDT::m_ConnectionLock`, while the persistence of the group is maintained for the whole function lifetime. The second locking of `m_GlobControlLock` is also done outside that section, because the destruction order makes the earlier-created object get destroyed later.
- The call to `CRcvQueue::ifNewEntry` now applies a lock on the container, which is normally accessed under a lock in all other places; checking the container for emptiness must also be done under the lock. This is an inefficient solution: the alternatives are either to lock-and-extract if available, or to use a helper atomic boolean field that is updated whenever anything is added to or removed from the buffer. Still, this solution at least fixes the problem.
- The `m_dCongestionWindow` field has been changed from `double` to `atomic<int>`. Everywhere this value was read it was converted to `int`, so this should make no difference, but it had to be made atomic because it was read and written in two different threads.

NOTE: This doesn't fix all the reported TSAN problems; some other solutions are pending.
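
The `CSndBuffer::m_iCount` change described above can be illustrated with a minimal sketch. This is not the actual SRT code: `SketchSndBuffer`, its `push`/`pop`/`count` methods and the `std::deque` storage are hypothetical stand-ins, chosen only to show a counter that is written under a mutex but read without one.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <mutex>

// Hypothetical stand-in for a send buffer; NOT the real CSndBuffer.
class SketchSndBuffer
{
public:
    void push(int pkt)
    {
        std::lock_guard<std::mutex> lk(m_Lock);
        m_Packets.push_back(pkt);
        // Written under the lock so it stays consistent with m_Packets.
        m_iCount.store(static_cast<int>(m_Packets.size()));
    }

    bool pop(int& pkt)
    {
        std::lock_guard<std::mutex> lk(m_Lock);
        if (m_Packets.empty())
            return false;
        pkt = m_Packets.front();
        m_Packets.pop_front();
        m_iCount.store(static_cast<int>(m_Packets.size()));
        return true;
    }

    // Read without the lock: the atomic removes the data race, but the
    // value may already be stale by the time the caller uses it.
    int count() const { return m_iCount.load(); }

private:
    std::mutex       m_Lock;
    std::deque<int>  m_Packets;
    std::atomic<int> m_iCount{0};
};
```

A caller that only wants an "is it empty?" hint can call `count()` from any thread; anything that must act on the buffer contents still has to go through the locked methods.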

NOTE 2: The lock-order-inversion reported for the calls around `setListeningState` versus the call of `acceptAndRespond` has been set as "tolerated". Even though the two different lock orders can be found very close to one another, these two activities cannot happen together at runtime without doing something stupid in the listener callback.
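
The ordering rule behind the first fix (release `m_GlobControlLock` before anything locks `m_ConnectionLock`) can be sketched as follows. All names here (`glob_control_lock`, `connection_lock`, `closed_sockets`, `close_internal_sketch`, `remove_then_close`) are hypothetical analogues for illustration and do not reproduce the actual SRT functions.

```cpp
#include <cassert>
#include <mutex>
#include <set>

std::mutex    glob_control_lock; // hypothetical analogue of m_GlobControlLock
std::mutex    connection_lock;   // hypothetical analogue of m_ConnectionLock
std::set<int> closed_sockets;    // hypothetical analogue of m_ClosedSockets

// Takes connection_lock internally, so in this sketch it is called
// only after glob_control_lock has been released.
bool close_internal_sketch(int /*id*/)
{
    std::lock_guard<std::mutex> lk(connection_lock);
    return true; // the real close logic is out of scope here
}

bool remove_then_close(int id)
{
    {
        std::lock_guard<std::mutex> lk(glob_control_lock);
        if (closed_sockets.erase(id) == 0)
            return false; // unknown socket: nothing to close
    } // glob_control_lock is released here

    // The socket is no longer reachable through the container, so no
    // other thread should dispatch to it; closing without the global
    // lock avoids holding both locks in the inverted order.
    return close_internal_sketch(id);
}
```

The inner scope is the whole point of the sketch: the two mutexes are never held at the same time on this path, so this path cannot contribute to an inversion regardless of the order used elsewhere.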