Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Fixing some lock-order-inversion and data race problems (TSAN reports) #1824

Merged
merged 9 commits into from
Jan 23, 2024

Conversation

ethouris
Copy link
Collaborator

@ethouris ethouris commented Feb 19, 2021

See Epic #1813 for some details about the TSAN reports. Fixes include:

  1. Unlocking m_GlobControlLock before locking m_ConnectionLock in the call of CUDT::closeInternal. m_GlobControlLock orders after m_ConnectionLock. This lock is not necessary for calling closeInternal (it's however applied when the call is done from GC) as the function is called after the socket has been removed from all containers, including m_ClosedSockets, it should be then believed that no other thread should dispatch to it.

  2. Applied lock on m_RcvBufferLock in CUDT::readReady and CUDTUnited::checkBrokenSockets`. These functions were trying to access fields of common interest in the receiver buffer to be independently read out of lock, which constitutes a data race.

  3. The CSndBuffer::m_iCount field was turned into atomic. Although this field is under mutex lock in most of the operations, the method returning its value is without a lock. This field is fortunately self-standing as a value, so it suffices to make it atomic to make the reading thread safe, although it should undergo modifications still under a lock due to consistency requirements with other fields. Reading this value is only to recognize if the buffer is empty and it shouldn't be a problem if the buffer gets modified just after reading this value while it is 0.

  4. Removed locking on m_GlobControlLock for the group synchronization activities in CUDT::acceptAndRespond. Actually this lock was applied only in order to make the group kept unable to be deleted for the time of accessing it. For this there is a mechanism of counter lock provided, which still requires locking on m_GlobControlLock, but only for the time to modify the lock and therefore this could be moved outside of the lock section of CUDT::m_ConnectionLock, while the persistence of the group is maintained for the whole function lifetime. A second locking of m_GlobControlLock will also be done outside of it because the destruction order for the earlier created object makes it happen later.

  5. The call to CRcvQueue::ifNewEntry applied a lock on the container that is under a lock normally in all other places. Checking the container for emptiness must be also done under a lock. This is an unefficient solution: alternative solutions are either to lock-and-extract if available, or using a helper atomic boolean field that will be updated when anything has been added or removed from the buffer, but this solution at least fixes the problem.

  6. The m_dCongestionWindow field has been changed from double to atomic<int>. Actually everywhere where this value was read it was converted to int, so this shouldn't make a difference, while it was required to be made atomic as it was being read and written in two different threads.

NOTE: This doesn't fix all the reported TSAN problems, some other solutions are pending.

NOTE 2: The lock-order-inversion for the calls around setListeningState while the call of acceptAndRespond has been set as "tolerated". This is because even if the case of two different orders can be found and very close to one another, these two activities are impossible to happen in runtime and without doing something stupid in the listener callback.

@ethouris ethouris marked this pull request as draft February 19, 2021 12:53
@maxsharabayko maxsharabayko added [core] Area: Changes in SRT library core Type: Bug Indicates an unexpected problem or unintended behavior labels Mar 1, 2021
@maxsharabayko maxsharabayko added this to the v1.4.3 milestone Mar 1, 2021
@maxsharabayko maxsharabayko modified the milestones: v1.4.3, v1.4.4 Apr 13, 2021
@maxsharabayko maxsharabayko modified the milestones: v1.4.4, v1.4.5 Aug 17, 2021
@maxsharabayko maxsharabayko modified the milestones: v1.5.0, Backlog May 24, 2022
@maxsharabayko
Copy link
Collaborator

m_GlobControlLock protects structures of the CUDTUnited.
Looks correct, but maybe worth checking other places where the closeInternal is used, e.g. as->breakSocket_LOCKED(); in the same CUDTUnited::removeSocket(..) function.

// [[using locked(m_GlobControlLock)]]
void srt::CUDTSocket::breakSocket_LOCKED()
{
    // This function is intended to be called from GC,
    // under a lock of m_GlobControlLock.
    m_UDT.m_bBroken        = true;
    m_UDT.m_iBrokenCounter = 0;
    HLOGC(smlog.Debug, log << "@" << m_SocketID << " CLOSING AS SOCKET");
    m_UDT.closeInternal();
    setClosed();
}

@ethouris
Copy link
Collaborator Author

The breakSocket_LOCKED() is called under the lock of m_GlobControlLock, which is required just by the fact that it is run in the loop rolling over a socket container, protected by this mutex.

@ethouris ethouris changed the title [core] Fixing the wrong locking order around m_ConnectionLock (potential deadlock) [core] Fixing some lock-order-inversion and data race problems (TSAN reports) Jan 23, 2024
@ethouris ethouris marked this pull request as ready for review January 23, 2024 12:13
Copy link
Collaborator

@maxsharabayko maxsharabayko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor suggestion, the rest can be approved.

srtcore/core.cpp Outdated Show resolved Hide resolved
Co-authored-by: Maxim Sharabayko <maxlovic@gmail.com>
@maxsharabayko maxsharabayko merged commit 1bfab3d into Haivision:master Jan 23, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[core] Area: Changes in SRT library core Type: Bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants