[Epic] Fix the issues identified by the sanitizer #1813
A collective PR #1844 (required due to conflicting changes) has been created to fix the following problems:
|
This is a list of findings collected after the fix from #1844 (including a failed fix for UList nodes). These are problems that were likely re-identified; the numbers are the same as for the findings with the previous P* prefix:
New findings in the bonding tests:
The following items seem to require immediate attention:
Note: GP37 was introduced by the trial fix for GP18 and GP19, which has also failed. A more elaborate solution is required, although the GP37 problem itself can be ignored. |
#1848 submitted to address |
GP33 addressed by #1849. |
More information about the GP30+ findings, except those that were already fixed:
NOTE: Likely identical to P24. The call to
while simultaneously they are being written to by - likely (the stack was not recoverable in the test, but it is believed it could only happen here) in
SEVERITY: important. Likely the values calculated from these data might sometimes be wrong.
A total of 5 different cases of reading and writing the same field in two different threads without protection. SEVERITY: minor
"Sender buffer group". Severity unknown, but this problem exists since always.
The state of a connection is updated as part of closing, while another thread is reading the status. It's not a problem if this state reports connected while the socket is simultaneously being closed by another thread. SEVERITY: minor
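A minimal sketch of the atomic-flag approach for such state fields, assuming a momentarily stale read is acceptable; the struct and function are hypothetical stand-ins, not SRT code:

#include <atomic>

struct ConnStateSketch
{
    std::atomic<bool> bConnected{false}; // stands in for CUDT::m_bConnected
    std::atomic<bool> bClosing{false};   // stands in for CUDT::m_bClosing
};

// A reader may still see "connected" an instant before the close completes,
// which the finding above classifies as harmless.
inline bool isUsable(const ConnStateSketch& s)
{
    return s.bConnected.load(std::memory_order_acquire)
        && !s.bClosing.load(std::memory_order_acquire);
}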
In other reports there's e.g.:
SEVERITY: minor |
#1859 Fixes the problem of updating 4 fields, the issue described in GP30. |
After a test that included all of the above fixes, only the following findings were still collected:
This is extremely weird and of completely unknown reason. The sanitizer reports that "write happened" here (in the instruction inside the loop body):

void CUDT::EmitSignal(ETransmissionEvent tev, EventVariant var)
{
for (std::vector<EventSlot>::iterator i = m_Slots[tev].begin(); i != m_Slots[tev].end(); ++i)
{
i->emit(tev, var);
}
}

while simultaneously the "previous write" happened here (the case was identified by referring to the identical address):

Verb() << "(#" << mctrl.msgno << " %" << mctrl.pktseq << " " << BufferStamp(data.payload.data(), data.payload.size()) << ") " << VerbNoEOL;

The function that is next on the stack towards the one above is:
There's no pointer-to-function object visible in this call at all. It is unknown what could have caused this.
This problem is much like others that have been fixed using atomic; however, this is:

if (evt != TEV_ACKACK && evt != TEV_SEND && evt != TEV_RECEIVE)
{
m_tdSendInterval = microseconds_from((int64_t)m_CongCtl->pktSndPeriod_us());
m_dCongestionWindow = m_CongCtl->cgWindowSize();
}
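If these two assignments were converted the same way as the earlier atomic fixes, it could look like the sketch below. Illustrative declarations only, not the actual SRT code; the interval is held as a raw int64_t of microseconds:

#include <atomic>
#include <cstdint>

std::atomic<int64_t> g_tdSendInterval_us(0);   // hypothetical stand-in for m_tdSendInterval
std::atomic<double>  g_dCongestionWindow(0.0); // hypothetical stand-in for m_dCongestionWindow

void updateCC_sketch(int64_t pktSndPeriod_us, double cgWindowSize)
{
    g_tdSendInterval_us.store(pktSndPeriod_us, std::memory_order_relaxed);
    g_dCongestionWindow.store(cgWindowSize, std::memory_order_relaxed);
    // Caveat: a reader can still observe a new interval paired with an old
    // window; if the pair must stay consistent, a mutex is needed instead.
}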
This collects about 9 findings in total, mostly related to simultaneous reading and writing of the same object being an item in the sender buffer. Although there is a mutex intended to be used for the sender buffer, somehow it's used at best partially and only for some operations.

This is something that has existed since UDT times. Likely this problem wasn't detectable in file mode, because there the schedule window in the sender buffer immediately stretches to its full possible size and the flight window eats it up much more slowly anyway, so there's practically no way that the CSndQueue::worker thread reads the same buffer element that is being simultaneously filled by the main application thread. Even at the very end of a file, when there's a last packet buffer to be filled, it was unlikely to be simultaneously picked up to shift it from the schedule window to the flight window. In live mode it's different: the schedule window, if the thread layout is perfect and there's no choking in the network, may at worst stretch to 2, and most of the time it's 1 or even 0, so a collision on the same packet being stored and read by the sender thread is highly probable.

This is a long-term task that likely needs a redesign of the sender buffer, or at least a tough and very detailed analysis of the possible mutex locking to be added to the key operations in the sender buffer. For performance reasons this may also require the use of a shared lock (aka read-write lock - not available even in C++11), or simply the use of both atomics for key fields and mutexes.
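For illustration only, the shared-lock idea could look like the sketch below, assuming a C++17 toolchain (as noted above, C++11 offers no shared lock, so SRT would need its own implementation or atomics instead). All names are hypothetical:

#include <mutex>
#include <shared_mutex>

class SndBufferSketch
{
    mutable std::shared_mutex m_BufLock; // read-write lock in place of a plain mutex
    int m_iCount = 0;

public:
    int getCurrBufSize() const
    {
        std::shared_lock<std::shared_mutex> lk(m_BufLock); // many concurrent readers
        return m_iCount;
    }

    void addBuffer()
    {
        std::unique_lock<std::shared_mutex> lk(m_BufLock); // exclusive writer
        ++m_iCount;
    }
};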
Likely in |
IMPORTANT!!! All the above tests have been done by testing the group sender in backup mode only. This means that this isn't the complete set of tests that could be performed with the thread sanitizer, and more will have to be done. |
For TSAN_ATM_REPORT_MP-part003.txt: Likely m_RcvAckLock could be locked when … This isn't certain because the call to …

TSAN_ATM_REPORT_MP-part004.txt: This is a collision between readData and ackData. If we pretend that m_BufLock is guarding the internal data … (a sketch of the consistent-guard idea follows this list)

TSAN_ATM_REPORT_MP-part005.txt: The problem is on accessing the … The only problem is that the buffer might potentially be updated from both the incoming ACK (CRcvQueue::worker thread) … This fragment needs further research as per the correct usage of the fields modified here:
It is possible that this value could be protected by just an atomic modifier, but then there likely should be … OTOH, the important thing is that this emptiness might be a temporary state that will definitely be missed in …

TSAN_ATM_REPORT_MP-part006.txt: Similar to the problem in 005, the getCurrBufSize - another overload - was called while the sizes were being …

TSAN_ATM_REPORT_MP-part007.txt: idem

TSAN_ATM_REPORT_MP-part008.txt: Looks DANGEROUS, this modifies the block in the buffer directly assigned to …

TSAN_ATM_REPORT_MP-part009.txt: idem (…

TSAN_ATM_REPORT_MP-part010.txt: idem.

TSAN_ATM_REPORT_MP-part011.txt: idem, getSourceTime.
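If m_BufLock is indeed the intended guard for the block list (see the part004 item above), a consistent fix would take the same mutex on both colliding paths. A minimal sketch with placeholder bodies, not the real CSndBuffer logic:

#include <mutex>

void ackData_sketch(std::mutex& bufLock /* m_BufLock */)
{
    std::lock_guard<std::mutex> lk(bufLock);
    // shift m_pFirstBlock and update m_iCount / m_iBytesCount together
}

int readData_sketch(std::mutex& bufLock /* m_BufLock */)
{
    std::lock_guard<std::mutex> lk(bufLock);
    // read the current block under the same guard
    return 0;
}
|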
Ok, the test was conducted with all previous fixes applied, and no issues as described above were seen anymore. Here is the list of the new findings extracted when the testing application was the listener and receiver:

TSAN_RTM_REPORT_MP-part001.txt, TSAN_RTM_REPORT_MP-part005.txt: One order must be selected between m_GroupLock and m_ConnectionLock. Order G ; C --> locking the group container, then setting options. A single option locks m_ConnectionLock. In one of these cases one of the locks must be lifted (see the sketch after this list). COMPLEXITY: HIGH

TSAN_RTM_REPORT_MP-part002.txt: CUnitQueue::m_iCount should likely be atomic. COMPLEXITY: LOW

TSAN_RTM_REPORT_MP-part003.txt, TSAN_RTM_REPORT_MP-part004.txt: Collision between:
Some mutex must guard these data, and likely it should be m_RecvBufferLock. COMPLEXITY: MEDIUM. This shouldn't be a problem if adding m_RecvBufferLock is safe and fixes the problem.

TSAN_RTM_REPORT_MP-part006.txt: The official order so far is: m_GlobControlLock … The ordering for m_LSLock is undefined, and likely it can't be treated as final. The problem … COMPLEXITY: HIGH
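For the part001/part005 ordering problem, one textbook way to avoid committing to a fixed G ; C order at a given call site is to acquire both mutexes atomically; a hypothetical sketch, not SRT code:

#include <mutex>

void setGroupOption_sketch(std::mutex& groupLock, std::mutex& connLock)
{
    std::unique_lock<std::mutex> g(groupLock, std::defer_lock);
    std::unique_lock<std::mutex> c(connLock, std::defer_lock);
    std::lock(g, c); // deadlock-free acquisition of both locks
    // ... set the option on every member socket ...
}
|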
Related issue #2273 |
Moving some old overview from #1620 here for future reference.

Core threads synchronization overview

tsbpd()
A separate internal thread per receiving socket.

CUDT::tsbpd()
{
UniqueLock recv_lock (self->m_RecvLock);
CSync tsbpd_cc (self->m_RcvTsbPdCond, recv_lock);
while (!self->m_bClosing)
{
enterCS(self->m_RcvBufferLock);
self->m_pRcvBuffer->getRcvFirstMsg();
self->m_pRcvBuffer->skipData(seqlen);
self->m_pRcvBuffer->isRcvDataReady(..);
leaveCS(self->m_RcvBufferLock);
if (self->m_bSynRecving)
// TODO: [SYNC] Lock before signalling?
self->m_ReadyToReadEvent.notify_one();
self->s_UDTUnited.m_EPoll.update_events(self->m_SocketID, self->m_sPollID, SRT_EPOLL_IN, true);
CGlobEvent::triggerEvent();
if (tsbpdtime)
tsbpd_cc.wait_for(timediff);
else
tsbpd_cc.wait();
}
}

receiveMessage()

CUDT::receiveMessage()
{
UniqueLock recvguard (m_RecvLock);
CSync tscond (m_RcvTsbPdCond, recvguard);
if (m_bBroken || m_bClosing)
{
enterCS(m_RcvBufferLock);
const int res = m_pRcvBuffer->readMsg(data, len);
leaveCS(m_RcvBufferLock);
if (m_bTsbPd)
{
HLOGP(tslog.Debug, "Ping TSBPD thread to schedule wakeup");
tscond.signal_locked(recvguard);
}
}
if (!m_bSynRecving)
{
enterCS(m_RcvBufferLock);
const int res = m_pRcvBuffer->readMsg(data, len, (w_mctrl), seqdistance);
leaveCS(m_RcvBufferLock);
if (m_bTsbPd)
{
HLOGP(arlog.Debug, "receiveMessage: nothing to read, kicking TSBPD, return AGAIN");
tscond.signal_locked(recvguard);
}
if (!m_pRcvBuffer->isRcvDataReady())
{
// Kick TsbPd thread to schedule next wakeup (if running)
if (m_bTsbPd)
{
HLOGP(arlog.Debug, "receiveMessage: DATA READ, but nothing more - kicking TSBPD.");
tscond.signal_locked(recvguard);
}
}
return res;
}
do
{
if (stillConnected() && !timeout && !m_pRcvBuffer->isRcvDataReady(..))
{
/* Kick TsbPd thread to schedule next wakeup (if running) */
if (m_bTsbPd)
tscond.signal_locked(recvguard);
do
{
if (!m_ReadyToReadEvent.lock_wait_until(exptime))
{
if (m_iRcvTimeOut >= 0) // otherwise it's "no timeout set"
timeout = true;
}
} while (stillConnected() && !timeout && (!m_pRcvBuffer->isRcvDataReady()));
}
enterCS(m_RcvBufferLock);
res = m_pRcvBuffer->readMsg((data), len, (w_mctrl), seqdistance);
leaveCS(m_RcvBufferLock);
} while ((res == 0) && !timeout);
if (!m_pRcvBuffer->isRcvDataReady())
{
// Kick TsbPd thread to schedule next wakeup (if running)
if (m_bTsbPd)
tscond.signal_locked(recvguard);
s_UDTUnited.m_EPoll.update_events(m_SocketID, m_sPollID, SRT_EPOLL_IN, false);
}
return res;
}

processData(..)

CUDT::processData(..)
{
const bool need_tsbpd = m_bTsbPd || m_bGroupTsbPd;
// We are receiving data, start tsbpd thread if TsbPd is enabled
if (need_tsbpd && !m_RcvTsbPdThread.joinable())
{
StartThread(m_RcvTsbPdThread, CUDT::tsbpd, this, thname);
}
{
UniqueLock recvbuf_acklock(m_RcvBufferLock);
m_pRcvBuffer->addData(*i, offset);
}
// Wake up TSBPD on loss to reschedule possible TL drop
if (!srt_loss_seqs.empty() && m_bTsbPd)
CSync::lock_signal(m_RcvTsbPdCond, m_RecvLock);
if (!filter_loss_seqs.empty() && m_bTsbPd)
CSync::lock_signal(m_RcvTsbPdCond, m_RecvLock);
}

releaseSynch()

CUDT::releaseSynch()
{
// wake up user calls
CSync::lock_signal(m_SendBlockCond, m_SendBlockLock);
enterCS(m_SendLock);
leaveCS(m_SendLock);
m_ReadyToReadEvent.lock_notify_one();
CSync::lock_signal(m_RcvTsbPdCond, m_RecvLock);
// TODO: [SYNC] Protect TBBPD Thread join
//enterCS(m_NewDataReceivedLock);
if (m_RcvTsbPdThread.joinable())
{
m_RcvTsbPdThread.join();
}
//leaveCS(m_NewDataReceivedLock);
enterCS(m_RecvLock);
leaveCS(m_RecvLock);
}
|
This ticket is created to track the work on issues identified by the sanitizer.
Related tickets: #1547, #1609, #1610.
Findings
P00
==============================
False positives or sanitizer collision.
SEVERITY: NONE.
P01-LastRspAckTime
==============================
Accessing CUDT::m_tsLastRspAckTime without lock.
SEVERITY: LOW
P02-SndBuffer_count
==============================
Unprotected access to CSndBuffer::m_iCount through getCurrBufSize
with simultaneous modification from CSndBuffer::addBuffer
SEVERITY: MEDIUM
FOCUS: it might be that atomic can fix it, but the size is usually
modified together with the buffer contents and internal pointers, so
this is likely a data integrity violation (sketched below).
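To illustrate the FOCUS note, a hypothetical sketch of why an atomic counter alone does not restore integrity when the count changes together with the block pointers:

#include <atomic>

struct Block;

std::atomic<int> g_iCount(0);
Block*           g_pLastBlock = 0;

void addBuffer_sketch(Block* b)
{
    g_pLastBlock = b;                                 // (1) pointer updated
    g_iCount.fetch_add(1, std::memory_order_release); // (2) count updated
    // A reader running between (1) and (2) observes an inconsistent pair,
    // which is why a mutex over the whole update is the safer fix here.
}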
P03-CSndUList
==============================
CSndUList against CSndQueue and m_WindowCond/m_WindowLock locking
that caused accessing m_iLastEntry without locking. CV synchronization
and access control use two different locks located in two different classes,
shared by a plain pointer. Larger rework required.
SEVERITY: MEDIUM
FOCUS: This is a design flaw; two different mutexes in two different
classes (one class of interest accesses it by a pointer) guard the same data.
A planned rework code is already in progress.
P04-Cycle3
==============================
Report #8, Cycled ordering between CUDT::m_ConnectionLock and CUDT::m_SendLock,
with involved CSndUList::m_ListLock.
Report #28, Cycled ordering between CUDT::m_ConnectionLock and CUDTUnited::m_GlobControlLock
with involved CRcvQueue::m_LSLock.
SEVERITY: HIGH
FOCUS: This is a series of lock cycles, all of them must be resolved
under highest priority. This is a high deadlock risk.
P05-m_dCongestionWindow
==============================
Unguarded writing to CUDT::m_dCongestionWindow while reading
(likely requires atomic)
SEVERITY: LOW
P06-LastSendTime
==============================
Unguarded read of CUDT::m_tsLastSndTime,
likely should be under CUDT::m_ConnectionLock
SEVERITY: MEDIUM
FOCUS: Although this itself shouldn't be a big deal, the matter of
exactly what data is guarded by m_ConnectionLock is worth focusing on.
Theoretically it should guard data that are used during the connection
process, but then there are also data being used by workers and the
main thread simultaneously as a part of the data transmission activities,
and these don't even have dedicated mutexes to protect the data.
Not high priority because this looks like a problem derived from UDT.
P07-SendInterval
==============================
Unguarded write to CUDT::m_tdSendInterval
SEVERITY: LOW
P08-m_pLastBlock
==============================
Unguarded READING AND WRITING of CSndBuffer::m_pLastBlock.
Simultaneous access from CSndQueue::worker and CRcvQueue::worker.
Likely to be guarded by CSndBuffer::m_BufLock.
SEVERITY: MEDIUM
FOCUS: Likely a mutex that isn't consistently used.
P09-m_iFlowWindowSize
==============================
Unguarded access to CUDT::m_iFlowWindowSize
Written by: ACK report
Read by: sender worker when preparing a packet to send.
Likely to be fixed by atomic.
SEVERITY: LOW
P11-m_iSndLastAck
==============================
Unguarded read and write of CUDT::m_iSndLastAck
Likely to be solved by atomic
SEVERITY: LOW
P12-ACK-data
==============================
Unguarded reading and writing of CUDT::m_iRTT and CUDT::m_iRTTVar,
plus other fields belonging to those filled by incoming ACK message.
Likely atomic.
SEVERITY: LOW
P13-zPayloadSize
==============================
Unguarded r/w of LiveCC::m_zSndAvgPayloadSize
Likely atomic
SEVERITY: LOW
P14-SndBuf_CurBlock
==============================
Unguarded WRITE (!!!) to CSndBuffer::m_pCurrBlock, while it is read by CRcvQueue::worker with RecvAckLock and BufLock applied.
SEVERITY: MEDIUM
FOCUS: Likely the whole CSndBuffer should be under that protection,
although might be that atomic can fix it.
P15-SndBuf-BytesCount
==============================
Unguarded read of CSndBuffer::m_iBytesCount in CSndBuffer::getCurrBufSize().
May lead to data discrepancy as this function accesses also CSndBuffer::m_pFirstBlock,
both being modified in CSndBuffer::ackData.
Likely the reading function requires applying lock on m_BufLock.
SEVERITY: MEDIUM
FOCUS: Likely a series referring to guards for CSndBuffer.
P16-SndCurrSeqNo
==============================
Unguarded reading and writing of CUDT::m_iSndCurrSeqNo.
Important: in CUDT::packData this field is READ AND MODIFIED.
Likely atomic.
SEVERITY: MEDIUM
FOCUS: Looks solvable by atomic, but read-and-write is kinda
more complicated (see the sketch below).
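To illustrate the read-and-write concern, a hypothetical sketch of a wrapping atomic increment; the helper mirrors the CSeqNo::incseq arithmetic but is a local stand-in, not SRT's code:

#include <atomic>
#include <cstdint>

static const int32_t s_iMaxSeqNo = 0x3FFFFFFF; // sequence number domain limit

static int32_t incseq_sketch(int32_t seq) // wrapping increment
{
    return (seq == s_iMaxSeqNo) ? 0 : seq + 1;
}

static std::atomic<int32_t> s_iSndCurrSeqNo(0);

int32_t bumpSeqNo_sketch()
{
    int32_t expected = s_iSndCurrSeqNo.load(std::memory_order_relaxed);
    int32_t desired;
    do
    {
        desired = incseq_sketch(expected);
        // compare_exchange_weak reloads 'expected' on failure
    } while (!s_iSndCurrSeqNo.compare_exchange_weak(expected, desired));
    return desired;
}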
P17-SndBuffer_Block
==============================
Simultaneous read and write of CSndBuffer::Block::m_iLength.
This looks like simply a data race conflict between CSndBuffer::readData() and CSndBuffer::addBuffer().
Likely to be protected by CSndBuffer::m_BufLock, but this looks much more complicated.
Report #27 additionally contains a possible conflict between CChannel::sendto and CSndBuffer::addBuffer,
which seems to write into the same block, although this still seems to be a read
and write ordered by data dependencies, just not sanctioned by atomics or mutexes.
SEVERITY: MEDIUM
FOCUS: Likely a series referring to guards for CSndBuffer.
P18-pCryptoControl
==============================
Unguarded access to CUDT::m_pCryptoControl in CRcvQueue::worker, while it might be
simultaneously deleted through the call to srt_close() from the main thread.
SEVERITY: MEDIUM
FOCUS: Likely there's no mutex employed to guard particular data.
P19-CloseGC
==============================
Unguarded (mutexes applied, but different) access to CUDTUnited::m_bClosing.
Likely atomic.
FOCUS: Likely to be merged with P21.
P21-bool-state-flags
==============================
The CUDT::m_bConnected field is read without protection.
Similar reports about m_bClosing or m_bBroken.
Likely can be fixed by atomic, but might be that CUDT::m_ConnectionLock could be locked
for reading this field.
SEVERITY: LOW
P22-GC-against-RcvUList
==============================
CRNode::m_bOnList accessed for writing without protection. The same node
is accessed for reading in a GC thread in the loop rolling over closed sockets.
This could be a bigger problem related to the containment in m_pRcvUList, that is,
the loop over the CRcvQueue::m_pRcvUList might contain CUDT's that have been already
recycled into m_ClosedSockets. The case of m_bOnList is only a marker that the
node entry is invalid, as the entity has been removed from the container. The loop
in GC is also checking the node itself. This is likely something old and exists
since the beginning, so might be that atomic could solve it painlessly.
CRcvQueue::m_pHash contains CUDT entities that may belong to socket that are recycled
and GC is going to delete them. What happened here:
Simultaneously:
The socket of this entity is being deleted, so the destructor is called:
- for CUDT
- because destroying CUDTSocket
- because the CUDTUnited::removeSocket is deleting RECYCLED sockets
Likely how to fix it: the m_pHash container must be taken out and the entities
collected some other way. This could also be a container of the same kind and
purpose, just in a different place, and the activities performed on its contents
must be done strictly under a mutex that protects its contents. For that purpose
the sockets can use a busy flag to prevent them from deletion (sketched below).
And the loop in the CRcvQueue::worker, when finding a socket that has been marked
as closed, should remove it from the container immediately. The GC thread should
also remove the objects by itself from this container, which is why the whole
operation on the container should be guarded by a dedicated mutex.
In #38 there's a similar problem, the entity is extracted from the RcvUList list,
while the alleged entity is at the same time being deleted.
SEVERITY: HIGH
FOCUS: The socket deletion is done in a kinda dangerous way, and the conditions
under which the socket is allowed to be physically deleted are unclear.
In specific situations it may lead to deleting a socket that is under
operation in another thread, and this may end up with a deleted-memory access
and a crash.
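A hypothetical sketch of the busy-flag idea described in this finding (names and API are illustrative, not SRT's):

#include <atomic>

struct SocketEntrySketch
{
    std::atomic<int>  busy{0};       // workers currently using the socket
    std::atomic<bool> closed{false}; // set when the socket is recycled
};

// Worker side: pin the socket before use; back off if it is already closed.
bool tryAcquire(SocketEntrySketch& s)
{
    s.busy.fetch_add(1, std::memory_order_acquire);
    if (s.closed.load(std::memory_order_acquire))
    {
        s.busy.fetch_sub(1, std::memory_order_release);
        return false; // closed: drop it from the container, never dereference
    }
    return true;
}

void release(SocketEntrySketch& s)
{
    s.busy.fetch_sub(1, std::memory_order_release);
}

// GC side: mark closed first, then physically delete only when busy == 0.
bool canDelete(SocketEntrySketch& s)
{
    s.closed.store(true, std::memory_order_release);
    return s.busy.load(std::memory_order_acquire) == 0;
}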
P23-QueueClosingFlag
==============================
m_bClosing flag in the queues.
Likely fixable by atomic.
SEVERITY: LOW
P24-iRexmitCount
==============================
Unguarded reading of m_iReXmitCount in CRcvQueue::worker.
Likely needs guard of m_RcvAckLock, although atomic might help, too.
SEVERITY: LOW
Findings Related to Socket Groups
See below.
Attachments
The details are collected in directories in the attached archive.
TSAN_REPORT_PACK.tar.gz