
Improving CUnitQueue performance and data race tolerance #2395

Merged

Conversation

maxsharabayko
Collaborator

@maxsharabayko maxsharabayko commented Jun 23, 2022

Protect CUnitQueue from data race

Protect CUnitQueue from data race with a dedicated mutex and atomics.
Previously, only CUnitQueue::m_iCount was protected, by virtue of being of an atomic type.
However, simultaneous access to CUnit::m_iFlag was possible from CUnitQueue::getNextAvailUnit() (the RcvQ thread) and CUnitQueue::makeUnitFree() (the receiving thread); see the sanitizer warning below.

There is also no protection against simultaneous calls to CUnitQueue::getNextAvailUnit() from different threads. Currently, however, such calls do not happen; a comment has been added for future consideration in case this changes.

As we don't want to block makeUnitFree() and makeUnitGood() on a mutex while new units are being allocated or a free one is being searched for, CUnit::m_iFlag and CUnitQueue::m_iNumTaken are marked atomic and don't have to be protected by a mutex. The remaining variables are left unprotected, as they are supposed to be accessed only from the same thread (the RcvQ thread).

Marking each CUnit::m_iFlag atomic may have a memory overhead and some performance cost when accessing it from the receiver buffer, etc. However, benchmarks comparing atomics, mutexes, and rwlocks indicate that atomics still perform better.
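A minimal sketch of the approach described above (simplified, hypothetical types, not the actual SRT code): the per-unit flag and the taken-units counter are atomics, so the receiving thread can release a unit without contending on a mutex held by the RcvQ thread.

```cpp
#include <atomic>

// Simplified stand-in for srt::CUnit with an atomic state flag.
struct Unit
{
    enum Flag { FREE = 0, GOOD = 1, PASSACK = 2, DROPPED = 3 };
    std::atomic<int> m_iFlag{FREE};
};

// Simplified stand-in for the queue-side bookkeeping.
struct UnitCounter
{
    std::atomic<int> m_iNumTaken{0};

    // Called from the RcvQ thread when a free unit is handed out.
    void makeUnitGood(Unit& u)
    {
        u.m_iFlag.store(Unit::GOOD);
        ++m_iNumTaken;
    }

    // Called from the receiving (application) thread; no mutex needed,
    // both variables are atomic.
    void makeUnitFree(Unit& u)
    {
        u.m_iFlag.store(Unit::FREE);
        --m_iNumTaken;
    }
};
```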

WARNING: ThreadSanitizer: data race (pid=21056)
  Write of size 4 at 0x007fedc062d0 by thread T2 (mutexes: write M171, write M298, write M296):
    #0 srt::CUnitQueue::makeUnitFree(srt::CUnit*) srt.git/srtcore/queue.cpp:244
    #1 srt::CRcvBufferNew::releaseUnitInPos(int) srt.git/srtcore/buffer_rcv.cpp:687
    #2 srt::CRcvBufferNew::readMessage(char*, unsigned long, SRT_MsgCtrl_*) srt.git/srtcore/buffer_rcv.cpp:366
    #3 srt::CUDT::receiveMessage(char*, int, SRT_MsgCtrl_&, int) srt.git/srtcore/core.cpp:6921
    #4 srt::CUDT::recvmsg2(char*, int, SRT_MsgCtrl_&) srt.git/srtcore/core.cpp:6801
    #5 srt::CUDT::recvmsg2(int, char*, int, SRT_MsgCtrl_&) srt.git/srtcore/api.cpp:3716
    #6 srt::CUDT::recvmsg(int, char*, int, long&) srt.git/srtcore/api.cpp:3699
    #7 srt_recvmsg srt.git/srtcore/srt_c_api.cpp:185

  Previous read of size 4 at 0x007fedc062d0 by thread T10:
    #0 srt::CUnitQueue::increase() srt.git/srtcore/queue.cpp:150
    #1 srt::CUnitQueue::getNextAvailUnit() srt.git/srtcore/queue.cpp:214
    #2 srt::CRcvQueue::worker_RetrieveUnit(int&, srt::CUnit*&, srt::sockaddr_any&) srt.git/srtcore/queue.cpp:1370
    #3 srt::CRcvQueue::worker(void*) srt.git/srtcore/queue.cpp:1232

Increased CUnitQueue block size

CUnitQueue allocates 32 additional units at the start and every time 90% of units are taken.

32 units of 1500 bytes each equal ~384 kbits. With an SRT buffering latency of 100 ms, for example, that covers a ~3.8 Mbps stream; with a buffering latency of 1 second, it covers only a 384 kbps stream.

Raising the block size to 128 units (~1.5 Mbits) is at least closer to the real bitrates of a live video stream. CUnitQueue::increase() is also called less frequently, as each increase now allocates ~1.5 Mbits' worth of units instead of 384 kbits.
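A back-of-the-envelope check of the arithmetic above (illustrative helpers, not part of the SRT code base):

```cpp
// Bits held by one allocation block of `units` units of `unitPayloadBytes` each.
long long blockBits(int units, int unitPayloadBytes)
{
    return static_cast<long long>(units) * unitPayloadBytes * 8;
}

// Bitrate (bits per second) that one block covers for a given buffering latency.
double coveredBitrate(long long bits, double latencySeconds)
{
    return bits / latencySeconds;
}
```

For the old block size, `blockBits(32, 1500)` is 384,000 bits; for the new one, `blockBits(128, 1500)` is 1,536,000 bits, i.e. ~1.5 Mbits per call to CUnitQueue::increase().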

Likely resolves #2346 together with #2405.

RAII for CUnitQueue

Previously, CUnitQueue construction and initialization were split between the constructor and CUnitQueue::init(). Now the constructor allocates all necessary resources. This also allows making CUnitQueue::m_iMSS constant, as all units are expected to be of the same size.

@maxsharabayko maxsharabayko added Type: Maintenance Work required to maintain or clean up the code [core] Area: Changes in SRT library core labels Jun 23, 2022
@maxsharabayko maxsharabayko added this to the Next Release milestone Jun 23, 2022
srtcore/queue.h Outdated Show resolved Hide resolved
@gou4shi1
Contributor

32 units of 1500 bytes equal to ~6 kbits.

32 * 1500B = 48000B = 48KB?

@gou4shi1
Contributor

gou4shi1 commented Jun 24, 2022

getNextAvailUnit() may take a long time; with a mutex added, couldn't it cause receiveMessage() -> makeUnitFree() to get stuck unnecessarily?
What about just making m_iFlag and m_iCount atomic? If they get out of sync (e.g., the flag is updated but the count is not yet), the worst result is just growing the queue unnecessarily.
But even with a mutex added, getNextAvailUnit() may still happen before makeUnitFree(), which may still result in an unnecessary increase.

@maxsharabayko
Collaborator Author

getNextAvailUnit() may take a long time; with a mutex added, couldn't it cause receiveMessage() -> makeUnitFree() to get stuck unnecessarily?

If there is only one socket bound to the receiving queue, there should be no fragmentation, and m_pAvailUnit points to the next free unit.
However, if two or more receiving sockets are bound to the same receiving queue, the search for a free unit is rather inefficient. Instead, some queue of free units could be used, although at the cost of additional memory consumption 🤔
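The free-unit queue idea mentioned above could look roughly like this (a sketch with assumed names, not SRT code): makeUnitFree() pushes the released unit onto a list, and getNextAvailUnit() pops in O(1) instead of scanning, at the cost of one stored pointer per free unit.

```cpp
#include <cstddef>
#include <deque>

struct Unit {}; // placeholder for the real unit type

// Hypothetical free-list layered on top of the unit storage.
class FreeUnitList
{
public:
    // Called when a unit is released back to the queue.
    void put(Unit* u) { m_free.push_back(u); }

    // O(1) retrieval of a free unit; nullptr means the caller
    // would have to grow the unit queue.
    Unit* take()
    {
        if (m_free.empty())
            return nullptr;
        Unit* u = m_free.front();
        m_free.pop_front();
        return u;
    }

    size_t size() const { return m_free.size(); }

private:
    std::deque<Unit*> m_free; // the extra memory: one pointer per free unit
};
```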

@ethouris
Collaborator

In earlier days I was researching the possibility of reception from multiple sockets. There are two ways you could do it:

  • One common unit queue for the whole group. Picking up units for reading a packet will require guarding, but you can free them in order in one thread. Fragmentation is possible: some packets will be effectively rejected and not put into the receiver buffer.
  • Every socket has its own unit queue as before, but the receiver buffer can consist of units from different queues. A unit must then contain a pointer to the queue it came from, and the queue must also be an object shared between the socket and the group, so that deleting a socket won't invalidate the queue until the buffer has released the queue's last unit (or socket deletion could be delayed). Fragmentation should be unlikely: even though some received packets are effectively rejected, pickup still always happens at the head and return at the tail, just with rejected packets returned to the queue earlier than packets that went through the receiver buffer. If it can be ensured that pickup and return happen at opposite ends of the queue, locking might not be required. Separate queues, each filled from a separate receiver thread, would also perform better.
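The lifetime requirement of the second option (the queue must outlive its socket while the receiver buffer still holds its units) could be sketched with shared ownership; all names here are hypothetical:

```cpp
#include <memory>

struct SharedUnitQueue; // the per-socket unit queue, shared with the group

// Each unit remembers which queue it came from; holding a shared_ptr keeps
// that queue alive until the buffer releases the queue's last unit.
struct Unit
{
    std::shared_ptr<SharedUnitQueue> m_pOrigin;
};

struct SharedUnitQueue
{
    int liveUnits = 0; // units currently held by receiver buffers
};
```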

@maxsharabayko maxsharabayko force-pushed the hotfix/cunitqueue_increase branch 3 times, most recently from 8bd763a to 026b5b3 Compare July 12, 2022 13:57
@maxsharabayko maxsharabayko marked this pull request as draft July 12, 2022 14:55
using an atomic.
Refactored common allocation code CUnitQueue::allocateEntry(..).
Allocates 128 additional units at the start and every time 90% of units are taken.
Previously was allocating only 32 units.
@maxsharabayko maxsharabayko marked this pull request as ready for review July 12, 2022 16:04
@maxsharabayko maxsharabayko merged commit ced76c7 into Haivision:master Jul 13, 2022
@maxsharabayko maxsharabayko deleted the hotfix/cunitqueue_increase branch July 13, 2022 07:06

CUnit* m_pAvailUnit; // recent available unit
/// Increase the unit queue size (by @a m_iBlockSize units).
/// Uses m_mtx to protect access and changes of the queue state.
Contributor


m_mtx was deleted?

Collaborator Author


Yes. It no longer makes sense with the atomic m_iFlag, but I forgot to update the comment :)
Thanks for noticing!

@wednesdayfrogcoder

You do not specify a memory order when storing to or reading from CUnit::m_iFlag, which means it uses sequentially consistent ordering by default. Depending on the requirements, you might get away with release-acquire ordering, which is less restrictive and may reduce latency and improve performance for this atomic variable.
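For illustration, release-acquire on a plain std::atomic would look like this (a sketch only; the actual SRT code goes through its own atomic shim layer, and the names here are stand-ins):

```cpp
#include <atomic>

std::atomic<int> flag{0}; // stand-in for CUnit::m_iFlag (0 == free)

// Writer (e.g. the thread freeing a unit): a release store makes all
// prior writes visible to any thread that acquire-loads this variable.
void setFree()
{
    flag.store(0, std::memory_order_release);
}

// Reader (e.g. the thread searching for a free unit): an acquire load
// pairs with the writer's release, avoiding the full fence that
// memory_order_seq_cst implies on some CPUs.
bool isFree()
{
    return flag.load(std::memory_order_acquire) == 0;
}
```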

@maxsharabayko
Collaborator Author

@wednesdayfrogcoder
Good point!
It might be a bit complicated to do, though, because we don't use C++ atomics directly, but rather a shim layer. Also, I'm not sure the impact would be noticeable.

Successfully merging this pull request may close these issues.

[BUG] CUnitQueue::increase cost high CPU