
Improving CUnitQueue performance and data race tolerance #2395

Merged

Conversation

maxsharabayko
Collaborator

@maxsharabayko maxsharabayko commented Jun 23, 2022

Protect CUnitQueue from data race

Protect CUnitQueue from data race with a dedicated mutex and atomics.
Previously, only CUnitQueue::m_iCount was protected, by virtue of being of an atomic type.
However, simultaneous access to CUnit::m_iFlag was possible from CUnitQueue::getNextAvailUnit() (the RcvQ thread) and CUnitQueue::makeUnitFree() (the receiving thread); see the sanitizer warning below.

There is also no protection against simultaneous calls to CUnitQueue::getNextAvailUnit() from different threads. Currently, however, such calls do not happen; a comment has been added for future consideration in case this changes.

As we don't want to block makeUnitFree() and makeUnitGood() on a mutex while new units are being allocated or a free one is being searched for, CUnit::m_iFlag and CUnitQueue::m_iNumTaken are marked atomic and don't have to be protected by a mutex. The remaining variables are left unprotected, as they are supposed to be accessed only from the same thread (the RcvQ thread).

Marking each CUnit::m_iFlag atomic may have a memory overhead and some performance cost when accessing it from the receiver buffer, etc. However, benchmarks comparing atomics, mutexes, and rwlocks indicate that atomics still perform better.
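A minimal sketch of the approach described above (simplified, hypothetical types, not the actual SRT code): the per-unit flag and the taken-units counter are atomics, so the receiving thread can release a unit without contending on a mutex held by the RcvQ thread.

```cpp
#include <atomic>

// Simplified stand-in for srt::CUnit with an atomic state flag.
struct Unit
{
    enum Flag { FREE = 0, GOOD = 1, PASSACK = 2, DROPPED = 3 };
    std::atomic<int> m_iFlag{FREE};
};

// Simplified stand-in for the queue-side bookkeeping.
struct UnitCounter
{
    std::atomic<int> m_iNumTaken{0};

    // Called from the RcvQ thread when a free unit is handed out.
    void makeUnitGood(Unit& u)
    {
        u.m_iFlag.store(Unit::GOOD);
        ++m_iNumTaken;
    }

    // Called from the receiving (application) thread; no mutex needed,
    // both variables are atomic.
    void makeUnitFree(Unit& u)
    {
        u.m_iFlag.store(Unit::FREE);
        --m_iNumTaken;
    }
};
```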

WARNING: ThreadSanitizer: data race (pid=21056)
  Write of size 4 at 0x007fedc062d0 by thread T2 (mutexes: write M171, write M298, write M296):
    #0 srt::CUnitQueue::makeUnitFree(srt::CUnit*) srt.git/srtcore/queue.cpp:244
    #1 srt::CRcvBufferNew::releaseUnitInPos(int) srt.git/srtcore/buffer_rcv.cpp:687
    #2 srt::CRcvBufferNew::readMessage(char*, unsigned long, SRT_MsgCtrl_*) srt.git/srtcore/buffer_rcv.cpp:366
    #3 srt::CUDT::receiveMessage(char*, int, SRT_MsgCtrl_&, int) srt.git/srtcore/core.cpp:6921
    #4 srt::CUDT::recvmsg2(char*, int, SRT_MsgCtrl_&) srt.git/srtcore/core.cpp:6801
    #5 srt::CUDT::recvmsg2(int, char*, int, SRT_MsgCtrl_&) srt.git/srtcore/api.cpp:3716
    #6 srt::CUDT::recvmsg(int, char*, int, long&) srt.git/srtcore/api.cpp:3699
    #7 srt_recvmsg srt.git/srtcore/srt_c_api.cpp:185

  Previous read of size 4 at 0x007fedc062d0 by thread T10:
    #0 srt::CUnitQueue::increase() srt.git/srtcore/queue.cpp:150
    #1 srt::CUnitQueue::getNextAvailUnit() srt.git/srtcore/queue.cpp:214
    #2 srt::CRcvQueue::worker_RetrieveUnit(int&, srt::CUnit*&, srt::sockaddr_any&) srt.git/srtcore/queue.cpp:1370
    #3 srt::CRcvQueue::worker(void*) srt.git/srtcore/queue.cpp:1232

Increased CUnitQueue block size

CUnitQueue allocates 32 additional units at the start and every time 90% of units are taken.

32 units of 1500 bytes each equal ~384 kbits. With an SRT buffering latency of 100 ms, for example, that covers a ~3.8 Mbps stream; with a buffering latency of 1 second, it covers only a 384 kbps stream.

Raising the block size to 128 units (~1.5 Mbits) is at least closer to the real bitrates of a live video stream. CUnitQueue::increase() is also called less frequently, as each increase now allocates ~1.5 Mbits' worth of units instead of 384 kbits.
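A back-of-the-envelope check of the arithmetic above (illustrative helpers, not part of the SRT code base):

```cpp
// Bits held by one allocation block of `units` units of `unitPayloadBytes` each.
long long blockBits(int units, int unitPayloadBytes)
{
    return static_cast<long long>(units) * unitPayloadBytes * 8;
}

// Bitrate (bits per second) that one block covers for a given buffering latency.
double coveredBitrate(long long bits, double latencySeconds)
{
    return bits / latencySeconds;
}
```

For the old block size, `blockBits(32, 1500)` is 384,000 bits; for the new one, `blockBits(128, 1500)` is 1,536,000 bits, i.e. ~1.5 Mbits per call to CUnitQueue::increase().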

Likely resolves #2346 together with #2405.

RAII for CUnitQueue

Previously, CUnitQueue construction and initialization were split between the constructor and CUnitQueue::init(). Now the constructor allocates all necessary resources. This also allows making CUnitQueue::m_iMSS constant, as all units are expected to be of the same size.

@maxsharabayko maxsharabayko added Type: Maintenance Work required to maintain or clean up the code [core] Area: Changes in SRT library core labels Jun 23, 2022
@maxsharabayko maxsharabayko added this to the Next Release milestone Jun 23, 2022
srtcore/queue.h Outdated Show resolved Hide resolved
@gou4shi1
Contributor

32 units of 1500 bytes equal to ~6 kbits.

32 * 1500B = 48000B = 48KB?

@gou4shi1
Contributor

gou4shi1 commented Jun 24, 2022

getNextAvailUnit() may take a long time; with a mutex added, couldn't it cause receiveMessage() -> makeUnitFree() to get stuck unnecessarily?
What about just making m_iFlag and m_iCount atomic? If they get out of sync (e.g., the flag is updated but the count is not yet), the worst result is just growing the queue unnecessarily.
But even with a mutex added, getNextAvailUnit() may still happen before makeUnitFree(), which may still result in an unnecessary increase.

@maxsharabayko
Collaborator Author

getNextAvailUnit() may take a long time; with a mutex added, couldn't it cause receiveMessage() -> makeUnitFree() to get stuck unnecessarily?

If there is only one socket bound to the receiving queue, there should be no fragmentation, and m_pAvailUnit points to the next free unit.
However, if two or more receiving sockets are bound to the same receiving queue, the search for a free unit is rather inefficient. Instead, some queue of free units could be used, although at the cost of additional memory consumption 🤔
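The free-unit queue idea mentioned above could look roughly like this (a sketch with assumed names, not SRT code): makeUnitFree() pushes the released unit onto a list, and getNextAvailUnit() pops in O(1) instead of scanning, at the cost of one stored pointer per free unit.

```cpp
#include <cstddef>
#include <deque>

struct Unit {}; // placeholder for the real unit type

// Hypothetical free-list layered on top of the unit storage.
class FreeUnitList
{
public:
    // Called when a unit is released back to the queue.
    void put(Unit* u) { m_free.push_back(u); }

    // O(1) retrieval of a free unit; nullptr means the caller
    // would have to grow the unit queue.
    Unit* take()
    {
        if (m_free.empty())
            return nullptr;
        Unit* u = m_free.front();
        m_free.pop_front();
        return u;
    }

    size_t size() const { return m_free.size(); }

private:
    std::deque<Unit*> m_free; // the extra memory: one pointer per free unit
};
```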

@ethouris
Collaborator

In earlier days I was researching the possibility of reception from multiple sockets. There are two ways you could do it:

  • One common unit queue for the whole group. Picking up units for reading a packet will require guarding, but you can free them in order in one thread. Fragmentation is possible: some packets will be effectively rejected and not put into the receiver buffer.
  • Every socket has its own unit queue as before, but the receiver buffer can consist of units from different queues. A unit must then contain a pointer to the queue it came from, and the queue must also be an object shared between the socket and the group, so that deleting a socket won't invalidate the queue until the buffer has released the queue's last unit (or socket deletion could be delayed). Fragmentation should be unlikely: even though some received packets are effectively rejected, pickup still always happens at the head and return at the tail, just with rejected packets returned to the queue earlier than packets that went through the receiver buffer. If it can be ensured that pickup and return happen at opposite ends of the queue, locking might not be required. Separate queues, each filled from a separate receiver thread, would also perform better.
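The lifetime requirement of the second option (the queue must outlive its socket while the receiver buffer still holds its units) could be sketched with shared ownership; all names here are hypothetical:

```cpp
#include <memory>

struct SharedUnitQueue; // the per-socket unit queue, shared with the group

// Each unit remembers which queue it came from; holding a shared_ptr keeps
// that queue alive until the buffer releases the queue's last unit.
struct Unit
{
    std::shared_ptr<SharedUnitQueue> m_pOrigin;
};

struct SharedUnitQueue
{
    int liveUnits = 0; // units currently held by receiver buffers
};
```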

@maxsharabayko maxsharabayko force-pushed the hotfix/cunitqueue_increase branch 3 times, most recently from 8bd763a to 026b5b3 Compare July 12, 2022 13:57
@maxsharabayko maxsharabayko marked this pull request as draft July 12, 2022 14:55
using an atomic.
Refactored common allocation code CUnitQueue::allocateEntry(..).
Allocates 128 additional units at the start and every time 90% of units are taken.
Previously was allocating only 32 units.
@maxsharabayko maxsharabayko marked this pull request as ready for review July 12, 2022 16:04
@maxsharabayko maxsharabayko merged commit ced76c7 into Haivision:master Jul 13, 2022
@maxsharabayko maxsharabayko deleted the hotfix/cunitqueue_increase branch July 13, 2022 07:06

CUnit* m_pAvailUnit; // recent available unit
/// Increase the unit queue size (by @a m_iBlockSize units).
/// Uses m_mtx to protect access and changes of the queue state.
Contributor


m_mtx was deleted?

Collaborator Author


Yes. It no longer makes sense with the atomic m_iFlag, but I forgot to update the comment :)
Thanks for noticing!

@wednesdayfrogcoder

You do not specify a memory order when storing to or reading from CUnit::m_iFlag, which means it uses sequentially consistent ordering by default. Depending on the requirements, you might get away with release-acquire ordering, which is less restrictive and may reduce latency and improve performance for this atomic variable.
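For illustration, release-acquire on a plain std::atomic would look like this (a sketch only; the actual SRT code goes through its own atomic shim layer, and the names here are stand-ins):

```cpp
#include <atomic>

std::atomic<int> flag{0}; // stand-in for CUnit::m_iFlag (0 == free)

// Writer (e.g. the thread freeing a unit): a release store makes all
// prior writes visible to any thread that acquire-loads this variable.
void setFree()
{
    flag.store(0, std::memory_order_release);
}

// Reader (e.g. the thread searching for a free unit): an acquire load
// pairs with the writer's release, avoiding the full fence that
// memory_order_seq_cst implies on some CPUs.
bool isFree()
{
    return flag.load(std::memory_order_acquire) == 0;
}
```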

@maxsharabayko
Collaborator Author

@wednesdayfrogcoder
Good point!
It might be a bit complicated to do, though, because we don't use C++ atomics directly, but rather a shim layer. Also, I'm not sure the impact would be noticeable.

Successfully merging this pull request may close these issues.

[BUG] CUnitQueue::increase cost high CPU