Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in CRcvLossList::getFirstLostSeq() #1352

Closed
pcamina opened this issue Jun 17, 2020 · 6 comments · Fixed by #1431
Closed

Crash in CRcvLossList::getFirstLostSeq() #1352

pcamina opened this issue Jun 17, 2020 · 6 comments · Fixed by #1431
Assignees
Labels
[core] Area: Changes in SRT library core Priority: High Type: Bug Indicates an unexpected problem or unintended behavior
Milestone

Comments

@pcamina
Copy link

pcamina commented Jun 17, 2020

Bug description
My application crashes in the function CRcvLossList::getFirstLostSeq(). When the crash happens m_iHead is -1 whereas m_iLength is 1. This results in a EXC_BAD_ACCESS (SIGSEGV) exception since the index (m_iHead) into the array is -1. It does not happen very often and is difficult to reproduce, but since our application is used by many users there are many hundred SRT connections per day, it gets an issue.

I am wondering if it is a concurrency problem. I have seen that loss list accesses (remove and insert) are protected with the m_RcvLossLock lock. Since getFirstLostSeq() is not protected would it be possible that an ongoing remove operation (e.g. of the last element in the list) is interrupted at a point where m_iHead is already -1 but m_iLength is not yet 0?

To Reproduce
I am using SRT in SRTT_LIVE mode, TSBPD, non-blocking send / receive mode in sender and receiver. Up to nine connections per application.
As mentioned above it is very difficult to reproduce.

Desktop (please provide the following information):

  • OS: iOS 13
  • SRT Version / commit ID: 1.4.0

Additional context
Otherwise SRT works really well for me!

Cheers
Patrick Camina

@pcamina pcamina added the Type: Bug Indicates an unexpected problem or unintended behavior label Jun 17, 2020
@maxsharabayko maxsharabayko added the [core] Area: Changes in SRT library core label Jun 17, 2020
@maxsharabayko
Copy link
Collaborator

Hi @pcamina
This sounds like something that was fixed recently in PR #1333 and #1335.
Could you please check the latest master?

@maxsharabayko maxsharabayko added this to the v1.5.0 - Sprint 17 milestone Jun 17, 2020
@pcamina
Copy link
Author

pcamina commented Jun 18, 2020

Hi @maxsharabayko,

I checked the latest master and getFirstLostSeq() is not protected. Actually, I don't see any changes that address this issue. It seems to me that PR #1333 and #1335 are not related.

Maybe, the m_RcvLossLock lock could be used as well for the getFirstLostSeq() access?

Cheers
Patrick

@maxsharabayko
Copy link
Collaborator

@pcamina You are right, it is a different issue. Sorry, read the description too fluently.

getFirstLostSeq(..) does not modify the state of CRcvLossList. But it accesses m_caSeq[m_iHead], and theoretically after it checks m_iLength != 0 the state could be changed by another thread so that m_iHead becomes -1

On the other hand, all the work with CRcvLossList should be performed from the same thread. So the problem might actually come from some place where m_iHead and m_iLength are updated.

@maxsharabayko maxsharabayko modified the milestones: v1.5.0 - Sprint 17, v1.5.0 Jun 18, 2020
@pcamina
Copy link
Author

pcamina commented Jun 18, 2020

@maxsharabayko.

Actually, CRcvLossList::remove() is called in the CUDT::tsbpd thread, whereas the other CRcvLossListfunctions are performed in the CRcvQueue::worker thread.

So, my scenario is the following:
CRcvLossList::remove() is called in the CUDT::tsbpd thread. While remove(..) modifies the list, getFirstLostSeq(..) is called from the CRcvQueue::worker thread. This could, while removing the last entry, result in m_iLength still being 1 but m_iHeadbeing already -1 which would allow m_caSeq[m_iHead] to be accessed with a negative index.

What do you think?

Patrick

@maxsharabayko
Copy link
Collaborator

Right, I forgot about the TSBPD thread. That would explain this crash. Then the protection with m_RcvLossLock is needed.

@pcamina
Copy link
Author

pcamina commented Jul 6, 2020

I had added the lock and the problem disappeared.

@maxsharabayko maxsharabayko self-assigned this Jul 27, 2020
@maxsharabayko maxsharabayko modified the milestones: v1.5.0, v1.5.0 - Sprint 20 Jul 27, 2020
@mbakholdina mbakholdina modified the milestones: v1.5.0 - Sprint 20, v1.4.2 Oct 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[core] Area: Changes in SRT library core Priority: High Type: Bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants