Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix a crash inside updateConnStatus #2194

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

cg82616424
Copy link
Contributor

At First , I have to emphasize that this is a crash which is very difficult to reproduce.4f2f51f6f8ee9567b92a842396543c7cMy program is running on about 20000 android devices, and crash about 1~2 times one day.
Secondly, let me try to explain the cause and recurrence conditions of crash.
0. the characteristic that my app use srt:
1) my app use srt not ony to transmit videos, but also files, messages and so on
2) app will maintain 30 ~ 40 SRT connections. If some links are broken, a new connection will be quickly established to supplement them
3)the survival time of each link is not necessarily from a few seconds to tens of minutes
4) the network speed may be very high sometimes, from 1Mbps to 40 Mbps
5) all connection are async mode,

  1. crash stacks:

4f2f51f6f8ee9567b92a842396543c7c

  1. I struggled to figure out the reason: garbage collector thread delete CUDT instances, but receive worker still try to access this CUDT pointer, one thread execution sequence that triggered the crash like below;
    1. User Thread create a new srt connection named A in async mode
    2. after a while, A connection reach connection timeout, and updateConnStatus function in Receiver Worker thread put A connection to toRemove List, at same time, the Receive Worker thread put B,srt connection to toProcess List,
    3. User Thread call srt_close to close connection A, we call this timepoint TA;
    4. Receiver Worker try to call processAsyncConnectRequest function to do something
    5. User Thread call srt_send and let srt connection C to send a lot of data, both srt_send in User Thread andprocessAsyncConnectRequest need m_ConnectionLock, at this case, the User Thread get m_ConnectionLock first and block Receiver Worker for more than 1 seconds
    6. garbage collector try to release not used connections, we call this timepoint TB; TB-TA > 1s, garbage collector deleted the CUDT instance that A have.
    7. Receiver worker successfully get m_ConnectionLock and continue to touch A's pointer to CUDT instance which garbage collector had already deleted
    8. crash happens
  2. I provide a solution to solve this problem, plz review my code

@maxsharabayko maxsharabayko added this to the v1.4.5 milestone Dec 1, 2021
@maxsharabayko maxsharabayko added [core] Area: Changes in SRT library core Type: Bug Indicates an unexpected problem or unintended behavior labels Dec 1, 2021
@codecov-commenter

This comment has been minimized.

@cg82616424
Copy link
Contributor Author

or @maxsharabayko @ethouris ,do you have some suggestions to solve this crash more gracefully?

@maxsharabayko
Copy link
Collaborator

@cg82616424 could you please define the version of SRT you experience this issue with?

@cg82616424
Copy link
Contributor Author

thanks, @maxsharabayko my srt version is 1.4.3.rc0; but I reviewed the code on master branch, I think, this problem is not fixed yet

@cg82616424
Copy link
Contributor Author

Hi @maxsharabayko ,will this pull request be merged?

@maxsharabayko
Copy link
Collaborator

Hi, @cg82616424. It is scheduled for review.

@maxsharabayko
Copy link
Collaborator

@cg82616424
SRT v1.4.4 includes PR #1919. It could have resolved the issue you describe.
Could you please check if this issue persists with the latest master (with one more PR #1950 that could also improve things)?

@cg82616424
Copy link
Contributor Author

Hi @maxsharabayko :
thanks for your message,
I reviewed PR #1919,PR #1950 and the code of master;
I dont think these pr and code in master can fix this problem properly.

@maxsharabayko maxsharabayko removed this from the v1.5.0 milestone Jun 1, 2022
@maxsharabayko maxsharabayko added this to the Backlog milestone Jun 1, 2022
@ethouris
Copy link
Collaborator

Ok, I made my own branch to solve the merge problems; I have a question though.

Access to what exactly data does this new mutex guard?

If my understanding is correctly, it simply prevents from deletion those multiplexers that are currently being under check in the updateConnStatus, but then they should be as well guarded by m_GlobControlLock.

I'm simply trying to understand the motivation. All other things rather look good, at least the new mutex seems to be used in the right order in all places where it's used, so I see no deadlock risk here.

@ethouris
Copy link
Collaborator

I have made my own branch with changes, in case when there were problems with this one: ethouris/dev-fix-2194. If you can, please apply changes from there and retest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[core] Area: Changes in SRT library core Type: Bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants