Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ICE reconnect flashings #3930

Merged
merged 7 commits into from
Mar 9, 2021
Merged

Fix ICE reconnect flashings #3930

merged 7 commits into from
Mar 9, 2021

Conversation

keianhzo
Copy link
Contributor

@keianhzo keianhzo commented Feb 18, 2021

This PR adds support for better ICE failures recovery. Now instead of rejoining in case of an ICE failure we:

  • Restart ICE for browser that support it
  • Recreate the transports and recreate producers/consumers for browser that don't support ICE updateIceServers

This PR adds a couple of methods in the Dialog side for closing a transport and getting the consumers.

This has to be merged at the same time as: Hubs-Foundation/dialog#12

Copy link
Contributor

@netpro2k netpro2k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still reviewing the actual ICE related changes in naf-dialog-adapter but wanted to submit the review comments I already have. Mostly just some minor styling feedback on the events system being used.

Starting to look at naf-dialog-adapter more, at first glance it feels like (even prior to this PR) we are doing a lot of work polling the connection state and trying to reconnect things.. I wonder if we are actually hurting ourselves, since I was under the impression that the ICE agent in the browser should already be handling that for us... I want to read through more of the mediasoup-client library to see if its already doing any sort of reconnect logic that we might be getting in the way of with our own reconnect logic.

I also want to read more about TURN and our coturn implementation in particular to understand why our TURN sessions would be expiring. Surely there is some way this is supposed to be getting refreshed... If we fix that then I don't think we have to worry about special handling for Firefox since updateIceServers should never need to be called, meaning we never have to recreate the transports, which would greatly simplify this PR.

src/components/media-views.js Outdated Show resolved Hide resolved
src/components/media-views.js Outdated Show resolved Hide resolved
src/components/audio-feedback.js Outdated Show resolved Hide resolved
src/naf-dialog-adapter.js Outdated Show resolved Hide resolved
src/components/avatar-audio-source.js Outdated Show resolved Hide resolved
src/naf-dialog-adapter.js Outdated Show resolved Hide resolved
@@ -202,36 +224,91 @@ export default class DialogAdapter {
* https://developer.mozilla.org/en-US/docs/Web/API/RTCPeerConnection/setConfiguration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know its not from this PR, but kind of surprising that setConfiguration needs to be called here. The ICE servers (assuming this means STUN and TURN servers would not need to be changed in most cases (except as noted in the above comments when our TURN credentials have expired)... In the case we are using a non proxy candidate does this mean we should be able to restart with firefox without having to create new Transports?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I understand and what I saw during testing. When using TURN we need to update the ICE servers in case the credentials have expired otherwise we are good with just restarting ICE in both sides. As yous y say, if we are not using proxy candidates we can just restart ICE without updating the servers.

We can also check here if we are we are using a non proxy candidate to avoid an ICE server update as even when forcing TURN we still add a STUN server. I was assuming here that if TURN is forced is because the client is having ICE issues using STUN.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even when a candidate selected via STUN fails the STUN server itself should not need to change so I think if we can fix the credentials issue we should not ever need to change the ICE servers (except in the case of a dialog server transfer

src/components/avatar-audio-source.js Outdated Show resolved Hide resolved
@keianhzo keianhzo force-pushed the ice-restart-flashing branch from 353260d to 53a0462 Compare February 23, 2021 14:21
@keianhzo
Copy link
Contributor Author

@netpro2k I'm happy to pair anytime and go over this and see if we can simplify it.

@keianhzo keianhzo force-pushed the ice-restart-flashing branch from 6f81c98 to c39789a Compare February 25, 2021 13:27
@keianhzo
Copy link
Contributor Author

@netpro2k I've updated this PR to stop handling the disconnected state.

@keianhzo keianhzo force-pushed the ice-restart-flashing branch from c39789a to 6d45372 Compare February 25, 2021 13:54
@keianhzo keianhzo requested a review from netpro2k February 25, 2021 22:48
Copy link
Contributor

@netpro2k netpro2k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty happy with where we landed on this, nice work! I think the refactoring of naf-dialog-adapter is a nice step forward in starting to make it easier to reason about, and I think removing the watchdogs and handing stream recreation should hopefully lead to more reliability. I made some suggestions on future cleanup work we can do but I think most of it can be done in future PRs and we don't need to hold this back. I am still a bit confused on some of the stuff in restartSendICE and would like to discuss that, but other than that lgtm!

src/components/avatar-audio-source.js Show resolved Hide resolved
src/components/media-views.js Show resolved Hide resolved
src/naf-dialog-adapter.js Outdated Show resolved Hide resolved
src/naf-dialog-adapter.js Outdated Show resolved Hide resolved
}

// Resolve initial audio resolver since this person left.
const initialAudioResolver = this._initialAudioConsumerResolvers.get(peerId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not part of this PR but the whole _initialAudioConsumerResolvers thing is something I want to look into soon. This stuff still feels a bit fishy to be and would like to avoid it if we can. I would bet there are cases where we end up stalling on this for whatever reason. Its unclear why we need to wait for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that covers the case where that peer disconnected before it's audio consumer resolved which is something that we do upon connect to wait until we receive all the other peer's consumers. I agree that can cause some issues. I'm not sure why we do that though, is it really that important that we wait until we can hear all the other peers to connect?


try {
this._mediasoupDevice = new mediasoupClient.Device({});
async createSendTransport(iceServers) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't comb through the changes here very closely since I think this is just moving code around, let me know if that is not the case.

iceServers,
iceTransportPolicy: this._iceTransportPolicy
});
async createRecvTransport(iceServers) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think we should do it in this PR but just noting for future cleanup. Seems like we could DRY up creating the send/receive transports. A lot of the code in here is the same for both.

@keianhzo
Copy link
Contributor Author

keianhzo commented Mar 3, 2021

@netpro2k Updated with some of the change requests above. If you are ok, I'll move forward with landing this.

@keianhzo keianhzo merged commit 70d7630 into master Mar 9, 2021
@keianhzo keianhzo deleted the ice-restart-flashing branch March 9, 2021 12:46
console.error(`Error getting video stream for ${peerId}`, e);
});
if (stream) {
videoEl.srcObject = new MediaStream(stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! I discovered this doesn't work unless you also call videoEl.play(), because if the track dies the video is stopped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this happen in all browsers? Can you provide STRs? I can can't really reproduce it with current master in FF/Chrome.

The stream_update event in media-view.js and avatar-audio-source.js should only be as a consequence of a receive transport recreation, otherwise both components are not yet created so the stream_update event is not triggered on them. Maybe we are are missing some case where it's being triggered? I haven't found any but having STRs would really help.

A receive transport recreation should only happen in case where the transport has been closed but signaling is still opened. The transport disconnected/failed states are handled by the WebRTC stack and we should recover without any consumer recreation or transport updates whatsoever (so no stream_update events).

I couldn't also reproduce the issue mentioned above. How are you triggering it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants