Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't drop Binding Requests in Controlled Agent #426

Merged
merged 1 commit into from
Feb 22, 2022

Conversation

Sean-Der
Copy link
Member

A controlled Agent would discard incoming Binding Requests if it didn't
cause the pair to be selected. For UDP Candidate this would be
interpreted as packet loss. For TCP Candidates not responding with a
Binding Success could be interpreted as a failure.

Firefox's ICE Agent would disconnect TCP Candidates because of this
behavior.

Resolves to pion/webrtc#2125
Resolves to pion/webrtc#1356
See https://bugzilla.mozilla.org/show_bug.cgi?id=1756460

@codecov
Copy link

codecov bot commented Feb 21, 2022

Codecov Report

Merging #426 (fca128a) into master (fb4760c) will decrease coverage by 0.36%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #426      +/-   ##
==========================================
- Coverage   78.79%   78.43%   -0.37%     
==========================================
  Files          34       34              
  Lines        3882     3882              
==========================================
- Hits         3059     3045      -14     
- Misses        636      648      +12     
- Partials      187      189       +2     
Flag Coverage Δ
go 78.43% <100.00%> (-0.37%) ⬇️
wasm 25.16% <0.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
candidatepair.go 83.01% <ø> (ø)
selection.go 80.18% <100.00%> (-1.89%) ⬇️
udp_mux.go 80.23% <0.00%> (-1.75%) ⬇️
gather.go 65.79% <0.00%> (-1.41%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb4760c...fca128a. Read the comment docs.

@Sean-Der
Copy link
Member Author

@jech I have it now! Mind trying again?

The behavior would exhibit itself if FireFox made another request before it responded to Pion. I was able to reproduce by adding some arbitrary delays, then debugged on the FireFox side.

A controlled Agent would discard incoming Binding Requests if it didn't
cause the pair to be selected. For UDP Candidate this would be
interpreted as packet loss. For TCP Candidates not responding with a
Binding Success could be interpreted as a failure.

Firefox's ICE Agent would disconnect TCP Candidates because of this
behavior.

Resolves to pion/webrtc#2125
Resolves to pion/webrtc#1356
See https://bugzilla.mozilla.org/show_bug.cgi?id=1756460
@Sean-Der Sean-Der force-pushed the respond-binding-success-controlled-new-candidate branch from 4bb5d2c to fca128a Compare February 21, 2022 21:01
@jech
Copy link
Member

jech commented Feb 22, 2022

Very slight improvement:

  • works when Brave (Chromium) is the offerer;
  • fails when Brave is the answerer;
  • fails with Firefox, both as offerer and as answerer.
    Tested against github.com/pion/ice/v2 v2.2.1-0.20220221210118-fca128a058ec.

@Sean-Der
Copy link
Member Author

@jech does http://18.216.219.191:8080/ work for you on FireFox (Pion is answerer)

Will try the other configurations now!

@jech
Copy link
Member

jech commented Feb 22, 2022

Yes, it works fine.

What I suspect is going on is that I'm not disabling the other traffic types, I'm just dropping traffic so that ICE has to fallback to TCP after a timeout. If the timeouts are wrong, then ICE will switch to failed before it has time to go through all of the candidates.

@Sean-Der
Copy link
Member Author

Oh yes that makes sense! Let me try that now also. In my example I am only allowing TCP.

  • Do you trickle candidates?
  • Is the browser sharing candidates, or just the server?

@jech
Copy link
Member

jech commented Feb 22, 2022

Do you trickle candidates?

Yes.

Is the browser sharing candidates, or just the server?

Both.

Another element: if I look in Firefox's about:webrtc, I never see the TCP candidates switch to failed: all the other candidates do, but the two TCP candidates stay in progress until somebody triggers an ICE restart. (Both the client and the server trigger an ICE restart on failure, I'd need to instrument my code in order to find out who exactly triggers the restart.)

@Sean-Der
Copy link
Member Author

Sean-Der commented Feb 22, 2022

@jech To repro easier I started https://github.com/Sean-Der/ice-tcp-test it allows the browser to be Offer or Answer. Right now it only does non-Trickle. It is available at http://18.216.219.191:8080/ again.

  • I blocked all UDP traffic on my server sudo iptables -I INPUT -p udp -j DROP && sudo ip6tables -I INPUT -p udp -j DROP
  • I run and set my public IP go run main.go 18.216.219.191

This example works for me in FireFox/Brave/Chrome.

I will add Trickle tomorrow and hopefully I am able to reproduce! Thank you so much for your patience on this. I am really excited by all the progress we have made on it so far. Two very long standing bugs fixed.

Copy link
Contributor

@boks1971 boks1971 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@jech
Copy link
Member

jech commented Feb 22, 2022

It is available at http://18.216.219.191:8080/ again.

I'm not seeing any IPv6 candidates on that server.

@Sean-Der
Copy link
Member Author

@jech oh good point! I moved it to http://143.198.72.93:8080/ all tests done on FireFox+Chrome on MacOS. I have IPv4 and IPv6 connectivity working.

  • Tested with IPv4/IPv6 unblocked
  • Tested with IPv4 blocked
  • Tested with IPv6 blocked

Everything seems to be working with me. Mind testing again?

@jech
Copy link
Member

jech commented Feb 22, 2022

That's bizarre... I've successfully blocked UDP, and everything works as expected (host-reflexive TCP candidate). Let me compare with Galene.

@Sean-Der
Copy link
Member Author

Glad to hear it works! Thank you so much for all your patience/debugging help on this.

High point of my week to have these longstanding issues finally figured out. I have been a bit hesitant to take these on....

@jech
Copy link
Member

jech commented Feb 22, 2022

Bizarrer and bizarrer. Your code works fine even when run on my server. I've tried copying all the details from your code into Galene, I've tried disabling trickle ICE in Galene, I've tried disabling ICE restarts in Galene, and it still fails.

I'm at a loss.

@Sean-Der
Copy link
Member Author

@jech I am going to merge + pin, and then will deploy Galene to my server!

I don't have any good debugging ideas. I spent a lot of time printf debugging FireFox, no good single source for info with this stuff :/

@Sean-Der Sean-Der merged commit e0db6d2 into master Feb 22, 2022
@Sean-Der Sean-Der deleted the respond-binding-success-controlled-new-candidate branch February 22, 2022 17:17
@jech
Copy link
Member

jech commented Feb 22, 2022

It actually starts failing after a while the first one or two TCP candidates work, then they start failing. Apparently, the only way to fix the issue is to restart the server. The sockets are properly closed, the new connections are accepted, but no data flows.

For now, you may test at https://galene.org:8444/group/public/ (note the non-standard port, port 8443 is running mainline Galene).

To test on your own server, do

git clone -b ice-tcp https://github.com/jech/galene/
cd galene
CGO_ENABLED=0 go build
mkdir groups
echo '{"presenter":[{}]}' > groups/public.json
rsync -r galene public groups server.example.com:
ssh server.example.com ./galene -tcp-port 8888

Then visit https://server.example.com:8443/group/public/

@Sean-Der
Copy link
Member Author

@jech Does adding a replace statement make a difference for you?

diff --git a/go.mod b/go.mod
index 1f4692d..1a5931e 100644
--- a/go.mod
+++ b/go.mod
@@ -17,3 +17,5 @@ require (
        golang.org/x/crypto v0.0.0-20220214200702-86341886e292
        golang.org/x/sys v0.0.0-20220209214540-3681064d5158
 )
+
+replace github.com/pion/ice/v2 => github.com/pion/ice/v2 v2.2.1-0.20220221210118-fca128a058ec

@Sean-Der
Copy link
Member Author

I tried locally and got the same thing, but I believe the replace statement fixed it! I don't know Go modules well enough to dump what exact versions are being used/dependency tree.

@jech
Copy link
Member

jech commented Feb 22, 2022

Not for me, unfortunately. I'm seeing the same results as above: good behaviour for the first few connections after server restart, then ICE failures until I restart the server. (Which is actually a good thing — ICE failures is something we can easily report to the user, unlike the erratic behaviour we were suffering from before your patches.)

Very weird. Perhaps a data structure somewhere that accumultates obsolete data?

@Sean-Der
Copy link
Member Author

Sean-Der commented Feb 23, 2022

@jech I just tested 100 PeerConnections with ice-tcp-test and everything worked! (just refreshed browser 100 times and hit button)

Will test with Galene now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants