Skip to content
This repository has been archived by the owner on Jul 12, 2023. It is now read-only.

WebRTC DTLS handsake quick connection #37

Merged
merged 8 commits into from
Apr 28, 2022

Conversation

slabajo
Copy link
Member

@slabajo slabajo commented Apr 20, 2022

What is the current behavior you want to change?

Current Kurento WebRTC connection introduces unneeded time to complete connection when using STUN or TURN candidates.
DTLS is started by the peer acting as client (with property "is-client" to TRUE) sending an initial DTLS Hello packet. That packet should be responded by the other peer and exchange of keys will happen. If the other peer does not respond, the client will wait for an amount of time before resending DTLS Hello packet. That amount of time starts in 1 second and doubles on each retry, so the retries comes at 1 second, then 2 seconds, then 4, and so on.
In KMS the DTLS connection is managed by the KmsWebrtcTransportSink component, more specifically by the embbeded dtlssrtpenc component. When instantiated in a KmsWebrtcTransportSinkNice component, the sending is managed by a nicesink element associated to the niceagent of the connection.
The problem appears because dtlssrtpenc element initiates DTLS connection as soon as it reaches the PAUSED or PLAYING state, and this always happens before the ICE negotiation has reached to a first valid candidate pair, at least when no HOST valid candidates are found. When this happens, the first DTLS Hello packet is silently dropped by the nicesink as there is no ICE connection established yet. And the timeout for next DTLS Hello starts to run.
The outcome is that when using STUN or TURN candidates, once a first valid pair is found, one or more DTLS Hello packets have already been sent, and as the timeout doubles each retry, it is likely that after ICE connection is established waiting for next retry will take a similar time to the what it took to establish ICE connection.

What is the new behavior provided by this change?

This change consist on the following modifications:

  • To prevent dtlssrtpenc to trigger DTLS initiation, we lock its state to NULL until we know for sure that it is ready to work, that is, until we know the peer is acting as server, or, if the peer is acting as client, until ICE gets to a CONNECTED state. At those points we unlock dtlssrtpenc state and synchronize with its parent. In fact, to impact less possible elements it is not the dtlssrtpenc component but the dtlsenc embbeded on it
  • We added a virtual method to the KmsWebrtcTransportSink to set the is-client property, to allow the KmsWebrtcTransportSinkNice to decide when to start DTLS. This must be taken into account for potential future KmsWebrtcTransportSink derived classes
  • We added a method to start DTLS connection. This is used on KmsWebrtcTransportSinkNice as soon as the sink knows it is server, or if it is client, as soon as ICE gets to a CONNECTED state
  • On KmsWebrtcTransportSinkNice configure, we subscribe to the ICE connection event to know when ICE connection reaches to CONNECTED state.

How has this been tested?

It has been tested building a test that is not included because it uses GStreamer 1.18 components to monitor when DTLS connection state changes (DTLS connection state is introduced in GStreamer 1.17)
Also the current tests have been validated, incidentally we have found (and provided a workaround) in the datachannel test in test_webrtcendpoint a race condition when releasing resources after test is finished.
Also we have verified the change running current KMS tutorials with the change incorporated.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature / enhancement (non-breaking change which improves the project)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • My change requires a change to the documentation
  • My change requires a change in other repository

Checklist

  • I have read the Contribution Guidelines
  • I have added an explanation of what the changes do and why they should be included
  • I have written new tests for the changes, as applicable, and have successfully run them locally

@j1elo
Copy link
Member

j1elo commented Apr 20, 2022

Also please add the tests like if you wanted them to run with the others, but leave it commented out and with a mark like this:

// UPGRADE: Uncomment this test when Kurento is made to work with GStreamer 1.18

So when we make the jump the code can be easily found and the test enabled.

@j1elo j1elo self-assigned this Apr 21, 2022
@slabajo
Copy link
Member Author

slabajo commented Apr 21, 2022

Just done, also made a change to the ice_state_changes test due to race condition that may cause a segfault

@j1elo
Copy link
Member

j1elo commented Apr 21, 2022

Sorry I've been playing with the "Review PR" feature of Github, last night I made several comments on parts of the code, but I think you didn't see them (did you before now? just to know) because I didn't click the button to "start a review"

(cherry picked from commit d3ed7b78b32aaa099ec34d0f7b7afff98b485877)
@slabajo
Copy link
Member Author

slabajo commented Apr 21, 2022

Changes done!.
In fact you are right, internal names do change from GStreamer 1.17. So, it is better to get internal elements of dtlssrtpenc by its factory name. So I have implemented that way. Unfortunately there is a method in GStreamer 1.18 that allows to do this out of the box, but not in current version, so I implemented a quick version just by iterating on internal elements (they are just 3 so it is not a big deal).

Incidentally, getting internal elements by name was already being done in KmsWebRtcTransportSink to configure internal funnel and srtp encoder. So I also changed also that, thus they are ready for updating to GStreamer 1.18+

@j1elo
Copy link
Member

j1elo commented Apr 21, 2022

Nice, thanks. I said that about internal names, because I know for a fact that it is (or was) being done in some places of Kurento already, and we found about it during the initial stages of Kurento 7.0, porting to GStreamer 1.14, where some names had already changed. So better to search by type instead of name. Only issue would be when multiple elements of the same type are in the Bin, but for simple cases like this one (only 1 element of a given type) it's trivial and more robust.

Hopefully we can upgrade soon to GStreamer 1.18. What version of Ubuntu comes (or will come) with it?

@slabajo
Copy link
Member Author

slabajo commented Apr 22, 2022

I am afraid that GStreamer 1.18 is available on a non LTE version, hirsute I think

@j1elo
Copy link
Member

j1elo commented Apr 27, 2022

@slabajo I've reviewed the newest changes. Did you test again with them? I'm not sure about something here, memory wise, and would like to review with you.

Before the latest review, you used gst_bin_get_by_name, which internally uses g_value_dup_object (see here). But in your latest code changes, that was replaced with g_value_get_object.

Checking the docs, I think the difference between get_object and dup_object is that with the former you get the actual pointer of the element that lives inside a GstBin, while the latter just gives you your own reference that you must handle.

So here I see that get_object is later followed by unref as I originally requested... but I feel like in theory now this is just freeing the memory of the objects directly inside the GstBin! which probably causes a crash or an invalid memory exception.

What do you think?

@slabajo
Copy link
Member Author

slabajo commented Apr 28, 2022

@j1elo In fact I got exactly the same doubt, because it is not clear, at least from documentation, how are iterated objects transfered or not. In principle, logic seems to tell that semantic should be as similar as possible to the get_object_by_xxxxx, but as I say it is not clear from the docs. So I just tested it.

Here you can see the reference count of one the inner elements in dtlssrtpenc bin (for instance the funnel element) before getting the iterator to the inner elements:
get_value1

You can see reference count is one. Then after the iterator moves to the selected element:
get_value2

As you can see ref count has been incremented when the iterator moves to that element, interestingly enough it is not incremented when the iterator is retrieved. It increments only when the iterator moves to the element. Also when you move the iterator to the next element, the previous one gets unreferenced

Of course, as you say the g_value_get_object does not increment reference count, it remains to 2:
get_value3

Last, but not least, when iterator is released, reference count from retrieved objects is not decreased:

get_value4

So, it seems the algorithm is ok as the loop gets the elements reffed on each gst_iterator_next and dereffed when we execute next gst_iterator_next. and only gets reffed when we reached the one we are looking for, as we don't execute any more gst_iterator_next.

In the end it seems the GStreamer guys have implemented the iterator in an easy way for the developer, avoiding to have deal with reffing and unreffing

@j1elo
Copy link
Member

j1elo commented Apr 28, 2022

OK I agree this is confusing because not freeing the GValue on each iteration means that a dangling reference is left, which just happens to coincide with the reference we need. But this, even if it works, is really working a bit against the intended use.

TL;DR: Please use g_value_dup_object() like they do in here: https://gitlab.freedesktop.org/gstreamer/gstreamer/-/blob/9e2bb037f0015807347f767d4c31d161bce0c969/subprojects/gstreamer/gst/gstbin.c#L4481, and use the g_value_reset for each iteration + g_value_unset for the end of the function.


I myself wasn't sure of how this API should be used, but now have a better picture: the GValue that is returned after each gst_iterator_next must be "freed" (released):

The caller is responsible for unsetting or resetting elem with g_value_unset or g_value_reset after usage.

Also, check the intro of GstIterator, this was the clearest to me:

The basic use pattern of an iterator is as follows:

        GstIterator *it = _get_iterator(object);
        GValue item = G_VALUE_INIT;
        done = FALSE;
        while (!done) {
          switch (gst_iterator_next (it, &item)) {
            case GST_ITERATOR_OK:
              ...get/use/change item here...
--->          g_value_reset (&item);
              break;
            case GST_ITERATOR_DONE:
              done = TRUE;
              break;
          }
        }
--->    g_value_unset (&item);
        gst_iterator_free (it);

Where the GValue is reset on each iteration, and unset (which releases everything) in the end. I think the remaining reference you noticed, would be cleared out just at this point.

Otherwise, our object would be left with an extra reference that in principle belongs to the GValue item, which is a local variable, so it disappears as soon as the function exits (kms_webrtc_transport_sink_get_element_in_dtlssrtpenc()).

(cherry picked from commit a0a036dd42bd79e04969f2eb1f511689412c398b)
@slabajo
Copy link
Member Author

slabajo commented Apr 28, 2022

Just updated the handling to make use of the g_value_dup_object. I see it is more clear now

@j1elo
Copy link
Member

j1elo commented Apr 28, 2022

@slabajo I thought that instead of having a custom implementation to search for the elements, it would be easier to maintain if we have the missing function gst_bin_iterate_all_by_element_factory_name() just copied from current GStreamer sources. That way, it would just be a matter of removing the gst code when Kurento is made to work with GStreamer 1.18, and our code will work as-is.

Also, this is even clearer code, as we can assume just a single element of the given type, so the first iterator result is already the one we're looking for.

Please enable the option to allow maintainer commits to your Pull Request: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork

@slabajo
Copy link
Member Author

slabajo commented Apr 28, 2022

For some reason the allow maintainer commit option does not appear for me, sorry for that

@j1elo
Copy link
Member

j1elo commented Apr 28, 2022

For some reason the allow maintainer commit option does not appear for me, sorry for that

OK it seems a limitation of GitHub itself: isaacs/github#1681

For some reason they allow the option for user-owned forks, but not for organization-owned ones. So I can only accept the PR and later make the changes I was talking about.

@j1elo j1elo merged commit d1769dc into Kurento:master Apr 28, 2022
@j1elo
Copy link
Member

j1elo commented Apr 28, 2022

OK the changes I was talking about are pushed now: cb3ca0f

It works by assuming that, given an iterator that filters by factory name, the very first result should be both of

a) The element we're looking for.
b) The only result of the iterator, as we'd expect only one element of each kind to be present.

@j1elo j1elo changed the title Dtls web rtc quick connection WebRTC DTLS handsake quick connection May 11, 2022
@slabajo slabajo mentioned this pull request Jun 27, 2022
8 tasks
j1elo pushed a commit that referenced this pull request Jul 7, 2022
## What is the current behavior you want to change?
This change is related to a previous PR in #37
More specifically this change covers a use case not covered in the previous PR. This is as follows:

 - When Kurento is set as DTLS server it may happen that the remote peer of the WebRTC connection reaches to ICE connected state before Kurento does. 
 - The problem appears when not only the remote peer gets ICE connected before but when the peer sends the DTLS Client Hello and it arrives before the WebRTCEndpoint in Kurento has reached the ICE connected state. 
 - If this happens, the DTLS Client hello is received in the nicesrc element from the Webrtctransportsrc and it is relayed to the dtlssrtpdec element. This in turns starts DTLS handshake and sends back a DTLS Server hello to the nicesink element. But, and this is the real problem, the nicesink finds the ICE connection not yet established and silently drops the DTLS server hello.
 - When the DTLS timeout is up (1 second this first time, but doubling each time), the endpoint will resend again the DTLS server hello packet, and hopefully this time it will find the ICE connection established and the DTLS server hello will be delivered to the remote peer. If not, it will try again on the next timeout. 
 - The problem is that each timeout introduces some additional delay in the WebRTC connection establishment. 
 - In many cases  that we have observed, the DTLS Client hello just arrives a few milliseconds before the ICE gets CONNECTED in WebRTCEndpoint. But the impact is that those few miliseconds will make the connection to delay around 1 second. 

This problem has a lesser impact than the previous PR, but in the end it is worth improving it.

## What is the new behavior provided by this change?
 - This change only takes effect when the WebRTCEndpoint is configured as DTLS server (waiter for DTLS handshake)
 - It just captures all packets coming from the nicesrc element to the dtlssrpdec and stores them until the ICE gets connected. 
 - A signal is connected so that when ICE connection changes state, and the new state is connected, it gets the temporary stored buffers and delivers them to the dtlssrtpdec element, thus ensuring the DTLS handshake process only takes place when the ICE connection is established.

According to our observations there is only one possible buffer stored that corresponds to the DTLS Client hello received from remote peer. Also, we noticed that if in the same call that notifies ICE connection we delivered the stored DTLS client hello, the answer (DTLS server hello) still finds the ICE connection not usable, so to avoid that, we just make that delivery asynchronously in a separate thread 10 ms later.

## How has this been tested?
IT has been tested with automatic tests that cannot yet be used in this Kurento version as they need GStreamer 1.18 to work.
We have also tested it in real environmentes capturing network traffic to see that the change causes the desired behaviour.
j1elo pushed a commit that referenced this pull request Jul 11, 2022
## What is the current behavior you want to change?
This change is related to a previous PR in #37
More specifically this change covers a use case not covered in the previous PR. This is as follows:

 - When Kurento is set as DTLS server it may happen that the remote peer of the WebRTC connection reaches to ICE connected state before Kurento does. 
 - The problem appears when not only the remote peer gets ICE connected before but when the peer sends the DTLS Client Hello and it arrives before the WebRTCEndpoint in Kurento has reached the ICE connected state. 
 - If this happens, the DTLS Client hello is received in the nicesrc element from the Webrtctransportsrc and it is relayed to the dtlssrtpdec element. This in turns starts DTLS handshake and sends back a DTLS Server hello to the nicesink element. But, and this is the real problem, the nicesink finds the ICE connection not yet established and silently drops the DTLS server hello.
 - When the DTLS timeout is up (1 second this first time, but doubling each time), the endpoint will resend again the DTLS server hello packet, and hopefully this time it will find the ICE connection established and the DTLS server hello will be delivered to the remote peer. If not, it will try again on the next timeout. 
 - The problem is that each timeout introduces some additional delay in the WebRTC connection establishment. 
 - In many cases  that we have observed, the DTLS Client hello just arrives a few milliseconds before the ICE gets CONNECTED in WebRTCEndpoint. But the impact is that those few miliseconds will make the connection to delay around 1 second. 

This problem has a lesser impact than the previous PR, but in the end it is worth improving it.

## What is the new behavior provided by this change?
 - This change only takes effect when the WebRTCEndpoint is configured as DTLS server (waiter for DTLS handshake)
 - It just captures all packets coming from the nicesrc element to the dtlssrpdec and stores them until the ICE gets connected. 
 - A signal is connected so that when ICE connection changes state, and the new state is connected, it gets the temporary stored buffers and delivers them to the dtlssrtpdec element, thus ensuring the DTLS handshake process only takes place when the ICE connection is established.

According to our observations there is only one possible buffer stored that corresponds to the DTLS Client hello received from remote peer. Also, we noticed that if in the same call that notifies ICE connection we delivered the stored DTLS client hello, the answer (DTLS server hello) still finds the ICE connection not usable, so to avoid that, we just make that delivery asynchronously in a separate thread 10 ms later.

## How has this been tested?
IT has been tested with automatic tests that cannot yet be used in this Kurento version as they need GStreamer 1.18 to work.
We have also tested it in real environmentes capturing network traffic to see that the change causes the desired behaviour.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants