Strange delays on data channel request/response round trip implementation #382
Hi @ozwaldorf! Welcome to str0m! I think the problem in your code is here: https://github.com/ozwaldorf/str0m-experiment/blob/main/server/src/driver.rs#L85 This wait will always happen, also when you are sending new data. The key here is that when you use the In the chat example we have this max timeout of 100ms, which means a potential write would be max delayed by 100ms. https://github.com/algesten/str0m/blob/main/examples/chat.rs#L130 but this is not optimal for a real server. When you use tokio, I recommend using the
Whichever case happens, it must swiftly loop back to Hope that helps!
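As a rough illustration of the 100ms cap mentioned above, here is a minimal sketch using only the standard library. The function name `wait_duration` and the `max_wait` parameter are made up for this example and are not part of str0m's API. The idea is only that the loop sleeps until the next deadline but never longer than a fixed cap, so a pending write is held back by at most that cap.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not a str0m API): how long the driver loop may sleep
/// before it has to wake up again.
fn wait_duration(now: Instant, deadline: Instant, max_wait: Duration) -> Duration {
    // A deadline that is already in the past yields zero instead of panicking.
    let until_deadline = deadline.saturating_duration_since(now);
    // Never sleep longer than the cap, so queued writes are not held back
    // for more than `max_wait`.
    until_deadline.min(max_wait)
}
```

With a 100ms cap, a deadline 500ms away gives a 100ms wait, while a deadline 20ms away gives 20ms. A cap like this bounds the worst-case write delay but does not remove it, which is presumably why it is described above as not optimal for a real server.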
Hey @algesten, thank you for the swift response!
Interestingly, I had tried previously using this exact flow with a I've updated the example source with the notify waker put back in, if you want to take a look again (ozwaldorf/str0m-experiment@241ab16) 🙏 Another solution I tried was using a channel for outgoing payloads instead of writing to the connection directly and waiting for a notify. The driver would select over receiving from the socket, the timeout, or receiving an outgoing payload, and call write in that last case, but this didn't have any effect on the delay either.
This looks better. However, there's a big difference between this and the chat example – this loop drives all clients, whereas the way chat.rs is structured there is one such loop per client (this is also how we use it in our own production system with tokio). I'm trying to think what the consequences of that would be.
Computers are fast, so for a test this may not matter, but I think it might scale badly if you get many clients.
Definitely agree, there is wasted iteration for idle clients while others are still progressing, but that shouldn't be too impactful for this simple experiment, as the test runs over a local connection for a single client. That said, it's still very strange that it takes ~1.5s for the first chunk to be received after sending the request to the server. With webrtc-rs, this time to first chunk is ~1ms for a local connection, so there's definitely something else off, as there are several orders of magnitude of difference.
It's not clear to me what you mean by request here. Are you saying that once DTLS and ICE are set up and you enter the "loop", it takes 1.5s for every block number request to get 8 chunks back to the client?
To follow the flow chart: after the setup, once the loop is entered and the request is sent, it takes around 1.5s to receive the first chunk back, and the other 7 chunks are received very quickly afterwards with no excess delays. Every time it loops, the client sends the request, and the same delay on the first chunk occurs consistently. Apologies if that wasn't clear before! I updated the flow chart to make the delay clearer as well.
You say that there was 200ms between the first and second chunk. Where is that time spent, in str0m?
I am no expert on datachannels, but here are some things I would try. Check how the webrtc-rs datachannel is configured, then configure str0m the same way. Configure the datachannel as unordered/unreliable. Add bias to your select to ensure you prioritize incoming packets before sending new ones.
This fixes a bad bug where a lot of packets were thrown away for big SCTP payload. Close #382
To flush the contents out we waited from 12.869 to 14.344 = 1.475 seconds. The above usage pattern is one we haven't tested much: it loads up str0m with 200k+ of data, then flushes it all out. This triggered a bug where we throw away half the SCTP packets. I imagine the reason it worked at all is because of resends. This test fails when I fix str0m, but it seems to be client specific:
@ozwaldorf thanks for reporting this bug!
Awesome work @algesten! Thanks for the fix
While experimenting with replacing webrtc-rs with str0m (as many others are looking to do), I ran into strange delays in a request/response round trip between a JavaScript client and a str0m server: it takes the client ~1.5 seconds to receive the first payload after sending the request. Example code that reproduces the behavior can be found here: https://github.com/ozwaldorf/str0m-experiment
The request flow looks like this:
A client receives the total number of blocks for some content, and can request each block id. Each block is <=256KiB, and is chunked into 32KiB payloads and sent over the webrtc datachannel.
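For reference, the chunking described above could be sketched roughly as follows. This is not the code from the linked experiment; the constant and function names are made up for illustration, and only the 256KiB and 32KiB sizes come from the description.

```rust
/// Sizes taken from the description above; the names are hypothetical.
const MAX_BLOCK_SIZE: usize = 256 * 1024;
const CHUNK_SIZE: usize = 32 * 1024;

/// Hypothetical helper: split one block into payloads of at most
/// `chunk_size` bytes. Only the last payload may be shorter.
fn chunk_block(block: &[u8], chunk_size: usize) -> Vec<&[u8]> {
    // `chunks` panics if `chunk_size` is zero, so callers must pass a positive size.
    block.chunks(chunk_size).collect()
}
```

A full 256KiB block yields 8 payloads of 32KiB each, which matches the 8 chunks per request discussed in this thread.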
The time to first byte (TTFB) after each request for the next block is very long, around 1.5 seconds, measured from when the JavaScript client sends the request to when it receives the first byte. By comparison, similar code written with webrtc-rs shows no such delay, with each TTFB being ~1ms.
The str0m server takes ~1ms to receive the request payload as a datachannel message, and ~220ms to write the 8 chunks to the rtc state and send the last chunk out over the socket. Note that ~200ms of that time is a delay between sending the first packet and the second packet; the remaining payload packets are sent only a few ms apart. A possible clue I found while inspecting network packets in Wireshark is that there always seems to be a STUN request happening while there are no outgoing payload packets from the server.
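One way to locate a stall like the ~200ms gap above is to log a timestamp for each outgoing packet and look for the largest gap between consecutive sends. The sketch below is a hypothetical helper using only the standard library; it is not part of str0m or the example repository, and it assumes the timestamps are in milliseconds and sorted in ascending order.

```rust
/// Hypothetical helper: given send timestamps in milliseconds (ascending),
/// return the index of the packet that precedes the largest gap, together
/// with the size of that gap. Returns `None` for fewer than two timestamps.
fn largest_gap(send_times_ms: &[u64]) -> Option<(usize, u64)> {
    send_times_ms
        .windows(2)
        .enumerate()
        // `saturating_sub` avoids an underflow panic if the input is not sorted.
        .map(|(i, pair)| (i, pair[1].saturating_sub(pair[0])))
        // On ties, `max_by_key` returns the last of the largest gaps.
        .max_by_key(|&(_, gap)| gap)
}
```

For made-up timestamps such as `[0, 200, 203, 206, 209]`, this reports a 200ms gap after packet index 0, the same shape as the delay between the first and second packet described above.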
I'd appreciate a glance over the example code and any recommendations if there is any misuse of the library. I haven't been able to find any glaring issues myself. Thanks!