Implement basic network flow control #2803

ns6089 · 2024-07-04T17:35:41Z

Description

Combat RX/TX buffer overflows and improve multi-FEC on large frames.

Adopted from #1466
Should supersede #2787

Screenshot

Issues Fixed or Closed

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Dependency update (updates to dependencies)
Documentation update (changes to documentation)
Repository update (changes to repository files, e.g. .github/...)

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have added or updated the in code docstring/documentation-blocks for new or existing methods/components

Branch Updates

LizardByte requires that branches be up-to-date before merging. This means that after any PR is merged, this branch
must be updated before it can be merged. You must also
Allow edits from maintainers.

I want maintainers to keep my branch updated

ns6089 · 2024-07-04T17:44:10Z

Basic implementation wasn't that hard, but it uncovered a few problems.

I will try to implement hybrid sleep/spin wait after all, 0.5ms timer precision should be barely enough.
On large bitrates our Reed-Solomon implementation takes significant amount of time. I will try moving it to separate thread, but in the long run it won't be enough.

@cgutman Have you considered alternative RS implementations that make use of SIMD instructions?
Like this one https://github.com/catid/leopard

cgutman · 2024-07-04T18:41:51Z

On large bitrates our Reed-Solomon implementation takes significant amount of time. I will try moving it to separate thread, but in the long run it won't be enough.

I don't think this needs to block merging this PR, unless you think these changes are going to make it worse somehow.

@cgutman Have you considered alternative RS implementations that make use of SIMD instructions?

nanors has SIMD optimizations too, but we don't really get much on x86_64 because it only guarantees SSE2. Bumping it to SSSE3 would enable the hand-optimized codepaths that should be much faster. Assuming we're willing to drop support for CPUs without SSSE3, it should just be as simple as adding -mssse3 to

Sunshine/cmake/targets/common.cmake

Lines 88 to 90 in c92e4f2

    
           set_source_files_properties("${CMAKE_SOURCE_DIR}/third-party/nanors/rs.c" 
        
                   DIRECTORY "${CMAKE_SOURCE_DIR}" "${TEST_DIR}" 
        
                   PROPERTIES COMPILE_FLAGS "-include deps/obl/autoshim.h -ftree-vectorize")

If SSSE3 resolves the performance issues you saw at the bitrates we're targeting, that's probably enough. If we want to go further, we can use GCC/Clang function multiversioning to have the compiler build AVX2, SSSE3, and SSE2 variants, but that requires more extensive modifications. We could also switch to totally new library if necessary.

ns6089 · 2024-07-04T19:06:52Z

Yeah, enabling SSSE3 for nanors gives RS encode around 6 times speedup, bringing it to sub millisecond range. Don't need to bother with separate thread then, one less thing to worry about. We should absolutely enabled it.

cgutman · 2024-07-04T20:13:26Z

Excellent, that's a pretty substantial win.

In terms of hardware support we're dropping with SSSE3, it looks like Intel CPUs prior to Core 2/Atom (~2006), AMD CPUs prior to Bulldozer (~2011), and VIA CPUs prior to Nano (~2008). It looks like all common x64 emulation layers like Rosetta 2 and XTA/Prism support SSSE3, so there should be no issues there either.

The only non-SSSE3 CPUs there that might otherwise be performant enough for some modern games/applications would be AMD K10-based processors (Phenom II). However, those also lack AES-NI, so they're going to get punished by encryption too with ~20x more cycles/byte vs AES-NI.

Overall, I think bumping up the CPU requirement to SSSE3 is probably reasonable. Users with very old CPUs can use an older version of Sunshine.

ns6089 · 2024-07-05T16:20:00Z

@cgutman This generic low latency flow control doesn't seem to be working well, at least on Windows. WSASendMsg() in particular can take more than 5ms even when using extremely low batches (4 packets each), and that's with ethernet level (IEEE 802.3x) flow control disabled at link level. With a lot of calls the combined latency grows. Maybe we can outline all possible congestion scenarios and combat them on a case by case basis? I'm interested in what the aforementioned IEEE 802.3x can't cover. The first thing that comes to mind is streaming over the internet where WAN link is slower than LAN link, but we should be able to detect such cases and maybe apply higher latency throttling with 1ms packet batches.

ns6089 · 2024-07-05T20:31:33Z

The first thing that comes to mind is streaming over the internet where WAN link is slower than LAN link, but we should be able to detect such cases and maybe apply higher latency throttling with 1ms packet batches.

Or try using QoS shaping https://learn.microsoft.com/en-us/windows/win32/api/qos2/ns-qos2-qos_flowrate_outgoing
In theory this exactly what we want in this scenario, if it works of course.

ns6089 · 2024-07-06T00:56:37Z

Oh, turns out my travel wifi router has atrocious ethernet switch and more or less requires 0.1ms batches. I guess this is the kind of devices we have problems with. Will try to optimize for it I guess, even though WSASendMsg() occasionally shoots up to 10ms even on such abysmal batches,

ns6089 · 2024-07-06T09:00:22Z

Oh, turns out my travel wifi router has atrocious ethernet switch and more or less requires 0.1ms batches. I guess this is the kind of devices we have problems with. Will try to optimize for it I guess

Two things were necessary when dealing with this switch

Increasing udp socket send buffer size with SO_SENDBUF
Calling WSASendMsg() with small batches, 10 packets worked best

No sleep or pacing was needed, socket buffer took care of the congestion.
Streams easily at constant 300+Mbps now when before it couldn't even reach 100Mbps.

I guess I can try adding some throttling logic on top of this larger buffer. In theory it should be enough to appease slow clients without resorting to costly spin waits with periodic latency spikes.

ns6089 · 2024-07-07T21:57:40Z

Should be ready. Works really well on Windows from my tests, haven't tested Linux or MacOS.

@cgutman Maybe we can distinguish WAN streaming from LAN streaming based on the packet size? And apply more aggressive throttling in this case, based on requested bitrate. Something like 2 or 3 times the bitrate, will give 1/2 or 1/3 frame time latency.
ea2b9a6#diff-a8b463faa2b4901d7f646c0a3a856d705fd293dc7b99cee500a375dae91ef6bfR1402-R1403

codecov · 2024-07-07T23:38:05Z

Codecov Report

Attention: Patch coverage is 5.81395% with 162 lines in your changes missing coverage. Please review.

Project coverage is 8.99%. Comparing base (6607a28) to head (4fb8438).

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #2803    +/-   ##
=======================================
  Coverage    8.99%   8.99%            
=======================================
  Files          95      95            
  Lines       17312   17412   +100     
  Branches     8236    8272    +36     
=======================================
+ Hits         1557    1567    +10     
- Misses      12890   13118   +228     
+ Partials     2865    2727   -138

Flag	Coverage Δ
Linux	`6.76% <0.00%> (-0.05%)`	⬇️
Windows	`4.20% <5.69%> (+0.04%)`	⬆️
macOS-12	`10.03% <0.92%> (-0.07%)`	⬇️
macOS-13	`9.94% <0.92%> (-0.07%)`	⬇️
macOS-14	`10.24% <0.92%> (-0.07%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
src/platform/common.h	`36.90% <100.00%> (+1.53%)`	⬆️
src/platform/windows/display.h	`7.24% <ø> (+0.10%)`	⬆️
src/utility.h	`28.50% <100.00%> (ø)`
src/platform/windows/display_base.cpp	`12.95% <0.00%> (+0.45%)`	⬆️
src/platform/macos/misc.mm	`6.85% <0.00%> (-0.17%)`	⬇️
src/platform/linux/misc.cpp	`9.04% <0.00%> (-0.16%)`	⬇️
src/platform/windows/misc.cpp	`2.23% <14.58%> (+0.74%)`	⬆️
src/stream.cpp	`0.96% <0.00%> (+0.05%)`	⬆️

... and 20 files with indirect coverage changes

ns6089 · 2024-07-08T09:11:31Z

Fixed the typo in linux/macos code, builds shouldn't fail now.

ns6089 · 2024-07-08T20:10:31Z

I also have the follow-up refactoring to periodic loggers that remove most of the syntactic bloat from them, improving the readability.
ns6089@60e2c2f#diff-a8b463faa2b4901d7f646c0a3a856d705fd293dc7b99cee500a375dae91ef6bf
I can add it to this PR, if anyone prefers.

cgutman

Pacing behavior seems great in my testing on Windows and Linux. I'm seeing the expected behavior of densely packed FEC blocks without the nasty packet loss I had in my previous attempt.

Once these last few issues are resolved, this looks good to merge.

cmake/targets/common.cmake

src/stream.cpp

We were too conservative in determining our max data size before needing to split, which resulted in many frames being split into multiple FEC blocks unnecessarily. We also just used a hardcoded split into 3 blocks instead of actually calculating how many blocks are actually required.

ns6089 force-pushed the paced_send branch 3 times, most recently from 04dc814 to ea2b9a6 Compare July 7, 2024 21:40

ns6089 marked this pull request as ready for review July 7, 2024 21:52

ReenigneArcher mentioned this pull request Jul 8, 2024

Bump MAX_FEC_BLOCKS to 4 #2787

Closed

11 tasks

cgutman requested changes Jul 9, 2024

View reviewed changes

cmake/targets/common.cmake Outdated Show resolved Hide resolved

src/stream.cpp Outdated Show resolved Hide resolved

src/stream.cpp Outdated Show resolved Hide resolved

cgutman mentioned this pull request Jul 10, 2024

Provide SSSE3, AVX2, and AVX512 optimized Reed-Solomon functions #2828

Merged

11 tasks

ns6089 changed the title ~~Try to implement basic network flow control~~ Implement basic network flow control Jul 10, 2024

cgutman force-pushed the paced_send branch from b609f52 to ae1d7a6 Compare July 10, 2024 23:11

cgutman and others added 3 commits July 10, 2024 18:12

Maximize timer precision on Windows

945d1f8

Add static pacing to video network packets

4fb8438

cgutman force-pushed the paced_send branch from ae1d7a6 to 4fb8438 Compare July 10, 2024 23:12

cgutman approved these changes Jul 10, 2024

View reviewed changes

ReenigneArcher enabled auto-merge (squash) July 10, 2024 23:28

ReenigneArcher merged commit 037c61d into LizardByte:master Jul 11, 2024
48 of 49 checks passed

ns6089 mentioned this pull request Jul 12, 2024

Refactor periodic loggers and limit batch size for small packets #2835

Merged

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement basic network flow control #2803

Implement basic network flow control #2803

ns6089 commented Jul 4, 2024 •

edited

Loading

ns6089 commented Jul 4, 2024

cgutman commented Jul 4, 2024

ns6089 commented Jul 4, 2024

cgutman commented Jul 4, 2024

ns6089 commented Jul 5, 2024

ns6089 commented Jul 5, 2024

ns6089 commented Jul 6, 2024

ns6089 commented Jul 6, 2024

ns6089 commented Jul 7, 2024

codecov bot commented Jul 7, 2024 •

edited

Loading

ns6089 commented Jul 8, 2024

ns6089 commented Jul 8, 2024

cgutman left a comment

Implement basic network flow control #2803

Implement basic network flow control #2803

Conversation

ns6089 commented Jul 4, 2024 • edited Loading

Description

Screenshot

Issues Fixed or Closed

Type of Change

Checklist

Branch Updates

ns6089 commented Jul 4, 2024

cgutman commented Jul 4, 2024

ns6089 commented Jul 4, 2024

cgutman commented Jul 4, 2024

ns6089 commented Jul 5, 2024

ns6089 commented Jul 5, 2024

ns6089 commented Jul 6, 2024

ns6089 commented Jul 6, 2024

ns6089 commented Jul 7, 2024

codecov bot commented Jul 7, 2024 • edited Loading

Codecov Report

ns6089 commented Jul 8, 2024

ns6089 commented Jul 8, 2024

cgutman left a comment

Choose a reason for hiding this comment

ns6089 commented Jul 4, 2024 •

edited

Loading

codecov bot commented Jul 7, 2024 •

edited

Loading