Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement basic network flow control #2803

Merged
merged 3 commits into from
Jul 11, 2024

Conversation

ns6089
Copy link
Contributor

@ns6089 ns6089 commented Jul 4, 2024

Description

Combat RX/TX buffer overflows and improve multi-FEC on large frames.

Adopted from #1466
Should supersede #2787

Screenshot

Issues Fixed or Closed

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Dependency update (updates to dependencies)
  • Documentation update (changes to documentation)
  • Repository update (changes to repository files, e.g. .github/...)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated the in code docstring/documentation-blocks for new or existing methods/components

Branch Updates

LizardByte requires that branches be up-to-date before merging. This means that after any PR is merged, this branch
must be updated before it can be merged. You must also
Allow edits from maintainers.

  • I want maintainers to keep my branch updated

@ns6089
Copy link
Contributor Author

ns6089 commented Jul 4, 2024

Basic implementation wasn't that hard, but it uncovered a few problems.

  1. I will try to implement hybrid sleep/spin wait after all, 0.5ms timer precision should be barely enough.
  2. On large bitrates our Reed-Solomon implementation takes significant amount of time. I will try moving it to separate thread, but in the long run it won't be enough.

@cgutman Have you considered alternative RS implementations that make use of SIMD instructions?
Like this one https://github.com/catid/leopard

@cgutman
Copy link
Collaborator

cgutman commented Jul 4, 2024

On large bitrates our Reed-Solomon implementation takes significant amount of time. I will try moving it to separate thread, but in the long run it won't be enough.

I don't think this needs to block merging this PR, unless you think these changes are going to make it worse somehow.

@cgutman Have you considered alternative RS implementations that make use of SIMD instructions?

nanors has SIMD optimizations too, but we don't really get much on x86_64 because it only guarantees SSE2. Bumping it to SSSE3 would enable the hand-optimized codepaths that should be much faster. Assuming we're willing to drop support for CPUs without SSSE3, it should just be as simple as adding -mssse3 to

set_source_files_properties("${CMAKE_SOURCE_DIR}/third-party/nanors/rs.c"
DIRECTORY "${CMAKE_SOURCE_DIR}" "${TEST_DIR}"
PROPERTIES COMPILE_FLAGS "-include deps/obl/autoshim.h -ftree-vectorize")

If SSSE3 resolves the performance issues you saw at the bitrates we're targeting, that's probably enough. If we want to go further, we can use GCC/Clang function multiversioning to have the compiler build AVX2, SSSE3, and SSE2 variants, but that requires more extensive modifications. We could also switch to totally new library if necessary.

@ns6089
Copy link
Contributor Author

ns6089 commented Jul 4, 2024

Yeah, enabling SSSE3 for nanors gives RS encode around 6 times speedup, bringing it to sub millisecond range. Don't need to bother with separate thread then, one less thing to worry about. We should absolutely enabled it.

@cgutman
Copy link
Collaborator

cgutman commented Jul 4, 2024

Excellent, that's a pretty substantial win.

In terms of hardware support we're dropping with SSSE3, it looks like Intel CPUs prior to Core 2/Atom (~2006), AMD CPUs prior to Bulldozer (~2011), and VIA CPUs prior to Nano (~2008). It looks like all common x64 emulation layers like Rosetta 2 and XTA/Prism support SSSE3, so there should be no issues there either.

The only non-SSSE3 CPUs there that might otherwise be performant enough for some modern games/applications would be AMD K10-based processors (Phenom II). However, those also lack AES-NI, so they're going to get punished by encryption too with ~20x more cycles/byte vs AES-NI.

Overall, I think bumping up the CPU requirement to SSSE3 is probably reasonable. Users with very old CPUs can use an older version of Sunshine.

@ns6089
Copy link
Contributor Author

ns6089 commented Jul 5, 2024

@cgutman This generic low latency flow control doesn't seem to be working well, at least on Windows. WSASendMsg() in particular can take more than 5ms even when using extremely low batches (4 packets each), and that's with ethernet level (IEEE 802.3x) flow control disabled at link level. With a lot of calls the combined latency grows. Maybe we can outline all possible congestion scenarios and combat them on a case by case basis? I'm interested in what the aforementioned IEEE 802.3x can't cover. The first thing that comes to mind is streaming over the internet where WAN link is slower than LAN link, but we should be able to detect such cases and maybe apply higher latency throttling with 1ms packet batches.

@ns6089
Copy link
Contributor Author

ns6089 commented Jul 5, 2024

The first thing that comes to mind is streaming over the internet where WAN link is slower than LAN link, but we should be able to detect such cases and maybe apply higher latency throttling with 1ms packet batches.

Or try using QoS shaping https://learn.microsoft.com/en-us/windows/win32/api/qos2/ns-qos2-qos_flowrate_outgoing
In theory this exactly what we want in this scenario, if it works of course.

@ns6089
Copy link
Contributor Author

ns6089 commented Jul 6, 2024

Oh, turns out my travel wifi router has atrocious ethernet switch and more or less requires 0.1ms batches. I guess this is the kind of devices we have problems with. Will try to optimize for it I guess, even though WSASendMsg() occasionally shoots up to 10ms even on such abysmal batches,

@ns6089
Copy link
Contributor Author

ns6089 commented Jul 6, 2024

Oh, turns out my travel wifi router has atrocious ethernet switch and more or less requires 0.1ms batches. I guess this is the kind of devices we have problems with. Will try to optimize for it I guess

Two things were necessary when dealing with this switch

  1. Increasing udp socket send buffer size with SO_SENDBUF
  2. Calling WSASendMsg() with small batches, 10 packets worked best

No sleep or pacing was needed, socket buffer took care of the congestion.
Streams easily at constant 300+Mbps now when before it couldn't even reach 100Mbps.

I guess I can try adding some throttling logic on top of this larger buffer. In theory it should be enough to appease slow clients without resorting to costly spin waits with periodic latency spikes.

@ns6089 ns6089 force-pushed the paced_send branch 3 times, most recently from 04dc814 to ea2b9a6 Compare July 7, 2024 21:40
@ns6089 ns6089 marked this pull request as ready for review July 7, 2024 21:52
@ns6089
Copy link
Contributor Author

ns6089 commented Jul 7, 2024

Should be ready. Works really well on Windows from my tests, haven't tested Linux or MacOS.

@cgutman Maybe we can distinguish WAN streaming from LAN streaming based on the packet size? And apply more aggressive throttling in this case, based on requested bitrate. Something like 2 or 3 times the bitrate, will give 1/2 or 1/3 frame time latency.
ea2b9a6#diff-a8b463faa2b4901d7f646c0a3a856d705fd293dc7b99cee500a375dae91ef6bfR1402-R1403

Copy link

codecov bot commented Jul 7, 2024

Codecov Report

Attention: Patch coverage is 5.81395% with 162 lines in your changes missing coverage. Please review.

Project coverage is 8.99%. Comparing base (6607a28) to head (4fb8438).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #2803    +/-   ##
=======================================
  Coverage    8.99%   8.99%            
=======================================
  Files          95      95            
  Lines       17312   17412   +100     
  Branches     8236    8272    +36     
=======================================
+ Hits         1557    1567    +10     
- Misses      12890   13118   +228     
+ Partials     2865    2727   -138     
Flag Coverage Δ
Linux 6.76% <0.00%> (-0.05%) ⬇️
Windows 4.20% <5.69%> (+0.04%) ⬆️
macOS-12 10.03% <0.92%> (-0.07%) ⬇️
macOS-13 9.94% <0.92%> (-0.07%) ⬇️
macOS-14 10.24% <0.92%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/platform/common.h 36.90% <100.00%> (+1.53%) ⬆️
src/platform/windows/display.h 7.24% <ø> (+0.10%) ⬆️
src/utility.h 28.50% <100.00%> (ø)
src/platform/windows/display_base.cpp 12.95% <0.00%> (+0.45%) ⬆️
src/platform/macos/misc.mm 6.85% <0.00%> (-0.17%) ⬇️
src/platform/linux/misc.cpp 9.04% <0.00%> (-0.16%) ⬇️
src/platform/windows/misc.cpp 2.23% <14.58%> (+0.74%) ⬆️
src/stream.cpp 0.96% <0.00%> (+0.05%) ⬆️

... and 20 files with indirect coverage changes

@ns6089
Copy link
Contributor Author

ns6089 commented Jul 8, 2024

Fixed the typo in linux/macos code, builds shouldn't fail now.

@ReenigneArcher ReenigneArcher mentioned this pull request Jul 8, 2024
11 tasks
@ns6089
Copy link
Contributor Author

ns6089 commented Jul 8, 2024

I also have the follow-up refactoring to periodic loggers that remove most of the syntactic bloat from them, improving the readability.
ns6089@60e2c2f#diff-a8b463faa2b4901d7f646c0a3a856d705fd293dc7b99cee500a375dae91ef6bf
I can add it to this PR, if anyone prefers.

Copy link
Collaborator

@cgutman cgutman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pacing behavior seems great in my testing on Windows and Linux. I'm seeing the expected behavior of densely packed FEC blocks without the nasty packet loss I had in my previous attempt.

Once these last few issues are resolved, this looks good to merge.

cmake/targets/common.cmake Outdated Show resolved Hide resolved
src/stream.cpp Outdated Show resolved Hide resolved
src/stream.cpp Outdated Show resolved Hide resolved
@ns6089 ns6089 changed the title Try to implement basic network flow control Implement basic network flow control Jul 10, 2024
cgutman and others added 3 commits July 10, 2024 18:12
We were too conservative in determining our max data size before needing to split, which resulted in many frames being split into multiple FEC blocks unnecessarily.

We also just used a hardcoded split into 3 blocks instead of actually calculating how many blocks are actually required.
@ReenigneArcher ReenigneArcher enabled auto-merge (squash) July 10, 2024 23:28
@ReenigneArcher ReenigneArcher merged commit 037c61d into LizardByte:master Jul 11, 2024
48 of 49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants