Improve frametiming for linux capture #2333
Conversation
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

```text
@@            Coverage Diff             @@
##           nightly    #2333     +/-  ##
==========================================
+ Coverage     7.35%    7.41%   +0.06%
==========================================
  Files           95       95
  Lines        18949    18965      +16
  Branches      8130     8070      -60
==========================================
+ Hits          1393     1406      +13
- Misses       15857    16519     +662
+ Partials      1699     1040     -659
```
Don't worry about the patch "failure", but the lint failure will need to be addressed: https://github.com/LizardByte/Sunshine/actions/runs/8493409915/job/23268083421?pr=2333#step:5:12
I will test this PR ASAP. I'll also look into supporting NvFBC, if applicable.
Force-pushed from b8f6c4d to 7a7522d.
I force-pushed the revised commit. Unfortunately it's late again and my time for actual testing is limited. I'll do my best. (At least testing something game related is fun ;-))
As for the nvfbc codepath, I highly suggest opening a new issue (cc @peperunas) and making it only about NvFBC on Linux. The framerate is configured here: Sunshine/src/platform/linux/cuda.cpp, line 757 in bb7c2d5,
and then it's simply passed to NVIDIA's software: Sunshine/src/platform/linux/cuda.cpp, line 651 in bb7c2d5.
At least that's my understanding. What I find peculiar though is the `dwSamplingRateMs` parameter. This could be problematic depending on how it works in detail. As the sampling rate parameter is an integer, the delay between successive iterations will be 16 ms instead of 16.66666666 ms. If the NvFBC code does not wait and block until the next vsync, and vsync is probably disabled on the host, this will lead to inexact timing of the captured frames. In essence it would be similar to the situation that this PR addresses for kmsgrab and x11grab. What I find puzzling though is that this would in effect mean that this aspect of the API itself were broken, since it forces the interval to be quite imprecise.

I think that a) someone a bit more familiar with this should have a look at this and think it through, and b) there should be some in-depth testing as I did in my original bug report, i.e. check the long-term average framerate that moonlight-qt writes to its log at the end of the stream and post it in the newly opened issue. Even a tiny (but supposedly regular) deviation will easily lead to microstutter.

Theoretically a 16 ms interval would lead to 62.5 frames per second, some of which would have to be dropped somewhere. Or, if NVIDIA only emits a new frame when the display content has actually changed (let's say at a precise but unrealistic 16.66666 ms interval), there will regularly be situations where a whole 16 ms interval fits entirely within the 16.666 ms interval between two successive frames. This would then lead to an iteration without capture of a new frame. The next frame, a few microseconds later, would then have to wait almost a complete frame interval until it's finally captured by the next iteration and then encoded and emitted. I can't imagine how this could result in a perfectly smooth stream.
And even without that 16 ms interval fitting right into a 16.666 ms interval, you'd still have two frequencies close to each other. That will always lead to beating and hence issues with frame pacing, unless each 16 ms interval is dragged out via some syncing mechanism until the next 16.666 ms frame render interval is done.
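To put numbers on that beating, here is a quick back-of-the-envelope sketch (my own illustration, not code from the PR):

```cpp
// Sketch: drift between a truncated 16 ms capture interval and a
// 60 Hz (16.666... ms) frame interval, assuming both are exact.
#include <cstdio>

int main() {
  const double capture_ms = 16.0;         // integer-truncated interval
  const double frame_ms = 1000.0 / 60.0;  // 16.666... ms
  const double drift_ms = frame_ms - capture_ms;  // ~0.667 ms per iteration
  // After this many iterations the accumulated drift spans one whole
  // frame interval: one iteration sees no new frame (or one is dropped).
  const double beat_period = frame_ms / drift_ms;  // 25 iterations
  std::printf("drift per iteration: %.3f ms\n", drift_ms);
  std::printf("beat period: %.0f iterations (~%.0f ms)\n",
              beat_period, beat_period * frame_ms);
  return 0;
}
```

In other words, a hiccup roughly every 417 ms, which is exactly the kind of regular deviation that shows up as microstutter.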
Oh well, I'll probably have to take all that stuff about NvFBC back. I don't know how, but somehow I must have missed this part: Sunshine/src/platform/linux/cuda.cpp, lines 808 to 812 in bb7c2d5,
which does exhibit the same issue as kmsgrab and x11grab. The `dwSamplingRateMs` thing is probably just there to populate the structure passed to NvFBC.
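For illustration, the truncation would look roughly like this (a sketch using field names from NVIDIA's public NvFBC header, not Sunshine's actual code):

```cpp
// Sketch, not Sunshine's code: requesting 60 fps through an integer
// millisecond field truncates the sampling interval to 16 ms.
NVFBC_CREATE_CAPTURE_SESSION_PARAMS params = {};
params.dwVersion = NVFBC_CREATE_CAPTURE_SESSION_PARAMS_VER;
params.dwSamplingRateMs = 1000 / 60;  // == 16, not 16.666...
```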
I must admit that I'm struggling a bit with following the code.
I've gone ahead and copy-pasted the change from (kms|x11)grab to cuda.cpp. Let's see how it turns out... cc @peperunas
As mentioned in the related issue, the patch works well for me. Great job!
Thanks for testing! Just to make sure: that's still using NvFBC for capturing (using patched drivers)?
We should probably remove this note: https://github.com/LizardByte/Sunshine/pull/1438/files |
I'm not sure either way, but I can of course remove that note. There seem to be two separate issues though in #1429:
That second microstutter could well be the one fixed here. Not sure about the first one. Also: does it introduce that stutter only while streaming, or also when using gamescope as compositor and playing locally on the host? (cc @Arbitrate3280) As for the first one, it does remind me of my oldish mini PC: if I start Moonlight from inside a desktop environment using Ubuntu's default compositor (GNOME's Mutter), it's an uneven mess, but if I start it from a plain virtual console, it's buttery smooth (with my PR, that is...). That's client-side though, not host-side.
Can you also update wlgrab.cpp?
Sure, I just didn't think of it! But just now I'm noticing that it raises all of those questions that I didn't want to raise in this PR (because I'm not sure I'll have the time to properly follow through and do all the investigation and experimentation). Anyway, the waiting loop in (kms|x11)grab looks like this (before my PR): Sunshine/src/platform/linux/kmsgrab.cpp, lines 1199 to 1206 in 2da6fb0.
It first waits using `sleep_for` and then tries to do a kind of busy-loop until the right nanosecond (which is unrealistic). I was wondering what the point of this approach is. Is it a) because no sleep function in the standard library is precise enough, or b) a vague attempt to spin up the CPU and have it ramp up its frequency? I suppose it's a). But won't any `sleep_for` yield to the kernel scheduler anyway? In that case line 1203 would negate the whole idea of first having the supposedly inexact `sleep_for()` for 2/3 of the interval complemented by a busy-wait. (I'm noticing that `top` gives some 7-9% CPU usage when Sunshine is streaming, not the expected 33% of a core if it were indeed busy-waiting. What is the CPU usage currently when using wlgrab?)
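For reference, the pattern in question is roughly this (a paraphrased sketch of the pre-PR logic; `next_frame` and `wait_for_frame` are hypothetical names, not a verbatim quote of kmsgrab.cpp):

```cpp
#include <chrono>
#include <thread>

using namespace std::chrono;

// Sketch of the pre-PR waiting logic in (kms|x11)grab, paraphrased.
void wait_for_frame(steady_clock::time_point next_frame) {
  auto now = steady_clock::now();
  if (next_frame > now) {
    // Sleep for roughly two thirds of the remaining interval...
    std::this_thread::sleep_for((next_frame - now) * 2 / 3);
  }
  // ...then "busy"-wait for the rest. Note that sleep_for(1ns) still
  // yields to the scheduler, so this is not a true busy-wait.
  while (steady_clock::now() < next_frame) {
    std::this_thread::sleep_for(1ns);
  }
}
```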
Within wlgrab.cpp, on the other hand, there is a true busy-wait: Sunshine/src/platform/linux/wlgrab.cpp, lines 135 to 141 in 2da6fb0.

So in essence I'm wondering if that busy-wait could be simplified in the same way. Here is the commit that introduced the 1 ns sleep: e3f642a.

To be honest, I'm wondering how important all of that timing micro-optimization really is, in particular if we ensure that on average the pacing is the theoretically exact 60.00 fps (or whatever is requested by Moonlight). How large could this overshoot be? Is it even reasonable to reserve a third(!) of the time for busy-waiting until the precise nanosecond? I googled around a bit, and all I could find suggests that in practice the overshoot of a plain `sleep_for` is well below a millisecond on a modern kernel. What do you think?
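One way to get a feel for realistic sleep overshoot is a tiny standalone experiment like this (my own sketch, unrelated to Sunshine's code):

```cpp
// Standalone experiment (not Sunshine code): measure the worst-case
// overshoot of sleeping until a 60 Hz deadline, over ten seconds.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <thread>

int main() {
  using namespace std::chrono;
  nanoseconds worst {0};
  for (int i = 0; i < 600; ++i) {
    auto target = steady_clock::now() + nanoseconds {16'666'667};
    std::this_thread::sleep_until(target);
    worst = std::max(worst, duration_cast<nanoseconds>(steady_clock::now() - target));
  }
  std::cout << "worst overshoot: "
            << duration_cast<microseconds>(worst).count() << " us\n";
}
```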
BTW software encoding (kmsgrab) also leads to the expected 60.00 fps:
(I'm not sure why the encoding misbehaved 3 or 4 times in those 90 minutes (it could well be that the host did some unrelated background processing), but apart from that it was as smooth as can be expected given those global stats.)
Investigating all those timing questions in my previous post was unexpectedly straightforward, thanks to the contributor who already added all the overshoot measurement code to the Windows side of Sunshine. IOW, I just copied over the code from Windows to Linux kmsgrab and did some tests.

First test

Current sleep methodology (`sleep_for` for 2/3 of the time, then the suspicious "busy"-waiting), streaming my GNOME desktop, mostly idle, running glxgears, wiggling it around a bit, light stuff. As you can see below (in the first part of the log output) the maximum overshoot is negligible (<1 ms). Then I loaded the CPU with a synthetic stress workload.
Second test

Same as before, except that I removed the busy-wait logic and replaced it with a single `sleep_for` covering the whole remaining interval.
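In essence the replacement is just this (a sketch, using the same hypothetical `next_frame` as above):

```cpp
// Sketch: sleep the whole remaining waiting period in one call,
// with no trailing spin loop.
auto now = std::chrono::steady_clock::now();
if (next_frame > now) {
  std::this_thread::sleep_for(next_frame - now);
}
```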
Only by adding more stress (starting a YouTube video in Firefox in addition to the synthetic load) could I provoke noticeably larger overshoots. Of course this test might not correspond perfectly to the main use case, but I think it should be a good enough test of how stable the timings are under CPU load. My plan for this PR is now:

- replace the sleep-then-spin waiting logic with a single sleep in all Linux capture backends, and
- keep the overshoot logging in place so the timings can be verified in the nightlies.
If it turns out after more varied testing of the nightlies that the overshoot is deemed problematic after all, it's fairly easy to add a busy-wait back in, but in that case it probably should not call `sleep_for` inside its inner loop.

Third test

I also tested (still with kmsgrab) the true busy-wait loop as it can currently be found in wlgrab.cpp.
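That loop is, in essence, a pure spin (sketch, hypothetical names again):

```cpp
// Sketch of a true busy-wait: never yields, pegging one core until
// the deadline passes. Precise, but expensive in CPU time.
while (std::chrono::steady_clock::now() < next_frame) {
  // spin
}
```

That precision comes at the cost of the high CPU usage mentioned in the commit message below.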
Outstanding work @gschintgen!
Before this commit kmsgrab, x11grab, cuda used a "busy" wait that did not in fact keep the thread busy since it called `sleep_for()`. The whole waiting period is now slept for at once. Wlgrab used a true busy wait for a third of the waiting time. This entailed a rather high CPU usage. The waiting method has been aligned to the simplified method of the other Linux capture backends. (Benchmark data can be found in the discussion of PR LizardByte#2333.)
```cpp
sleep_overshoot_tracker.collect_and_callback_on_interval(overshoot_ns.count() / 1000000., print_info, 20s);
}

std::chrono::nanoseconds overshoot_ns = std::chrono::steady_clock::now() - sleep_target;
log_sleep_overshoot(overshoot_ns);
```
In order to avoid too much code duplication, I moved out the overshoot logging to common platform code. Unfortunately I don't have a Windows build environment set up, so I couldn't test these changes.
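Conceptually, the shared helper just wraps the call visible in the diff above; a rough sketch (names taken from the diff, with `sleep_overshoot_tracker` and `print_info` assumed to exist in the surrounding platform code):

```cpp
// Sketch based on the diff above, not the exact Sunshine source.
void log_sleep_overshoot(std::chrono::nanoseconds overshoot_ns) {
  // Aggregate overshoot samples (converted to ms) and print statistics
  // every 20 s; printing only happens at debug log level or higher.
  sleep_overshoot_tracker.collect_and_callback_on_interval(
    overshoot_ns.count() / 1000000., print_info, 20s);
}
```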
In the latest commits I harmonized the sleeping code for all Linux capture methods. I did only minor testing to confirm that nothing broke and that the timings remain as good as with the original changes in this PR. (I couldn't test the minor changes to the Windows capture code.) Please review. (And re-testing would be nice too.)
I am testing your code as I write. One input I have: maybe the logging could be done only in a debug build, if such a build is supported by Sunshine's build system, to save some calculations.
Thanks for testing! The logging & printing code is only executed when the log level in Sunshine's configuration is set to debug or higher. The log level can easily be changed in the UI. Here is the relevant code.
No issues from my side, LGTM!
Logging code for sleep overshoot analysis has been added to all Linux capture backends. Duplicated code has been moved out to common platform code.
Force-pushed from 87a5d93 to 5737f0d.
Thanks for reviewing and merging @cgutman!
Description
This commit computes the time of the next screen capture based on the current frame's theoretical time point instead of the actual capture time. This should be slightly more precise and lead to better frame timing.
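In essence this is the classic fixed-timestep scheduling fix; a sketch (variable names mine, not a verbatim diff):

```cpp
// Before (sketch): each late wake-up pushes the whole schedule back,
// so small delays accumulate into a lower average framerate.
// next_frame = std::chrono::steady_clock::now() + frame_interval;

// After (sketch): advance by exactly one theoretical frame interval,
// so wake-up jitter no longer accumulates.
next_frame += frame_interval;
```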
I made the same change in three parts of the codebase:

- `kmsgrab.cpp` (`display_ram_t` and `display_vram_t`). IIUC these correspond to capture to system RAM for software encoding and capture to VRAM for hardware encoding, respectively. Please correct me if I'm wrong; I did not try to understand the whole pipeline.
- `x11grab.cpp`, which presented the same issue.

At this point I did only preliminary testing of the kmsgrab & VA-API case on an AMD 6650. The results were very encouraging, with a precise 60.0 Hz instead of the previous average of 59.94 Hz. (See #2286 for details.) I'll vary my testing over the next few days and report back. Any feedback and review is welcome, of course.
If I'm not mistaken this fix should benefit all Linux users with the exception of NvFBC configurations.
Issues Fixed or Closed
Fixes #2286
Type of Change
Checklist
Branch Updates
LizardByte requires that branches be up-to-date before merging. This means that after any PR is merged, this branch must be updated before it can be merged. You must also enable "Allow edits from maintainers".