
Use a time-based limit in Terminal._innerWrite #1818

Merged: 6 commits merged into xtermjs:master on Mar 4, 2019

Conversation

juancampa
Contributor

I'm opening up this PR mostly to hear what you guys think of this proposal.

Problem

Currently WRITE_BATCH_SIZE limits the number of writes processed per batch so the renderer gets a chance to draw a frame. This keeps the terminal updating, but the resulting framerate varies widely.

Solution

The approach I'm proposing looks at the clock to determine when to stop as opposed to a fixed number of iterations. Effectively this allocates some time for writing, while the rest can be used for rendering.
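
For illustration, here is a minimal standalone sketch of the time-based loop (the constant name WRITE_TIMEOUT_MS mirrors this PR, but the buffer, parse callback, and rescheduling are simplified stand-ins for the real Terminal._innerWrite internals):

// A sketch, not the actual implementation: process chunks until a time budget
// is spent, then yield so the renderer can draw a frame.
const WRITE_TIMEOUT_MS = 12; // time budget for writing per frame, in ms

function innerWrite(writeBuffer: string[], parse: (data: string) => void): void {
  const startTime = Date.now();
  while (writeBuffer.length > 0) {
    parse(writeBuffer.shift()!);
    // Stop once the budget is spent so the renderer gets a chance to draw.
    if (Date.now() - startTime >= WRITE_TIMEOUT_MS) {
      break;
    }
  }
  if (writeBuffer.length > 0) {
    // Yield to the event loop, then continue with the remaining chunks.
    setTimeout(() => innerWrite(writeBuffer, parse), 0);
  }
}

With a 12 ms budget, roughly half of a 60 FPS frame (~16.7 ms) is left for rendering, which matches the FPS/duration trade-off in the table below.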

Tests run

From my tests, this change makes the terminal feel a lot smoother when processing a lot of output, with the drawback that it takes a bit longer to print it all. I ran tests with different limits and here are the results:

Test setup:

  • Using hyper
  • 165 rows x 295 cols @ 5px font
  • command: find ~
LIMIT                                 FPS     Duration
WRITE_TIMEOUT_MS=12ms                 33~40   11.7s
WRITE_TIMEOUT_MS=24ms                 25~30   11.5s
WRITE_TIMEOUT_MS=30ms                 20~25   10s
WRITE_TIMEOUT_MS=MAX_INT              9~13    8.6s
Existing approach (300 iterations)    8~18    8.8s
  • The first column is the value of the new constant that limits writing (or the existing approach)
  • The second column is the framerate as measured by Electron's devtools
  • The third column is the time it takes for the command to finish

As you can see, it takes a bit longer to display everything, which is expected because more frames are being drawn, but IMO it actually feels faster because you see a lot more of the output, and the terminal doesn't look like it's choking.

Caveats

This was only tested with the WebGL renderer by @Tyriar, which is still WIP (and also working amazingly, btw).

The idea is that writing should run for a bit and then let the renderer draw
a frame so that the terminal looks responsive. The existing approach
limits the work done using a fixed number of elements from the write
buffer, so the duration of a frame can vary widely.

This approach looks at the clock to determine when to stop: we basically
allocate an amount of time each frame for writing, while the rest can be
used for rendering.

From my tests this change makes the terminal feel a lot smoother.
@juancampa juancampa changed the title Use a time-based limit to Terminal._innerWrite Use a time-based limit in Terminal._innerWrite Dec 8, 2018
@juancampa
Contributor Author

I could also put this behind a flag so users can choose between smoothness and finishing quickly.

@jerch
Member

jerch commented Dec 8, 2018

@juancampa Yeah, the current buffer write time estimation needs some refinement; I like the idea of doing it time-based. The problem is that there are 3 major timing factors, all of which are under heavy change atm:

  • time to store incoming data in the terminal buffer
    With the new TypedArray buffer the average input rate doubles (from ~7 MB/s up to ~18 MB/s on my machine). There are pending changes that will further speed things up to >25 MB/s (Deprecate and remove JS array buffer version #1811).
  • time to draw one terminal page
    This depends on the renderer used; their speed is WebGL > canvas > DOM, where the DOM renderer takes 5 to 10 times longer than the WebGL renderer.
  • GC
    If it pops up, it easily takes many milliseconds to finish. We cannot control it by any means, but the transition to the TypedArray buffer already reduced the GC pressure by a great amount (dropped from ~20% to <3% of runtime during input flow).

Input flow hiccups occur whenever the sum of the 3 factors differs too much from the average sum for one rendered frame. I phrased it this vaguely because users' perception works more with differences than with absolute numbers - a constantly low framerate is typically perceived as a better experience than a very high framerate with random hiccups. It's all about difference: the smoother the flow, the better.
We are currently working on making all 3 factors less time consuming, so the framerate will go up (I have a playground branch which shows 40-60 fps without patching Terminal._innerWrite). But the faster we make the average fps, the more obvious a single frame that takes slightly longer becomes - a hiccup. This can happen when the input chunk is much bigger than average and/or a GC event occurs. The render time is typically pretty stable. Furthermore, the experience will go downhill on a less powerful machine (devtools has a great feature to simulate a 4x slower CPU; with it set, the input starts to "flicker").

To sum this up: I think we need a tiny benchmark that tracks the runtimes of typical input data chunks. Based on that data we can then adjust the chunk size to keep the fps at a certain level. We would have to leave some room for "random" GC events, something like this:

// target fps
// note: to guarantee a certain fps this has to be doubled due to the sampling theorem
const FPS = 60;
const RENDER_TIME = 2;         // in msec, depends on renderer
const DATA_THROUGHPUT = 20000; // in bytes/ms, benchmarked
const GC_GAP = 5;              // in msec, some room for GC
...
const time_to_spend = 1000 / FPS;  // in msec
const max_chunk_size = (time_to_spend - GC_GAP - RENDER_TIME) * DATA_THROUGHPUT;
// now fix the chunk size and parse
parse(chunk.slice(0, max_chunk_size));

Now what happens if the incoming data chunks are rather small and many? Yeah, this is a bigger problem than it might seem - it's actually what yes does (due to its small chunks and heavy context switches it's a good burn-in test for any data pumping, for other emulators as well). With a logic like the above, handling the next small data chunk competes with the renderer in a weird way - in the JS event loop. They will delay each other, which leads to poor perf overall. Therefore the incoming chunks should be aggregated to some sane size as well, preferably beforehand (not within the critical event loop, see #1744). Another way to deal with this would be to offload the input handling into a webworker.
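
As a rough illustration of that pre-aggregation idea (hypothetical names and thresholds; #1744 does this on the feeding side, outside the terminal), many small chunks could be coalesced before they ever reach the write path:

// Sketch only: collect incoming chunks and hand over one larger chunk, either
// once enough bytes have accumulated or after a short timeout.
const MIN_CHUNK_SIZE = 65536; // aggregate until roughly this many chars
const FLUSH_TIMEOUT_MS = 5;   // ...or until this much time has passed

function createAggregator(flush: (data: string) => void): (chunk: string) => void {
  let pending = '';
  let timer: ReturnType<typeof setTimeout> | undefined;

  const doFlush = (): void => {
    if (timer !== undefined) {
      clearTimeout(timer);
      timer = undefined;
    }
    if (pending.length > 0) {
      flush(pending);
      pending = '';
    }
  };

  return (chunk: string): void => {
    pending += chunk;
    if (pending.length >= MIN_CHUNK_SIZE) {
      doFlush();                                     // big enough, hand it over now
    } else if (timer === undefined) {
      timer = setTimeout(doFlush, FLUSH_TIMEOUT_MS); // flush small leftovers soon
    }
  };
}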

@Tyriar
Member

Tyriar left a comment

Thanks for the PR, I've been wanting this for a while. We may need to hold off on merging for a bit as it will probably mess up all our test numbers for the big PRs, but I definitely want this to go in 😃

src/Terminal.ts (outdated review thread, resolved)
-  const data = writeBatch.shift();
+  const time = Date.now();
+  while (this.writeBuffer.length > 0) {
+    const data = this.writeBuffer.shift();
Member

When #1796 goes in we can tweak this further to avoid the shift call (pass in a code(s) or start/offset).

src/Terminal.ts (outdated review thread, resolved)
@jerch
Member

jerch commented Dec 9, 2018

👍 for holding it off a bit until we are done with the big PRs.

Imho we should also establish some way to split data chunks to avoid the hiccup problem as indicated above. Currently a really big chunk would simply "stall" the terminal.

@juancampa
Contributor Author

Imho we should also establish some way to split data chunks to avoid the hiccup problem as indicated above. Currently a really big chunk would simply "stall" the terminal.

Agreed. Big chunks are uninterruptible. We should probably tackle that in a follow-up PR.

@juancampa
Contributor Author

Also, @jerch thanks for the detailed explanation! I agree with your proposal. This solution is not final as it doesn't achieve the desired FPS perfectly, but it does take us most of the way there. I'm down for refining it as a follow-up task.

@Tyriar
Member

Tyriar commented Dec 10, 2018

Currently a really big chunk would simply "stall" the terminal.

@jerch can you clarify what you mean here? Doesn't this PR solve this provided rendering is fast enough?

@jerch
Member

jerch commented Dec 10, 2018

@Tyriar
Sure. Currently the measurement is done on the number of chunks, which gives a good estimate when the chunks don't differ too much in size. The input runtime, though, scales with bytes, thus with the sum of the chunk sizes worked on. If there is one really big chunk, the input handling will take much longer - a hiccup.

We can easily avoid this by inspecting the chunk sizes and doing the estimate based on those instead of the number of chunks.
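
A hedged sketch of what that could look like (hypothetical names and budget; slicing an oversized chunk assumes the parser can resume across arbitrary cut points):

// Sketch only: stop on a byte budget rather than a chunk count, and slice an
// oversized chunk so a single huge chunk cannot blow past the budget alone.
const BYTE_BUDGET = 300000; // bytes we estimate the parser can handle per frame

function writeBatchByBytes(writeBuffer: string[], parse: (data: string) => void): void {
  let remaining = BYTE_BUDGET;
  while (writeBuffer.length > 0 && remaining > 0) {
    const data = writeBuffer.shift()!;
    if (data.length > remaining) {
      parse(data.slice(0, remaining));
      writeBuffer.unshift(data.slice(remaining)); // keep the rest for the next frame
      remaining = 0;
    } else {
      parse(data);
      remaining -= data.length;
    }
  }
}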

@juancampa
Contributor Author

Here's a screenshot that illustrates how a single call to parse can take a long time (in this case 19ms, already too much for 60 FPS). I'm going to try the mini-benchmark approach where we measure how long it takes to process, say, 1K, and then call parse with at most ~16ms worth of bytes.

[screenshot: 2018-12-09_23-22]
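
A hedged sketch of that measurement side (hypothetical names; instead of a fixed 1K probe, this version simply times whichever chunks get parsed and smooths the result into a byte budget):

// Sketch only: time each parse call, keep a smoothed bytes-per-ms estimate,
// and derive how many bytes fit into a ~16 ms frame budget.
const FRAME_BUDGET_MS = 16;
let bytesPerMs = 1000; // initial guess, refined as we measure

function parseMeasured(data: string, parse: (data: string) => void): void {
  const start = performance.now();
  parse(data);
  const elapsed = performance.now() - start;
  if (elapsed > 1) { // skip chunks handled too quickly to time reliably
    bytesPerMs = 0.9 * bytesPerMs + 0.1 * (data.length / elapsed); // exponential smoothing
  }
}

function currentByteBudget(): number {
  return Math.floor(bytesPerMs * FRAME_BUDGET_MS);
}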

@jerch
Member

jerch commented Dec 10, 2018

@juancampa I had a playground branch in summer that did the mini-benchmark approach with good results (it was not consuming much time of its own). It basically shimmed in from time to time, tracked the runtime for the bunch of chunks at hand and derived a throughput number from it. The only tricky thing was the time resolution of the browser due to the Spectre mitigations; most chunks were handled too quickly to get a reliable number.
The rest was like the stuff above, but without the GC gap.

@juancampa
Contributor Author

@jerch Mind sharing it? :)

@jerch
Member

jerch commented Dec 11, 2018

@juancampa Sorry, can't find it anymore; it was around the testbed PRs (#1528, #1529, #1530) but did not end up in one of them.

@jerch
Member

jerch commented Dec 15, 2018

@juancampa
Slightly offtopic: since you come from hyper, you might want to check whether hyper would benefit from a tiny buffer between the pty and the browser engine. We had the same problem in the demo app: the event loop got bombed by tons of small chunks from the websocket, which made testing totally unreliable. #1744 solved it.

(Hint: it will double the throughput of hyper, but fps will go down slightly without a big-chunk fix as proposed above.)

@juancampa
Contributor Author

juancampa commented Dec 15, 2018 via email

juancampa added a commit to vercel/hyper that referenced this pull request Dec 29, 2018
@Tyriar
Member

Tyriar left a comment

Most of the buffer-related changes are in, and WebGL will probably be an addon when it gets merged (which we can re-test). I say let's push this in for 3.12.0 🎉

@Tyriar Tyriar merged commit ce1c45f into xtermjs:master Mar 4, 2019
@Tyriar
Member

Tyriar commented Mar 4, 2019

Created #1955 to follow up the removal of .shift. Thanks again @juancampa 👌

@Tyriar Tyriar added this to the 3.12.0 milestone Mar 5, 2019