
Processing large data is extremely slow (in VMware) #3075

Closed
egmontkob opened this issue Oct 5, 2019 · 6 comments
Labels: Area-Performance (Performance-related issue), Issue-Bug (It either shouldn't be doing this or needs an investigation), Needs-Tag-Fix (Doesn't match tag requirements), Priority-1 (P1), Product-Terminal (The new Windows Terminal)

Comments

@egmontkob

Environment

Windows build number: Win32NT 10.0.18362.0
Windows Terminal version (if applicable): 0.5.2762.0

inside VMware Workstation Player 15.5.0 on Ubuntu 19.10

Core i5-6200U @ 2.30GHz

Steps to reproduce / Actual behavior

cat'ing my favorite test file for speed measurement takes an extremely long time.

Inside VMware, in Windows Terminal it takes about 10 minutes of wall clock time.

On Linux (inside the same VMware) it takes 2 to 22 seconds, measured in various terminal emulators, namely: Konsole (KDE), Pterm (PuTTY), St (suckless), Terminology (Enlightenment), Urxvt, VTE, Xterm.

The test file is ~42 MiB in size and contains ~667k lines. It's the output of ls -lR --color=always / on Ubuntu. (I'm not attaching it since it could leak private stuff, plus it would be a pointless waste of storage space.)

cat is executed either locally in PowerShell or remotely over ssh to my host computer; it doesn't matter which.

On the previous version of WT, which I installed about a week ago, it took about 9:22 (the same time twice) to cat this file at the terminal's default size. If the terminal was iconified, or I switched to another (idle) tab, the time dropped to around 6:50. Interestingly, in a tiny but visible terminal (approx. 30x4) the time increased to 11 minutes. In a giant terminal (maximized, with a pretty small font) the time hardly increased, to 9:37.

With the current WT version 0.5.2762.0 I'm now seeing even larger numbers: 10:39 at the default size (measured only once), 7:30-ish in a minimized window or when viewing another tab.

The exact times probably don't matter too much; we're talking about magnitudes here. It's ~100x slower than, for example, VTE with its 5.2 seconds when it's in a good mood.

The given example is admittedly an extreme one (why would anyone cat such a giant file?), but smaller files such as /etc/services already take a noticeable ~0.5 seconds, whereas in Linux terminals it's instantaneous. For verbose compilations of large projects, this could cause a noticeable productivity loss for developers.

I don't know whether running under VMware (e.g. no hardware graphics acceleration) is relevant, but again, the Linux numbers were also measured inside VMware, on a Fedora 30 guest.

Expected behavior

Windows Terminal should be comparable in speed to most graphical terminals on *NIX.

Since I don't know the reason for the slowness (which needs to be investigated first), and I don't know whether it's specific to VMware, these are random guesses only and might completely miss the actual problem:

The terminal should read and parse as much data as possible, only stopping to update its UI according to the monitor refresh rate, typically 60 times per second (or maybe at a hardwired 60Hz if the refresh rate can't be determined or the concept doesn't exist – I don't know how it goes in VMware). If updating the UI takes so much time that there's hardly any time left for processing incoming data, it should start dropping frames (VTE counterpart, kind of). In the iconified state, or when another tab is selected, it shouldn't spend any time on drawing.
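As a rough sketch of that throttling idea (illustrative only, not Windows Terminal's actual render loop; the class and method names are made up):

```cpp
// Hypothetical sketch: parse incoming bytes continuously and only trigger a
// repaint about once per display refresh, dropping frames in between; when the
// window is minimized or the tab is in the background, never paint at all.
#include <chrono>

struct ThrottledPresenter
{
    using clock = std::chrono::steady_clock;

    explicit ThrottledPresenter(std::chrono::nanoseconds framePeriod)
        : _framePeriod{ framePeriod } {}

    // Called from the I/O thread after each parsed chunk.
    // Returns true if the caller should kick off a repaint now.
    bool ShouldPresent(bool windowVisible)
    {
        if (!windowVisible)
        {
            return false; // iconified / background tab: parse only, never draw
        }
        const auto now = clock::now();
        if (now - _lastPresent < _framePeriod)
        {
            return false; // still inside the current frame budget: drop this frame
        }
        _lastPresent = now;
        return true;
    }

private:
    std::chrono::nanoseconds _framePeriod;
    clock::time_point _lastPresent{};
};
```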


Note that I've checked other bug reports about slowness, e.g. #1064, but they don't seem to be about this kind of extreme slowness.

@ghost ghost added the Needs-Triage and Needs-Tag-Fix labels Oct 5, 2019
@carlos-zamora carlos-zamora added the Area-Performance, Issue-Bug, and Product-Terminal labels Oct 7, 2019
@ghost ghost removed the Needs-Tag-Fix label Oct 7, 2019
@zadjii-msft zadjii-msft added this to the Terminal v1.0 milestone Oct 24, 2019
@DHowett-MSFT DHowett-MSFT removed the Needs-Triage label Oct 29, 2019
@zadjii-msft zadjii-msft added the Priority-3 (P3) label Jan 22, 2020
@cinnamon-msft cinnamon-msft added the Priority-2 (P2) and removed the Priority-3 (P3) labels Feb 28, 2020
@DHowett-MSFT
Contributor

Re-marking this one for post-1.0 instead of pre-1.0. We're making improvements here, such that we can cut down on a bunch of unnecessary rendering, but we're not quite to the point where we can just shuttle all the data through the console driver at a great enough speed to make this "instantaneous". We'll keep investigating after our launch. Thanks for the robust bug reports, as always, @egmontkob.

@miniksa
Member

miniksa commented Jun 29, 2020

90a24b2 is my attempt at experimenting to see if we can make this go even faster by breaking the locking that is occurring here.

For big.txt from #1064 (which is about 6MB), I go from

real    0m3.838s
user    0m0.000s
sys     0m0.150s

to

real    0m0.124s
user    0m0.000s
sys     0m0.113s

The graphical output still takes longer than that, but it's not backing up the actual I/O. Also, it's a super dumb and rough implementation to try to prove whether this is worth pursuing. It's nowhere near ready. But I think it proves that there's a good return on investment to be had in this area by breaking up the locking.
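The gist of the experiment, as a minimal sketch (hypothetical names, not the code in 90a24b2): the I/O side only holds the buffer lock for the brief append, and the renderer draws from a snapshot instead of keeping the writer waiting.

```cpp
// Rough illustration of decoupling I/O from rendering: the writer appends
// parsed text under a short-lived lock, the renderer copies a snapshot and
// then draws without holding the lock, so slow drawing no longer stalls I/O.
#include <mutex>
#include <string>
#include <string_view>
#include <vector>

struct SharedTextBuffer
{
    void Append(std::wstring_view text) // called by the console I/O thread
    {
        std::scoped_lock lock{ _mutex }; // held only for the cheap append
        _lines.emplace_back(text);
    }

    std::vector<std::wstring> Snapshot() const // called by the render thread
    {
        std::scoped_lock lock{ _mutex }; // brief: copy out, then release
        return _lines;
    }

private:
    mutable std::mutex _mutex;
    std::vector<std::wstring> _lines;
};

// Render thread: auto frame = buffer.Snapshot();  // draw from the copy while
// the I/O thread keeps appending concurrently instead of waiting on the paint.
```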

@NeKJ

NeKJ commented Jul 30, 2020

Yes, IMO it's definitely worth it. The display/rendering should never block IO/CPU operations (as happens now, if I understood correctly).

@miniksa
Member

miniksa commented Jul 30, 2020

> Yes, IMO it's definitely worth it. The display/rendering should never block IO/CPU operations (as happens now, if I understood correctly).

Well... I can't have it "never" block IO/CPU unless I consume an infinite amount of memory or otherwise optimize the entire pipeline to be balanced.
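A minimal illustration of that trade-off (purely hypothetical, not Terminal code): with a bounded queue between the I/O and render sides, the producer eventually has to wait once the consumer falls far enough behind; avoiding that entirely would mean unbounded buffering.

```cpp
// Bounded producer/consumer queue: Push() blocks once the renderer is
// `capacity` chunks behind, which is exactly the back-pressure being discussed.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

struct BoundedChunkQueue
{
    explicit BoundedChunkQueue(std::size_t capacity) : _capacity{ capacity } {}

    void Push(std::string chunk) // I/O thread
    {
        std::unique_lock lock{ _mutex };
        _notFull.wait(lock, [&] { return _chunks.size() < _capacity; });
        _chunks.push_back(std::move(chunk));
        _notEmpty.notify_one();
    }

    std::string Pop() // render/consumer thread
    {
        std::unique_lock lock{ _mutex };
        _notEmpty.wait(lock, [&] { return !_chunks.empty(); });
        auto chunk = std::move(_chunks.front());
        _chunks.pop_front();
        _notFull.notify_one();
        return chunk;
    }

private:
    std::size_t _capacity;
    std::mutex _mutex;
    std::condition_variable _notFull, _notEmpty;
    std::deque<std::string> _chunks;
};
```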

ghost pushed a commit that referenced this issue Feb 16, 2021
…nd bitmap runs and utf8 conversions (#8621)

## References
* See also #8617 

## PR Checklist
* [x] Supports #3075
* [x] I work here.
* [x] Manual test.

## Detailed Description of the Pull Request / Additional comments

### Window Title Generation
Every time the renderer checks the title, it's doing two bad things that
I've fixed:
1. It's assembling the full title by concatenating the prefix every time.
   No one ever reads just the prefix after it is set, other than for that
   concat. So instead of storing the prefix and the title, I store the
   assembled prefix + title and the bare title.
2. A copy had to be made because it was returning `std::wstring` instead
   of `std::wstring&`. Now it returns the ref.
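A minimal sketch of the shape of that change (illustrative names, not the actual renderer API): assemble once on change, return a const ref on query.

```cpp
// Store the already-assembled "prefix + title" next to the bare title and hand
// out a const reference, so querying the title no longer concatenates or copies.
#include <string>

class TitleCache
{
public:
    void SetPrefix(std::wstring prefix)
    {
        _prefix = std::move(prefix);
        _assembled = _prefix + _title; // assemble once, when something changes
    }

    void SetTitle(std::wstring title)
    {
        _title = std::move(title);
        _assembled = _prefix + _title;
    }

    const std::wstring& AssembledTitle() const noexcept
    {
        return _assembled; // no per-query concat, no copy
    }

private:
    std::wstring _prefix;
    std::wstring _title;
    std::wstring _assembled;
};
```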

### Dirty Area Return
Every time the renderer checks the dirty area, which is sometimes
multiple times per pass (regular text printing, again for selection,
etc.), a vector is created off the heap to return the rectangles. The
consumers only ever iterate this data. Now we return a span over a
rectangle or rectangles that the engine must store itself.
1. For some renderers, it's always a constant 1 element. They update
   that 1 element when dirty is queried and return it in the span with a
   span size of 1.
2. For other renderers with more complex behavior, they're already
   holding a cached vector of rectangles. Now it's effectively giving
   out the ref to those in the span for iteration.
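Roughly, the constant-one-element case looks like this (illustrative sketch only; the real engines differ):

```cpp
// The engine owns the rectangle storage and returns a non-owning span, so
// querying the dirty area several times per pass no longer allocates a vector.
#include <span>

struct Rect { int left, top, right, bottom; };

class SimpleRenderEngine
{
public:
    std::span<const Rect> GetDirtyArea()
    {
        _dirty = ComputeDirtyRect(); // refresh the single cached element
        return { &_dirty, 1 };       // span of size 1, no heap allocation
    }

private:
    Rect ComputeDirtyRect() const { return { 0, 0, 80, 25 }; }
    Rect _dirty{};
};
```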

### Bitmap Runs
The `til::bitmap` used a `std::optional<std::vector<til::rectangle>>`
inside itself to cache its runs and would clear the optional when the
runs became invalidated. Unfortunately doing `.reset()` to clear the
optional will destroy the underlying vector and have it release its
memory. We know it's about to get reallocated again, so we're just going
to make it a `std::pmr::vector` and give it a memory pool. 

The alternative solution here was to use a `bool` and
`std::vector<til::rectangle>` and just flag when the vector was invalid,
but that was honestly more code changes and I love excuses to try out
PMR now.

Also, instead of returning the ref to the vector... I'm just returning a
span now. Everyone just iterates it anyway, may as well not share the
implementation detail.
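A small sketch of the PMR idea under those assumptions (illustrative names, not the real `til::bitmap` internals):

```cpp
// The run vector still lives in an optional and still gets reset() on
// invalidation, but its allocations come from a pool, so the memory released
// by reset() is recycled on the next recomputation instead of hitting the heap.
#include <memory_resource>
#include <optional>
#include <vector>

struct Rect { int left, top, right, bottom; };

class RunCache
{
public:
    void Invalidate() noexcept
    {
        _runs.reset(); // destroys the vector; the pool keeps the memory
    }

    const std::pmr::vector<Rect>& Runs()
    {
        if (!_runs)
        {
            _runs.emplace(&_pool);             // reallocation served from the pool
            _runs->push_back({ 0, 0, 10, 1 }); // placeholder for the real run computation
        }
        return *_runs;
    }

private:
    std::pmr::unsynchronized_pool_resource _pool;
    std::optional<std::pmr::vector<Rect>> _runs;
};
```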

### UTF-8 conversions
When testing with Terminal and looking at the `conhost.exe`'s PTY
renderer, it spends a TON of allocation time on converting all the
UTF-16 stuff inside to UTF-8 before it sends it out the PTY. This was
because `ConvertToA` was allocating a string inside itself and returning
it just to have it freed after printing and looping back around again...
as a PTY does.

The change here is to use `til::u16u8` that accepts a buffer out
parameter so the caller can just hold onto it.
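The pattern, as a hedged sketch (the real `til::u16u8` signature may differ; the conversion body here is a placeholder):

```cpp
// The caller owns a scratch std::string and passes it in, so the hot PTY loop
// reuses one allocation instead of constructing and freeing a UTF-8 string per write.
#include <string>
#include <string_view>

// Placeholder conversion; real code would do a proper UTF-16 -> UTF-8 pass.
void ConvertU16ToU8(std::wstring_view in, std::string& out)
{
    out.clear(); // keeps the existing capacity
    for (const auto wch : in)
    {
        out.push_back(wch < 0x80 ? static_cast<char>(wch) : '?');
    }
}

void PtyWriteLoop()
{
    std::string scratch; // allocated once, reused for every frame
    for (;;)
    {
        std::wstring_view frame = L"..."; // next chunk from the renderer
        ConvertU16ToU8(frame, scratch);
        // WriteFile(hPty, scratch.data(), scratch.size(), ...);
        break; // illustration only
    }
}
```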

## Validation Steps Performed
- [x] `big.txt` in conhost.exe (GDI renderer)
- [x] `big.txt` in Terminal (DX, PTY renderer)
- [x] Ensure WDDM and BGFX build under Razzle with this change.
@Po-wei

Po-wei commented May 7, 2021

@miniksa
Just a small finding, using the same big.txt from #1064 on my computer.
time cat big.txt took

real    11.26s
user    0.00s
sys     1.61s

But if I do an ssh localhost first and then run the same command, it took

real    3.38s
user    0.00s
sys     0.17s

The rendering looks identical to me in both cases.
Not sure if ssh does larger buffering in the background or something.

win 10.0.19042.928
terminal 1.8.1032.0
ubuntu 20.04

ghost pushed a commit that referenced this issue May 13, 2021
…st scenarios (#10071)

Implement PGO in pipelines for AMD64 architecture; supply training test scenarios

## References
- #3075 - Relevant to speed interests there and other linked issues.

## PR Checklist
* [x] Closes #6963
* [x] I work here.
* [x] New UIA Tests added and passed. Manual build runs also tested.

## Detailed Description of the Pull Request / Additional comments
- Creates a new pipeline run for creating instrumented binaries for Profile Guided Optimization (PGO).
- Creates a new suite of UIA tests on the full Windows Terminal app to run PGO training scenarios on instrumented binaries (and incidentally can be used to write other UIA tests later for the full Terminal app.)
- Creates a new NuGet artifact to store trained PGO databases (PGD files) at `Microsoft.Internal.Windows.Terminal.PGODatabase`
- Creates a new NuGet artifact to supply large-scale test content for automated tests at `Microsoft.Internal.Windows.Terminal.TestContent`
- Adjusts the release pipeline to run binaries in PGO optimized mode where content from PGO databases is leveraged at link time to optimize the final release build

The following binaries are trained:
- OpenConsole.exe
- WindowsTerminal.exe
- TerminalApp.dll
- TerminalConnection.dll
- Microsoft.Terminal.Control.dll
- Microsoft.Terminal.Remoting.dll
- Microsoft.Terminal.Settings.Editor.dll
- Microsoft.Terminal.Settings.Model.dll

In the future, adding `<PgoTarget>true</PgoTarget>` to a new `vcxproj` file will automatically enroll the DLL/EXE for PGO instrumentation and optimization going forward.

Two training test scenarios are implemented:
- Smoke test the Terminal by just opening it and typing a bit of text then exiting. (Should help focus on the standard launch path.)
- Optimize bulk text output by launching terminal, outputting `big.txt`, then exiting.

Additional scenarios can be contributed to the `WindowsTerminal_UIATests` project with the `[TestProperty("IsPGO", "true")]` annotation to add them to the suite of scenarios for PGO.

**NOTE:** There are currently no weights applied to the various test scenarios. We will revisit that in the future when/if necessary.

## Validation Steps Performed
- [x] - Training run completed at https://dev.azure.com/ms/terminal/_build?definitionId=492&_a=summary
- [x] - Optimization run completed locally (by forcing `PGOBuildMode` to `Optimize` on my local machine, manually retrieving the databases with NuGet, and building).
- [x] - Validated locally that x86 and ARM64 do not get trained and automatically skip optimization as databases are not present for them.
- [x] - Smoke tested optimized binary versus latest releases. `big.txt` output through CMD is ~11-12 seconds prior to PGO and just over 8 seconds with PGO.
lhecker added a commit that referenced this issue May 14, 2021
## Summary of the Pull Request

Introduces `til::rle`, a vector-like container which stores elements of
type T in a run length encoded format. This allows efficient compaction
of repeated elements within the vector.

## References

* #8000 - Supports buffer rewrite work. A re-use of `til::rle` will be
  useful as a column counter as we pursue NxM storage and presentation.
* #3075 - The new iterators allow skipping forward by multiple units,
  which wasn't possible under `TextBuffer-/OutputCellIterator`.
  They also allow bulk insertions.
* #8787 and #410 - High probability this should be `pmr`-ified
  like `bitmap` for things like `chafa` and `cacafire`
  which are changing the run length frequently.

## PR Checklist

* [x] Closes #8741
* [x] I work here.
* [x] Tests added.
* [x] Tests passed.

## Validation Steps Performed

* [x] Ran `cacafire` in `OpenConsole.exe` and it looked beautiful
* [x] Ran new suite of `RunLengthEncodingTests.cpp`

Co-authored-by: Michael Niksa <miniksa@microsoft.com>
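For context, a minimal illustration of run-length-encoded storage (not the actual `til::rle` API): repeated elements collapse into (value, length) runs, which is why iteration can skip forward by whole runs at a time.

```cpp
// Toy RLE container: appending a value that matches the last run just extends
// that run, so e.g. 120 blank columns are stored as a single (L' ', 120) pair.
#include <cstddef>
#include <utility>
#include <vector>

template <typename T>
class RleVector
{
public:
    void append(T value, std::size_t count)
    {
        if (!_runs.empty() && _runs.back().first == value)
        {
            _runs.back().second += count; // extend the existing run
        }
        else
        {
            _runs.emplace_back(std::move(value), count);
        }
    }

    std::size_t size() const noexcept // total logical element count
    {
        std::size_t total = 0;
        for (const auto& run : _runs)
        {
            total += run.second;
        }
        return total;
    }

private:
    std::vector<std::pair<T, std::size_t>> _runs;
};

// RleVector<wchar_t> cols; cols.append(L' ', 120); // one run, not 120 cells
```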
DHowett added a commit that referenced this issue May 18, 2021
ghost pushed a commit that referenced this issue May 20, 2021
@zadjii-msft zadjii-msft added the Priority-1 (P1) and removed the Priority-2 (P2) labels Jul 12, 2021
@zadjii-msft zadjii-msft modified the milestones: Terminal v2.0, 22H1 Jan 4, 2022
@zadjii-msft zadjii-msft modified the milestones: 22H1, Terminal v1.14 Feb 2, 2022
@zadjii-msft zadjii-msft modified the milestones: Terminal v1.14, 22H1 Mar 10, 2022
@miniksa miniksa removed their assignment Mar 31, 2022
@zadjii-msft zadjii-msft modified the milestones: 22H1, 22H2 Aug 3, 2022
@zadjii-msft zadjii-msft modified the milestones: 22H2, Backlog Jul 5, 2023
@lhecker
Member

lhecker commented Oct 10, 2024

I've had to modify ls -lR --color=always / to exclude /usr and /mnt in WSL2, because the former contains the WSL driver mount point (= slow), and the latter the NTFS mounts (same). The remaining output is ~10MB (measured with wc -c). It takes roughly 0.8s to produce the output (output redirected to /dev/null).

Printing the output in Windows 11 24H2 with conhost takes roughly 2.4s, and about as long in Windows Terminal Preview 1.22. That's not quite as good as your best value of 2s, but that's because of the small WSL2 pipe buffer size of just 4KiB. I've been meaning to ask them to increase it to 128KiB, which should double our performance and bring this down to <2s.

I'll be closing this issue then. 😊

@lhecker lhecker closed this as completed Oct 10, 2024
@microsoft-github-policy-service microsoft-github-policy-service bot added the Needs-Tag-Fix label Oct 10, 2024