
parser worker isolation considerations #3368

Closed
jerch opened this issue Jun 11, 2021 · 14 comments
Assignees
Labels
area/parser area/performance breaking-change Breaks API and requires a major version bump type/enhancement Features or improvements to existing features

Comments

@jerch
Member

jerch commented Jun 11, 2021

Once again some considerations about the idea to separate the parser / VT emulation from the main thread, this time in its own dedicated issue.

Where are we at?
Currently xterm.js is 100% main thread bound, thus everything from reading chunks in term.write to rendering happens on the main thread.

The Good...
This design works pretty nicely for casual integrations, where the terminal does not do much. Furthermore it is quite convenient to deal with the terminal object and its API for custom extensions - everything is at hand in the main thread context.

The Bad...
The picture changes for more complex integrations where the main thread is already under high pressure for whatever reason - any further computational stress might cause lag spikes or make things feel sluggish. This scenario is quite easy to trigger with xterm.js: just run multiple terminal instances within one embedding document (like a multiplexer) and give the terminals some IO. Here the terminals behave greedily with main thread runtime as the limited resource - every instance tries to saturate the main thread on its own (based on a 60 fps approximation, thus 2 instances --> 30 fps, 4 --> 15 fps and so on) without taking outer stress into the equation. Any outer stress simply prolongs the update cycles further (up to lag spikes), and multiple terminals in one page slow each other down.

The Ugly...
Workers to the rescue? As we all know, the real limiting resource beneath all that is the horsepower of your machine's CPU (plus a few more like memory, cache sizes, bus speed, ...). But JS with its strong single-thread affiliation does not do a good job utilizing CPU cores per se - we are basically bound to one core with cooperative multitasking in JS, while the underlying OS offers measures to saturate more cores preemptively.
I don't want to go into a "why is it that bad in JS" rant - I understand that the language's flexibility creates a global context burden that is hard to parallelize. Still I wonder about the general future of JS here, as CPU designs clearly go the multicore path. (Time to get an Erlang clone into the browser?)
Well, in JS we got workers as a rather expensive way to still benefit from multicore designs. Can we use them in xterm.js to lower the main thread stress? Well, yes and no - welcome to the Ugly.

What could be "workered"?
The main workload in xterm.js under high input comes from the parsing and the state changes themselves, which is synchronous blocking code with few to no references to the outer browser (main thread) env. With some rework of the internal interfaces this would be a hot candidate for a worker.
The second-highest entry in the workload books is the rendering task, with quite some differences across the selected renderer type (as parsing:rendering ratio):

  • DOM: <60:40
  • canvas: ~80:20
  • webGL: >90:10

By its nature rendering has to deal with many more browser env main thread references, and it is almost impossible to offload it solely into a worker (beside going full OffscreenCanvas, which I don't consider a viable goal atm as the support is still pretty lousy across browser engines).
The remaining tasks are user input / event handling, which are typically <1% of the workload, thus not worth delegating to a worker. (Some actions like resize still create a very high workload, but are buffer bound, thus would automatically happen in a worker if we moved parsing there.)
As an upper approximation, we could relieve the main thread of 60% - 90% of the workload created by xterm.js, depending on the selected renderer.

Why "The Ugly"?
Offloading the parser/VT emulation introduces several hard-to-swallow downsides:

  • The parser / VT emulation runs in its own JS context - there is no easy way anymore to manipulate the terminal object from the main thread due to the strict context isolation. Everything needs to be interfaced over a custom message protocol; e.g. custom parser hooks must be moved into the worker context. For this they must be serializable, thus never rely on main thread scoped variables. Same with browser events modifying the inner terminal state - they need to be channeled through to the worker.
  • The next big issue is the data transfer to/from the worker. We basically have 3 options here - hardcopy data (really bad performance), transferables (okish, see the sketch after this list) or SAB (superior in terms of "throughput"). While the first two always involve some sort of ongoing messaging, the last one is real shared memory and would outperform the others by far (in theory it opens the door for a lock-free implementation). But it is also the most cumbersome one to adopt. The main data load happening in xterm.js is getting data from the internal buffer state to render something. With a SAB basically all states needed for rendering should be arraybuffer driven - quite the refactoring effort.
    Whatever data transfer turns out to be suitable here, it adds additional runtime on top, lowering the relief numbers above again.
  • With workers, synchronization is a thing. Imho a full lock-free implementation with SABs is not possible - we simply have too many mass actions that would create illegal interim states which should not end up on the screen. Thus we need some synchronization mechanism, maybe with Atomics, to coordinate the render update cycles (can be pretty lightweight). Still this raises tons of detail questions to avoid common lock/sync issues (like thread starvation).
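
For illustration, a rough sketch of the transferables option from the list above - the worker name, message shape and recycle protocol are made up for this example, not actual xterm.js interfaces:

declare const parserWorker: Worker;      // e.g. new Worker('parser.js')
const recyclePool: Uint8Array[] = [];

function writeChunk(chunk: Uint8Array): void {
  // ownership of the underlying ArrayBuffer moves to the worker - no copy
  parserWorker.postMessage({ type: 'chunk', data: chunk }, [chunk.buffer]);
}

parserWorker.onmessage = (ev: MessageEvent) => {
  if (ev.data.type === 'recycle') {
    // the worker hands the buffer back, so it can be reused for the next read
    recyclePool.push(new Uint8Array(ev.data.buffer));
  }
};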

Is it worth the trouble?
I am not sure about that. The promise of lowering main thread occupation by 60% - 90% is a really big one; on the other hand the coding efforts to get something reliable with SABs done are huge as well (baby steps in that direction are already outlined by those nodejs build target PRs). Furthermore we probably would need a main thread fallback, as SABs cannot be used everywhere due to their security constraints (do we need different build targets for SAB-worker based?). In any case a worker solution always creates higher integration needs due to separated JS resources.

Personally I like challenges like that and I am up for trying at least. But from a more realistic perspective I think this will involve tons of changes to xterm.js, both internally and in its API (>30% rewrite), and I am not sure if we can manage to maintain the same convenience for integrators in the end. So any attempt in the worker direction might turn out to be a dead end for not being easily usable/customizable.

Up for discussion. 😸

@jerch jerch added type/enhancement Features or improvements to existing features breaking-change Breaks API and requires a major version bump area/performance area/parser labels Jun 11, 2021
@jerch jerch self-assigned this Jun 11, 2021
@Tyriar
Member

Tyriar commented Jun 11, 2021

I've gone back and forth a bit over time on this and ended up landing here:

I think we should not do only workers, but offload only when incoming data pressure is high (e.g. if the parsing time limit was hit in a frame). If we took this approach it should be a lot easier to manage, as we'd keep snappy latency by not jumping to another thread for most cases, but switch to another thread when latency isn't as important. Additionally we wouldn't need to go all in on ArrayBuffers for data transfer because of this - we would probably still need to refactor such that we have a model object or something that can easily be transferred back and forth.
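
A minimal sketch of what that heuristic could look like - FRAME_BUDGET_MS, worker and parseOnMainThread are made up for illustration, and switching back is omitted:

declare const worker: Worker;
declare function parseOnMainThread(data: Uint8Array): void;

const FRAME_BUDGET_MS = 12;   // made-up per-frame parse budget
let useWorker = false;

function write(data: Uint8Array): void {
  if (useWorker) {
    worker.postMessage(data, [data.buffer]);   // high pressure: offload, latency matters less
    return;
  }
  const start = performance.now();
  parseOnMainThread(data);                     // low pressure: stay on main thread for snappy latency
  if (performance.now() - start > FRAME_BUDGET_MS) {
    useWorker = true;                          // budget blown, switch modes
  }
}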

Graceful degradation is also extremely important here:

  • Browsers are moving towards hiding the SharedArrayBuffer object unless certain headers are sent from the server (cross-origin isolation).
  • We shouldn't assume workers will be there or that the worker file is set up correctly.

This approach would also be a lot easier to do incrementally: model refactors, worker adoption, SAB support with fallback, etc.

@jerch
Member Author

jerch commented Jun 11, 2021

@Tyriar Yes, I agree with the degradation needs - a proper worker+SAB setup is not easy anymore. In any case we would need a failover logic (maybe just drop back to the fully mainthread-driven variant).

I am not so sure about your middle ground idea - doesn't that mean we'd have to implement the full deal for both cases? Additionally we'd need to implement the "high pressure" transition logic. And about "all in on ArrayBuffers" - you mean partially using array buffers, but transmitting other state by messaging? Doesn't that create a much more cumbersome situation, as we'd have to synchronize across several worker interaction models (e.g. SAB atomics + the message states)? Furthermore, as soon as we use messaging for any state transmission, we need active code parts on both ends providing/digesting the messages. Not so with a pure SAB solution.

Well, I still have to think about that idea, maybe I just don't see the benefits yet. 🤔

@Tyriar
Member

Tyriar commented Jun 11, 2021

doesn't that mean we'd have to implement the full deal for both cases

I'd think the code would be almost entirely shared between main/worker.

Doesn't that create a much more cumbersome situation, as we'd have to synchronize across several worker interaction models (e.g. SAB atomics + the message states)?

It's trying to solve it only when it becomes a problem: if the main thread is about to be busy, offload to a worker until it calms down. If we do the fallback right it probably wouldn't be too cumbersome to switch between the 2 modes.

@mofux
Contributor

mofux commented Jun 11, 2021

From my experimentation with rendering the terminal through a Monaco Editor, I just recently faced the same questions.

My solution now is to run xterm.js in a node.js thread which also spawns the pty.

I'm hooking into xterm._core._inputHandler.onRequestRefreshRows(), which fires whenever the buffer content gets changed. I'm then computing the diff between those changes and sending it to the frontend, which will apply these changes to the monaco editor text model.

In my case, these changes operate on line-level granularity; they basically contain the line text that has changed and an array of line decorations (range, foreground, background colors, font styles etc.). I'm also sending some buffer state to the frontend that is necessary for styling (e.g. if mouse events are enabled, so we can draw a pointer):

{
  buffer: {
    normal: boolean,
    rows: number,
    cols: number,
    startId: number,
    endId: number,
    mouseEvents: boolean,
    hasScrollback: boolean,
    scrollbackVisible: boolean,
    applicationCursorKeys: boolean,
    bracketedPasteMode: boolean,
    sendFocus: boolean,
    cursor: {
      visible: boolean,
      lineId: number,
      colIndex: number
    }
  },
  edits: [
    startLineNumber: number,
    startColumn: number,
    endLineNumber: number,
    endColumn: number,
    text?: string
  ][],
  decorations: [
    lineId: number,
    decorations: [
      lineNumber: number,
      startColumn: number,
      endColumn: number,
      attrs: TerminalRendererDecorationAttributes
    ][]
  ][]
}

Note that with this kind of design I don't have to synchronise the terminal buffer of the backend with the frontend. I only send "render" instructions to the frontend, which really lowers the memory consumption and overall computation time of the frontend.

I discovered this design has several advantages:

  • Since the xterm.js parser and the pty process are running in the same node process, I don't have to care about backpressure anymore. It also eliminates the data conversions and network delay in the transport layer, almost doubling the throughput from what I currently get with the VSCode implementation.
  • I can easily delay all render updates until the previous render instruction sent to the frontend has been processed and acknowledged by it (see the sketch after this list). This really takes away a lot of overhead for fast producers like ls -lR /, and it allows me to completely pause the stream of render instructions in cases where the terminal is not even visible in the frontend.
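
A minimal sketch of that acknowledge-based flow control - RenderUpdate and the frontend channel are made-up placeholders:

type RenderUpdate = unknown;   // stands in for the render payload shown above
declare const frontend: {
  send(update: RenderUpdate): void;
  onAck(cb: () => void): void;
};

let waitingForAck = false;
let queued: RenderUpdate | null = null;

function pushUpdate(update: RenderUpdate): void {
  if (waitingForAck) { queued = update; return; }   // coalesce while an update is in flight
  waitingForAck = true;
  frontend.send(update);
}

frontend.onAck(() => {
  waitingForAck = false;
  if (queued) { const next = queued; queued = null; pushUpdate(next); }
});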

Caveats

Running xterm.js in a node.js process requires some hacks at the moment. There is internal code that relies on the global window and self objects being available at load time - these globals do not exist in a node context. This can easily be worked around via:

// stub the browser globals before loading xterm
global.self = global.window = {};
const { Terminal } = require('xterm');
const { Unicode11Addon } = require('xterm-addon-unicode11');
// clean up the stubs once the modules are loaded
delete global.self;
delete global.window;

Additionally, you cannot initialise the rendering part of xterm.js via terminal.open(element: HTMLElement). Unfortunately some of the mouse and keyboard event handling code is not abstract enough to be used in a non-browser environment, so I had to reimplement those parts myself.

Last but not least, you need a way to track an individual line in the buffer via an ID. This is very important for diff computing, because our Buffer is circular and will remove lines at the top of the viewport once the maximum number of scrollback lines is reached. It's therefore not feasible to use the line index of the buffer for this. I'm hooking into the Buffer.onTrim() event to increase a counter whenever a line at the top of the viewport gets removed. My id then is this counter plus the line index (see the sketch below).
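
A minimal sketch of that id scheme - the onTrim wiring is modeled after the description above, names are illustrative:

declare const buffer: { onTrim(cb: (amount: number) => void): void };

let trimmedLines = 0;   // total lines dropped from the top so far

buffer.onTrim((amount) => {   // fires when scrollback overflows
  trimmedLines += amount;
});

// stable id for a buffer line, unaffected by circular trimming
function lineId(bufferIndex: number): number {
  return trimmedLines + bufferIndex;
}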

Conclusion

IMHO utilising a WebWorker for the parser may create more overhead than it solves. I think we should rather optimize towards a scenario where the xterm.js core can run in the same process as the pty (in the node.js backend), and then create an interface that is specifically crafted to update a renderer in the frontend. There also have to be interfaces in the other direction, to propagate abstracted mouse and keyboard events from the renderer to the core.

Extra

Here's a capture of running seq 1000000 in my Monaco Editor implementation (left) and VSCode (right) in parallel.

Kapture.2021-06-11.at.17.05.05.mp4

@jerch
Member Author

jerch commented Jun 11, 2021

@mofux

Isn't

I think we should rather optimize towards a scenario where the xterm.js core can be run in the same process as the pty (in the node.js backend), and then create an interface that is specifically crafted to update a renderer in the frontend.

the same as utilizing a webworker for the core parts (+ IO chunk handling in your case, which only applies to electron/nodejs apps), while you say

IMHO utilising a WebWorker for the parser may create more overhead than it tries to solve.

Can you explain what would be different from a webworker isolation of the parser/VT emulation?

Edit: Maybe my text above was misleading - by parser/VT emulation I mean the whole inner state handling including the terminal buffer itself (not only the parser).

@mofux
Contributor

mofux commented Jun 11, 2021

Yes it's technically the same, except for the data flow.

Using a Web Worker means that all of the PTY stream data (backend) has to get to the render thread of the browser (usually via WebSockets or IPC), and from there it has to get to the WebWorker where the xterm.js core is running (usually via postMessage).

I think it would be smarter if the PTY stream and the xterm.js core were living in the same backend process, where the stream data Buffers could be easily shared (by reference) without getting copied across multiple communication layers (as outlined above).

The idea is very simple: The parser (xterm.js core) must consume the whole PTY stream without gaps to keep the internal Terminal Buffer state consistent. A renderer does not. It can at any point in time simply get a view on that Terminal Buffer state and it can skip frames if the updates come in too often.

Imagine you were implementing a new Canvas renderer for xterm.js. With every render cycle, you iterate the Buffer lines that have changed since your last render cycle and create draw instructions from them. Now, instead of actually drawing them to the Canvas, you serialize the instructions as JSON. Now imagine this renderer was living in the xterm.js core and runs in the backend.

Now in the Browser frontend, you create an implementation that can request a render from your backend, gets the serialized instructions and draws them to the real Canvas.

That is basically what I am proposing, and what I've done with the Monaco Editor renderer. The PTY data stream never has to pass through the browser render thread - it can stay in the backend with minimal overhead. All that passes to the browser are drawing instructions, whenever desired.
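
A condensed sketch of that split, with a made-up DrawInstruction shape standing in for the richer payload shown earlier:

interface DrawInstruction {
  line: number;      // viewport row to repaint
  text: string;      // new text content of the row
  attrs: number[];   // per-cell packed color/style attributes
}

// backend: turn dirty buffer lines into serializable instructions
function collectInstructions(dirty: Map<number, { text: string; attrs: number[] }>): string {
  const out: DrawInstruction[] = [];
  for (const [line, cells] of dirty) out.push({ line, ...cells });
  return JSON.stringify(out);
}

// frontend: replay the instructions against the real canvas / text model
function applyInstructions(json: string, draw: (i: DrawInstruction) => void): void {
  for (const i of JSON.parse(json) as DrawInstruction[]) draw(i);
}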

@jerch
Member Author

jerch commented Jun 11, 2021

Some greenfield out-of-the-box thinking - ideally things could work like this:

VT emulation worker:

  • handles pty IO on its own, either via websocket (browser) or direct pty binding (electron)
  • spawns CoreTerminal with its buffer logic and the buffer in SABs and does the chunk processing
  • provides several frontend endpoints to digest frontend events (like input & mouse events)

mainthread:

  • does the rendering
  • collects and delegates user events (key strokes and mouse) to the worker

The rendering part needs a bit more thinking, depending on whether we are fully SAB'ed or not. Again thinking in terms of an ideal world, we have everything in SABs. Then it is quite easy:

  • frontend has its own FPS logic triggering refreshes periodically
  • a SAB maintains entries about the currently visible buffer lines + some update indicator (whether a line was changed and needs a repaint); in a SAB this is only one O(1) lookup away, no expensive messaging needed (see the sketch below)
  • optional (non lock-free): the renderer may lock lines while outputting them, effectively blocking any write attempt in the worker (read lock)
  • optional (non lock-free): the worker might want to lock lines in the visible area to skip frame updates while writing there, to avoid screwed-up output (write lock)
  • frontend renders stuff from SAB data

This SAB model has one big advantage - it does not need any active messaging code on both ends; the rendering can be orchestrated solely by some locks on the SAB itself, thus no stacking up of messages in queues, no need to drop back to the event loop at all on the parser side. Technically even the locks are not needed (though we still would want them to avoid totally screwed-up output in between).
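
A minimal sketch of the dirty-line indicator idea - layout and names are assumptions, not a worked-out design:

const ROWS = 50;                                        // visible viewport rows
const flags = new Int32Array(new SharedArrayBuffer(ROWS * 4));

// worker side: mark a row dirty after writing its cells to the shared buffer
function markDirty(row: number): void {
  Atomics.store(flags, row, 1);
}

// main thread side: once per frame, repaint and clear all dirty rows
function refresh(paint: (row: number) => void): void {
  for (let row = 0; row < ROWS; row++) {
    // atomically swap 1 -> 0, repaint only if the row was dirty
    if (Atomics.compareExchange(flags, row, 1, 0) === 1) paint(row);
  }
}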

Of course it is not that easy in conjunction with key/mouse event delegation anymore - they probably need some messaging to the worker again (more tricky to abstract with SAB, though not impossible), so the worker would have to ground back to the top level event loop from time to time again.

Well, that's basically what I think could work and would be the most performant solution (not lock-free, but almost without any data copying between mainthread/worker). As far as I understand Mofux's solution, it uses hardcopy messaging of changed buffer lines, which creates the need to store things on both ends (waste of memory) and is much slower in transfer.
(Note: I don't really care whether the parser worker is a real webworker or a nodejs worker thread - as long as the memory sharing still works, it should not make much of a difference.)

Using a Web Worker means that all of the PTY stream data (backend) has to get to the render thread of the browser (usually via WebSockets or IPC)

Why do you think so? Imho websockets are fine within webworkers.
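
For illustration - a worker can own its pty socket directly (the endpoint and parse are placeholders):

// parser.worker.ts - the worker owns its own socket
declare function parse(data: Uint8Array): void;   // VT emulation entry point (assumed)

const socket = new WebSocket('wss://example.invalid/pty');   // placeholder endpoint
socket.binaryType = 'arraybuffer';
socket.onmessage = (ev: MessageEvent) => {
  // pty chunks feed straight into the parser without touching the main thread
  parse(new Uint8Array(ev.data as ArrayBuffer));
};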

@mofux
Contributor

mofux commented Jun 11, 2021

Why do you think so? Imho websockets are fine within webworkers.

It's technically possible, but I think impractical for most application architectures - they likely establish only one single websocket connection for the whole frontend session (dealing with authentication and stuff) and dispatch messages over that single connection. Creating additional websocket connections means that you have to expose additional websocket endpoints on your server, map them to sessions, make sure they are authenticated and what not. It's certainly possible, but impractical IMHO.

Also, when looking at the electron world, I'm not so sure whether a WebWorker can create an IPC channel directly with the main thread.

That being put aside - my point was to keep the PTY data stream as close to the core as possible, whenever possible 😁 Of course, if the backend is not node.js, then you'll likely run the xterm core in the frontend anyway.

I'd love to carry these discussions forward via a video call. Unfortunately my vacation starts tomorrow and I'll not be back on my PC until mid July 🥺

@Tyriar
Member

Tyriar commented Jun 11, 2021

I discovered this design has several advantages:

The biggest issue imo is that other actions like scrolling are bound to input latency; also a model like this breaks some assumptions and ends up breaking complex integrations like task problem matchers and Live Share in VS Code.

I think it would be smarter if the PTY stream and the xterm.js core were living in the same backend process, where the stream data Buffers could be easily shared (by reference) without getting copied across multiple communication layers (as outlined above).

I'd like to support the ability to do this via the node target of xterm.js #2749 and the serialize addon (this could build smarter diffs too). I don't think it should be the default or recommended setup though, because it's much more complicated to set up and comes with its own set of problems for little payoff imo. Having this as an "it's possible to do" thing seems like enough imo. Those caveats look like improvements that could be done in the lib so you don't need the hacks.

The idea is very simple: The parser (xterm.js core) must consume the whole PTY stream without gaps to keep the internal Terminal Buffer state consistent. A renderer does not. It can at any point in time simply get a view on that Terminal Buffer state and it can skip frames if the updates come in too often.

Another idea is to just let almost all of xterm.js live inside the worker, use transferControlToOffscreen, and have the main thread side just act as a passthrough. I don't think this would work with the DOM renderer though (a fallback case). That way we don't need to care about transfer of data - it all lives on the worker. The main downside being an extra hop = increased latency, even if it's just local.
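
A minimal sketch of that wiring, using the real transferControlToOffscreen API (the worker-side renderer is assumed):

// main thread: hand the canvas over to the worker once
const canvas = document.querySelector('canvas')!;
const offscreen = canvas.transferControlToOffscreen();
const worker = new Worker('terminal.worker.js');   // worker-side renderer assumed
worker.postMessage({ type: 'init', canvas: offscreen }, [offscreen]);
// worker side: ev.data.canvas.getContext('2d') and draw as usual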

Also, when looking into the electron world, I'm not so sure if a WebWorker can create an IPC communication directly with Main thread.

I don't think this is possible, I'd love it if it was though microsoft/vscode#113705 (comment)

I'd love to carry these discussions forward via a video call. Unfortunately my vacation starts tomorrow and I'll not be back on my PC until mid July 🥺

We can always set one up 👍, it's just tricky aligning everyone's schedule.

@jerch
Member Author

jerch commented Jun 11, 2021

It's technically possible, but I think impractical for most application architectures - they likely establish only one single websocket connection for the whole frontend session (dealing with authentication and stuff) and dispatch messages over that single connection. Creating additional websocket connections means that you have to expose additional websocket endpoints on your server, map them to sessions, make sure they are authenticated and what not. It's certainly possible, but impractical IMHO.

Kinda don't get it - wouldn't that be an argument against your own idea of moving the chunk IO closer to ??? - sorry, I currently don't know what ??? is for you, because you say it is not a worker. But workers are the only option in a browser env that don't involve really nasty multi-page/iframe hacks (which are a clear no-go).
Furthermore, if peeps don't want fast terminal IO delivery and instead push everything through one single socket in the frontend - well, then they have to dispatch things on their own, including the performance penalty. All we could do about it is provide some alternative convenient way to inject data into the worker, reachable from the mainthread.

I'd love to carry these discussions forward via a video call. Unfortunately my vacation starts tomorrow and I'll not be back on my PC until mid July 🥺

This topic is not pressing at all, we should take all the time we need to get through the cumbersome details.
Enjoy your holidays 😃

@Tyriar
Member

Tyriar commented Jun 11, 2021

@jerch @mofux sent out an invite

@jerch
Member Author

jerch commented Jun 12, 2021

I think it would be smarter if the PTY stream and the xterm.js core were living in the same backend process, where the stream data Buffers could be easily shared (by reference) without getting copied across multiple communication layers (as outlined above).

I'd like to support the ability to do this via the node target of xterm.js #2749 and the serialize addon (this could build smarter diffs too). I don't think it should be the default or recommended setup though because it's much more complicated to setup and comes with its own set of problems for little pay off imo. Having this as a it's possible to do thing seems like enough imo. Those caveats seem like improvements could be done in the lib so you don't need the hacks.

Doing the VT emulation in the backend, like on the server for browser based apps? 😱 Imho that's not a good idea as a general purpose solution - kinda all web driven cluster orchestrators with potentially tons of terminals attached would fall short here. Beside the really high costs of server side emulation, constantly pushing screen updates over the wire will create much more traffic and show bad latency. With the PTY data stream we kinda already have the best deal in terms of packrate/latency.

What would be possible: Once we got the browser refs completely out of common, we can create several build targets:

  • webworker target meant for browsers (if we are keen, maybe also to work with nodejs' worker-threads, no clue yet about semantic differences)
  • single nodejs process target for offscreen emulation, which also can be used for electron on backend side with some additional IPC shim

This might be important to note: for me electron is the special case here and should not be the driving force behind all that - there are many other frameworks with different semantics, and I see a vanilla browser as the default integration env. You might tend to disagree here, I know that you all are very electron centric, but imho moving xterm.js from a pure browser frontend target towards more electron specific needs by default would be a big mistake.

@jerch
Member Author

jerch commented Jun 16, 2021

(Slightly offtopic and more a reminder for later on ...)

There is another possibility linked to the parser worker isolation - a WASM implementation of the utf32 converter + the parser class (the inner sequence parsing logic). Not sure yet if that is a viable goal, as it raises the maintenance complexity, yet it would be easily possible with a C or Rust replacement, since both classes are very C-ish, operating on fixed borrowed memory.

Some rough estimations of possible gains here:
Did a quick test with data.sixel from https://github.com/jerch/sixel-bench, which is a 119 MB file of sixel video frames. This type of data triggers roughly the following call path during parsing:

... --> InputParser.parse --> EscapeSequenceParser.parse --> DCS.hook|put|unhook
               |                                                  |
        Utf8ToUtf32.decode                                   (image decoding in worker)

For this 119 MB pile of data I see the following runtime numbers in Chrome and Firefox (all self time in ms):

call                         Chrome   Firefox
InputParser.parse                30        47
Utf8ToUtf32.decode              335      1030
EscapeSequenceParser.parse      514      1400
main thread runtime            2500      4500
image decoding (worker)        5200   unknown
total runtime                  6800     10400

Of course most of the workload is created by image decoding in the worker, which is not further considered here. The interesting bits on the main thread are the self times of Utf8ToUtf32.decode and EscapeSequenceParser.parse, which add up to 849 ms on Chrome resp. 2430 ms on Firefox (or, in % of the main thread script runtime: 34% resp. 54%).

These numbers are surprising for 2 reasons:

  • very high numbers on both engines, considering the fact that both calls just pass the data along without any real conversion
  • Firefox does a bad job optimizing these calls; Chrome is ~3 times faster

And that's where a WASM implementation can help to level the field. With WASM there is no JS JIT involved anymore; the code runs pre-optimized from the first call. Also those big engine differences should go away (it still depends on the engine's WASM speed, which I don't know for Firefox).

Of course, whether we can gain anything for these tasks remains uncertain until we actually try. Going back to my Rust/WASM experiments during the buffer reshaping two years ago, I remember that the parser logic was ~30% faster compared to the JS numbers in nodejs/Chrome. I have no real numbers for Firefox here, but if the promise of WASM to perform fairly equally holds, that would speed up the inner parsing ~3 times on Firefox.

TL;DR
Utf8ToUtf32.decode and EscapeSequenceParser.parse are runtime-hungry tasks during parsing that can probably perform better with a WASM implementation. Estimated speedup is ~30% on Chrome and ~3 times on Firefox.

Edit:
Indeed, a first dummy test just for Utf8ToUtf32.decode shows roughly the same runtime of only 250 ms for both browsers with Rust/WASM. No clue why Firefox performs this badly in the first place - maybe we use V8-idiomatic JS code, idk. The sad part about WASM: it currently does not support copy-free mapping of arraybuffers from JS to WASM and vice versa, thus the dummy test contains a nonsense initial copy step (see WebAssembly/design#1162). This lowers the general usefulness of WASM for xterm.js a lot, as most tasks are quite data intensive. The nice gains from optimized logic execution would get eaten again by useless mem copies. What a silly trade.

Edit2:
Oh well, it seems C/WASM is actually faster than looping in JS for uint8 --> uint32 buffer copies ("byte spreading"). Given:

const src = new Uint8Array(FIXED_SIZE);
const dst = new Uint32Array(FIXED_SIZE);

and trying to pump the 119MB from above chunkified as 64kB blocks (all Chrome numbers):

  • fastest (but not applicable if there are other padding conditions) - 50 ms
    dst.set(src)
  • second - naive C impl with WASM/emscripten - 100ms
    extern "C" {
      void copy(uint32_t length) {
        for (uint32_t i = 0; i < length; ++i) DST[i] = SRC[i];  // global static
      }
    }
  • third - naive Rust/WASM - 170 ms
    unsafe {
      for i in 0..length as usize {
        DST[i] = SRC[i] as u32;  // global static mut
      }
    }
  • fourth - looping in JS - 230 ms
    for (let i = 0; i < length; ++i) dst[i] = src[i];

That's really surprising - both WASM impls are way faster than "native" JS loops, with emscripten quite ahead. Would not have thought that a loop scope/context is that cheap in WASM (or this expensive in JS). Update: Oh dear - with the -msimd128 switch emscripten runs in ~75 ms. (No clue what changed, the file got bigger, but my inspector refuses to disassemble it with the simd instructions.)
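
For context, hypothetical JS glue for the C variant above, showing where the extra copy into wasm linear memory happens - SRC_PTR, FIXED_SIZE and the exported copy are assumptions about the module layout:

declare const wasmBytes: BufferSource;   // the compiled module bytes
const FIXED_SIZE = 65536;                // 64kB chunk size from above
const SRC_PTR = 0x10000;                 // assumed offset of the static SRC buffer

const { instance } = await WebAssembly.instantiate(wasmBytes, {});
const memory = instance.exports.memory as WebAssembly.Memory;
const wasmCopy = instance.exports.copy as (len: number) => void;
const srcView = new Uint8Array(memory.buffer, SRC_PTR, FIXED_SIZE);

function pump(chunk: Uint8Array): void {
  srcView.set(chunk);      // the "nonsense initial copy" into wasm memory
  wasmCopy(chunk.length);  // byte spreading runs inside wasm
}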

Summary of the short WASM interlude (won't further hijack the thread):
With the UTF8 decoder and the sequence parser we have two candidates in core that would benefit a lot from a wasm rewrite (probably at least doubling the parsing chain throughput even on Chrome). While the UTF8 decoder is a no-brainer, the parser itself is much work (I have a C parser flying around, but it does not compile with emscripten due to computed gotos and other incompatible interfaces). A lot of rework up to the bootstrapping code (wasm modules have to be awaited) + much harder maintenance afterwards with different languages/buildtools in the codebase - guess it is a NO for now.
(Though I might come up with a C/WASM SIXEL decoder drop-in for the image addon. We'll see.)

@jerch
Member Author

jerch commented Nov 16, 2021

Out of scope for me, thus closing.

@jerch jerch closed this as completed Nov 16, 2021