
WASM startup time optimization tracking issue #63809

Open
kg opened this issue Jan 14, 2022 · 6 comments
Assignees
Labels
arch-wasm WebAssembly architecture area-VM-meta-mono tenet-performance Performance related issue tracking This issue is tracking the completion of other related issues. User Story A single user-facing feature. Can be grouped under an epic.
Milestone

Comments

@kg
Member

kg commented Jan 14, 2022

(old contents of issue migrated to comment below)

During .NET 9 development, the runtime and Blazor teams made improvements to WASM startup time in multiple areas, but more work remains to be done. Some observations and areas to continue to focus on:

  • large amounts of one-off code on the startup path, much of which is cold. in AOT this is less problematic, but the code still has to be loaded from the wasm binary and compiled by the browser.
    • (interpreter) wrappers for things like cctors, synchronization, and initialization
    • method bodies for cctors and startup code
    • finding ways to statically evaluate this initialization code at build time and bake constants into the binary could pay off tremendously here; CoreCLR has solutions for this already in specific cases
    • more efficient ways to populate arrays/lists/dictionaries with smaller IL would also be very profitable. some users (e.g. Uno) generate cctors that populate massive dictionaries and those methods have huge amounts of IL
  • reflection-driven functionality loading metadata
    • metadata parsing is hot in both interp and AOT
    • strcmp and binary/linear searches are hot in both interp and AOT when scanning for methods by name etc
    • NativeAOT has a solution for this where they bake frozen optimized representations of metadata into the binary that can be cheaply utilized with much less initialization work; we could do that too
  • runtime code generation that kicks us into interp
    • if used incorrectly, json/xml serialization can cause this. migrating to source generators avoids it
    • some especially naughty code may be using linq.expressions or s.r.e, so we should be keeping an eye out for it. i've seen dependencies on both pop up, and s.r.e itself has expensive cctors
  • generic instance explosion
    • more metadata decoding/creation
    • more methods to compile in interp
    • more wasm function bodies to load and compile in AOT
    • SIMD is a major culprit here, but general bcl and blazor also have sources of it, e.g. static void RegisterSomething<T> () => SomeList.Add(typeof(T));
    • there are a lot of intrinsics with useless generic parameters that cause an explosion of method instances, e.g. static T Unsafe.NullRef<T> () => null where T is a class. thorough inlining of all relevant intrinsics counteracts the explosion
  • in interpreted mode, interp codegen accounts for 40-60% of total cpu time during startup
    • much of this code only runs a few times and doesn't tier up; we're not bottlenecked on optimization
    • a sizable chunk of this is in the initial IL decoding and basic block building
    • early DCE and early cprop could help a lot here; CoreCLR has both and we don't
  • many mono data structures are heavy on malloc/free, which adds up in a thousand-cuts fashion to multiple percentage points of wasted CPU time during startup, e.g. linked lists and GHashTables
    • this adds memory usage overhead as well
    • in some cases we allocate a data structure and then only ever store 0-2 items into it before freeing it
  • strcmp, strlen, and g_str_hash are expensive
    • we spend a silly amount of time during startup measuring, hashing, and comparing constant strings over and over, spread across various call sites
  • blazor is missing prefetch directives in its template HTML for key files
  • we currently kick off requests for every dependency all at once during startup, which means less-important requests can block more urgent ones and delay overall startup. ordering these requests and deferring the low-importance ones can allow startup to begin sooner
    • right now we need icudt very early in startup; fixing this would allow us to defer that fairly large download until later
  • memset zeroing is still hot during startup, though we've made progress in this area. in many cases we are zeroing memory that is already known to be pre-zeroed
    • a large chunk of this is due to emscripten and its two allocators (dlmalloc and mimalloc) not knowing how to exploit the fact that wasm sbrk returns zeroed memory
    • our new custom mmap can exploit this, but we need to make comprehensive code changes to take advantage of that
  • in blazor, AOT'd startup is mostly dominated by just running managed code
    • this contributes to interp startup being dominated by codegen
    • historically a lot of this is initializing things like serialization, dependency injection, or routing
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Jan 14, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@kg kg added arch-wasm WebAssembly architecture tracking This issue is tracking the completion of other related issues. labels Jan 14, 2022
@ghost

ghost commented Jan 14, 2022

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

This issue tracks various parts of startup performance that need investigation and potential improvement and describes some potential solutions.

  • Redundant startup i/o: We load lots of data at startup every time (IL, assembly metadata, timezone data, etc) and then load it into the heap. This involves multiple copies and is generally a waste of time.
  • Interpreter code generation: We spend a bunch of time generating interpreter IR from managed IL and in many cases that code runs once. We could optimize this out by pre-generating the IR and shipping it instead of IL.
  • Native static constructors and init code runs at startup and we could do this at build time. See Reboot EVAL_CTORS with the new wasm-ctor-eval emscripten-core/emscripten#16011, recently added to emscripten.
  • Our init code and managed static constructors also could largely or entirely be run at build time, built on top of the solution for native. While we would need to do work to wire up things like JS handles and ensure we exempt static cctors that need to run after build, it's likely that we could optimize a lot of this out as well. Part of the cost of cctors is due to the interpreter having to do code gen (see above), because many cctors are huge.

@kg kg added the tenet-performance Performance related issue label Jan 14, 2022
@lewing lewing added this to the 7.0.0 milestone Jan 18, 2022
@lewing lewing removed the untriaged New issue has not been triaged by the area owner label Jan 18, 2022
@pavelsavara
Member

Could we store the IR form (with a hash) in some browser cache for the next startup?
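One possible shape for this suggestion, sketched with the storage abstracted behind an interface so a browser could back it with the Cache API or IndexedDB while tests use an in-memory map. Everything here is hypothetical, not existing runtime code: on a cache hit keyed by the IL hash, interpreter codegen is skipped entirely on the next startup.

```typescript
interface IRStore {
  get(key: string): Promise<Uint8Array | undefined>;
  put(key: string, ir: Uint8Array): Promise<void>;
}

// Look up previously transformed IR by (method, IL-hash) key; on a miss,
// transform the IL as usual and persist the result for the next startup.
async function getOrTransformIR(
  methodName: string,
  ilHash: string,
  store: IRStore,
  transformIL: () => Uint8Array,
): Promise<Uint8Array> {
  const key = `${methodName}:${ilHash}`;
  const cached = await store.get(key);
  if (cached) return cached;       // cache hit: skip codegen
  const ir = transformIL();        // cache miss: do codegen once
  await store.put(key, ir);
  return ir;
}
```

Keying on a content hash means a stale cache entry is simply never hit after the app updates, so no explicit invalidation is needed beyond occasional eviction.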

@radical radical modified the milestones: 7.0.0, 8.0.0 Aug 12, 2022
@SamMonoRT
Member

cc @fanyang-mono for the first item.

@lewing lewing assigned maraf and kg and unassigned maraf Jul 22, 2023
@lewing lewing modified the milestones: 8.0.0, 9.0.0 Jul 24, 2023
@kg
Member Author

kg commented Jul 24, 2023

Startup I/O is covered by the memory snapshot, I believe? IR caching could be too if we move the snapshot later in startup.

@kg
Member Author

kg commented Jul 24, 2024

Archiving the previous version of this issue below. Most of it is outdated, and some is no longer relevant due to work we've done since.


Current list of items from examining a small application (raytracer) as of 2024-03:

  • response.arrayBuffer() and fetch() both take a long time during startup. can we make them faster? (~989ms + ~598ms)
    • it's possible that response.blob().stream().getReader().read() could be used to read responses directly into the wasm heap for slightly faster startup. it appears this only works in Firefox and Chrome, but it's worth testing.
  • sizable amount of time in strcmp for metadata lookups, etc. vectorizing it could help (~28ms)
  • lots of time spent in memset and memcpy; most of it is from emscripten's implementation of mmap, which is used by sgen to allocate heap. (~130ms)
    • Once we upgrade to a newer version of emscripten, we should use -mbulk-memory to enable a faster/smaller version of libc memory operations, which will improve on this
    • It's possible we could pre-reserve a block of zeroes in the wasm heap ready for sgen to claim, which would skip the need to allocate and zero them at startup.
    • We can use sbrk to skip the memset when allocating memory we never plan to get rid of. [wasm] use sbrk instead of emscripten mmap or malloc when loading bytes into heap #100106
  • _emscripten_get_now is very expensive, which makes mono_time_track_xxx very expensive. (~250ms)
  • Lots of time spent running the JS garbage collector during startup (> 210ms)
    • We should do some allocation profiling to see how much we can do to reduce allocations, then update this tracking list with specific allocation hotspots
    • It looks like a lot of this may be the symbol decoder ts, which is now deferred until after startup [wasm] Defer parsing of wasm stack symbolication maps #99974
  • Lots of time spent doing monoeg hash table lookups and insert operations (> 80ms)
    • We should investigate whether we can easily replace this hash table with one from the current decade
    • We should investigate whether parts of the runtime using a generic hash table could benefit from using a more specialized container - zoltan mentioned that in many cases we never delete items, for example, and I suspect many are never modified either
    • hash operations are also high in the profile for AOT, but less heavy than interp
  • mono_class_implement_interface_slow is hot, most of this is during vtable setup (~97ms)
    • This is worse for interp workloads but AOT will have to pay this cost too
    • A large portion of this is mono_class_has_variant_generic_params, which seems cacheable
    • A cache for these checks achieves a 50-60% hit rate and improves startup perf, PR pending
  • interpreter codegen is heavy in non-AOT as one would expect (> 1600ms)
    • interp_transform_method is approximately 2/3 of the total time
    • generate_code is approximately 1/3
    • interp_optimize_code is around 175ms, a rounding error in comparison
  • metadata decoding is a hotspot
  • lots of calls to free during startup; many of these could be optimized out via smart use of arenas/mempools
    • metadata seems to be doing lots of small frees during startup
    • interpreter codegen too
    • a lot of this is freeing slist nodes
  • lots of time spent growing tables or buffers during AOT startup (~196-400ms)
    • unclear which buffers are being grown and why, the stack is bad
    • our default initial heap size is probably far too small, even a simple test application has to incrementally grow it multiple times just to finish running sgen_gc_init (addressed by doubling default size)
    • we should improve the algorithm that determines the default size, since it seems to underestimate how much memory we need
  • instantiate_symbols_asset burns time calling text() (~93ms) and is overall very expensive (~371ms)
    • this shouldn't affect production scenarios, i think, but might affect development or testing workflows? it's unclear to me what this is for
    • ~40ms is spent running a regex to match line breaks
  • jiterpreter interp entry wrapper compilation burns a lot of time decoding strings during interp_create_method_pointer_llvmonly (~14ms of execution time, ~13ms of which is just utf8ToString)
  • mono_wasm_get_assembly_exports is heavy, at least in AOT (~180ms)
    • ~150ms of this is in mono_wasm_bind_assembly_exports
    • ~13ms is JSMarshalerArgument.AssertCurrentThreadContext
      • Most of this appears to be JSProxyContext.cctor and aot initialization
    • ~36ms is mono_runtime_class_init_full for the module's generated interop initializer ([wasm] Improvements to startup performance of mono_wasm_get_assembly_exports #99924)
      • ~21ms in System.Version.TryParse
      • ~16ms in GetCustomAttribute
      • This is all because the source generator (see JSExportGenerator.cs) needs to check whether the current runtime is NET7 and do specific behavior if so. We have to do this so that nugets will work on older runtimes
    • ~105ms is JSFunctionBinding.BindManagedFunction
  • ~48ms spent in decode_patch in AOT
  • Firefox profiler shows ~1.1% of cpu time spent in emscripten fs lookupPath
  • Firefox profiler shows 156ms (~3.6%) under mono_runtime_install_appctx_properties
  • Blazor startup is mostly dominated by Actually Running Code, but many of the hotspots listed above (metadata decoding / interp code generation for two examples) apply still
    • InvokeAsync uses JSON serialization, and first-time init for that is expensive
    • Dependency injection during startup is expensive
  • dlmalloc/dlfree shouldn't be used at all on wasm (this was emscripten's dlmalloc, not ours)
  • When we alloc0 from a mempool, we do a regular mempool malloc and then a non-constant-size memset inside of the implementation.
    • Zero the whole mempool at creation time? over 50% of mempool allocs are alloc0
    • Rearrange mempool alloc0 to do a constant size memset for constant size arg?
  • ~5% of startup time in some profiles is spent underneath 'inflate_generic' operations
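The streaming idea from the first bullet in the list above (reading a response directly into the wasm heap instead of materializing a whole ArrayBuffer first) might look roughly like this. The original comment mentions blob().stream().getReader(); response.body's reader has the same shape, so this hedged sketch uses it directly. Names and structure are illustrative, not real runtime code.

```typescript
// Read a Response body chunk-by-chunk straight into wasm linear memory at
// heapOffset, returning the total number of bytes written.
async function readResponseIntoHeap(
  response: Response,
  memory: WebAssembly.Memory,
  heapOffset: number,
): Promise<number> {
  const reader = response.body!.getReader();
  let written = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done || !value) return written;
    // Re-create the view for each chunk: if the wasm heap grew during the
    // await, the old ArrayBuffer is detached and a saved view would be stale.
    new Uint8Array(memory.buffer, heapOffset + written, value.length).set(value);
    written += value.length;
  }
}
```

The per-chunk view recreation is the subtle part; caching a single Uint8Array over memory.buffer across awaits would break as soon as the heap grows.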

Profiles of startup for the blazor 8.0 samples:
interpreted, high precision, firefox: https://profiler.firefox.com/public/4tgsv0kh4xvgrckp4w9dcqvyen8x1ftnm8df5d8/calltree/?globalTrackOrder=0&invertCallstack&thread=0&transforms=cr-combined-16-43063~cr-combined-13-43060~cr-combined-24-43071~cr-combined-15-43062~f-combined-0cjnxyb&v=10
aot, low precision, chrome: https://profiler.firefox.com/public/kbd0e1vks074a5af67g2ntzwvwbx20mhgag5rvr/calltree/?globalTrackOrder=0w3&hiddenGlobalTracks=023&hiddenLocalTracksByPid=37252-0w4~29452-0ws~1948-0w4~0-0&invertCallstack&thread=6&timelineType=category&transforms=df-31~mf-193~mf-265~mf-191~mf-190~mf-294~df-1&v=10

Profiles of startup for @maraf 's Money application on 9.0 preview 2:
interpreted, high precision, firefox: https://share.firefox.dev/3TRtKkM
aot, low precision, chrome: https://share.firefox.dev/3IQUT0P

Archived work items from the past:

  • Redundant startup i/o: We load lots of data at startup every time (IL, assembly metadata, timezone data, etc) and then load it into the heap. This involves multiple copies and is generally a waste of time. (@fanyang-mono handling this)
  • Interpreter code generation: We spend a bunch of time generating interpreter IR from managed IL and in many cases that code runs once. We could optimize this out by pre-generating the IR and shipping it instead of IL. ([mono][interp] Implement tiering within the interpreter #65369)
  • Native static constructors and init code runs at startup and we could do this at build time. See Reboot EVAL_CTORS with the new wasm-ctor-eval emscripten-core/emscripten#16011, recently added to emscripten.
  • Our init code and managed static constructors also could largely or entirely be run at build time, built on top of the solution for native. While we would need to do work to wire up things like JS handles and ensure we exempt static cctors that need to run after build, it's likely that we could optimize a lot of this out as well. Part of the cost of cctors is due to the interpreter having to do code gen (see above), because many cctors are huge.

@lewing lewing modified the milestones: 9.0.0, 10.0.0 Jul 29, 2024
@pavelsavara pavelsavara modified the milestones: 10.0.0, Future Dec 12, 2024