Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wasm-mt] Remove two-phase suspend; fix state transitions in DS server; fix merge #73305

Merged
merged 12 commits into from
Aug 5, 2022

Conversation

lambdageek
Copy link
Member

@lambdageek lambdageek commented Aug 3, 2022

Grab bag of threading fixes:

  1. Remove the coop two-phase transition (Partially revert 6726fae). This was based on a misunderstanding of how Emscripten works: when the main thread is blocked in a concurrency primitive like sem_wait, it is still processing queued calls from other threads. So there is no need to first suspend the worker threads and then suspend the main thread. The implementation of two-phase suspend had a bug where it would suspend worker threads twice, making the suspend increase by 2. Since resume only decremented the count by 1, this lead to a suspend count overflow. Fixes [wasm-mt] Sampling thread requests resume for a thread with !(suspend_count > 0) #72857
  2. Once the diagnostic server attaches to the runtime, switch it to GC Safe mode when it returns to JavaScript. That is, while the diagnostic server is reacting to messages in the JS event loop, it is considered suspended by the runtime. When it calls into C, switch to GC Unsafe (which may block if there's a STW happening). Add thread state transitions when we come back to C, and when we wait.
  3. Mark the wasm diagnostic server thread as "no sample; no gc" which means that we don't consider it for STW when there's a GC or a sample profiler active. This is how we treat utility threads (including the non-wasm diagnostic server thread) on other platforms.
  4. Fix a bad signature for cwraps.mono_wasm_event_pipe_enable due to a mistake in a previous merge
  5. Added a new browser-threads sample

@ghost
Copy link

ghost commented Aug 3, 2022

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

Grab bag of threading fixes:

  1. Remove the coop two-phase transition (Partially revert 6726fae). This was based on a misunderstanding of how Emscripten works: when the main thread is blocked in a concurrency primitive like sem_wait, it is still processing queued calls from other threads. So there is no need to first suspend the worker threads and then suspend the main thread. The implementation of two-phase suspend had a bug where it would suspend worker threads twice, making the suspend increase by 2. Since resume only decremented the count by 1, this lead to a suspend count overflow. Fixes [wasm-mt] Sampling thread requests resume for a thread with !(suspend_count > 0) #72857
  2. Once the diagnostic server attaches to the runtime, switch it to GC Safe mode when it returns to JavaScript. That is, while the diagnostic server is reacting to messages in the JS event loop, it is considered suspended by the runtime. When it calls into C, switch to GC Unsafe (which may block if there's a STW happening). Add thread state transitions when we come back to C, and when we wait.
  3. Mark the wasm diagnostic server thread as "no sample; no gc" which means that we don't consider it for STW when there's a GC or a sample profiler active. This is how we treat utility threads (including the non-wasm diagnostic server thread) on other platforms.
  4. Fix a bad signature for cwraps.mono_wasm_event_pipe_enable due to a mistake in a previous merge
Author: lambdageek
Assignees: lambdageek
Labels:

arch-wasm, area-Threading-mono

Milestone: -

@lambdageek lambdageek requested a review from kg August 3, 2022 15:48
@lambdageek
Copy link
Member Author

Screen.Recording.2022-08-03.at.16.09.18.mov

@lambdageek
Copy link
Member Author

/azp run runtime-wasm

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@lambdageek lambdageek force-pushed the wasm-mt-fix-two-phase-coop-suspend branch from f7dc809 to 0c72fe8 Compare August 3, 2022 23:55
@lambdageek
Copy link
Member Author

/azp run runtime-wasm

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@lambdageek
Copy link
Member Author

/azp run runtime-wasm

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@radical
Copy link
Member

radical commented Aug 4, 2022

Wasm/library tests log:

ChromeDriver was started successfully.
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:107546 "MONO_WASM: onConfigLoaded() failed: {}"
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:107597 "MONO_WASM: Stacktrace: \n"
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:107637 "TypeError: Cannot set properties of undefined (setting 'DOTNET_MODIFIABLE_ASSEMBLIES')\n    at Object.onConfigLoaded (http://127.0.0.1:36645/main.js:16:78)\n    at vc (http://127.0.0.1:36645/dotnet.js:3:115444)\n    at async oc (http://127.0.0.1:36645/dotnet.js:3:106316)\n    at async http://127.0.0.1:36645/dotnet.js:3:104090"
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:15956 "Failed to load config file ./mono-config.json TypeError: Cannot set properties of undefined (setting 'DOTNET_MODIFIABLE_ASSEMBLIES')"
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:15956 "Failed to load config file ./mono-config.json TypeError: Cannot set properties of undefined (setting 'DOTNET_MODIFIABLE_ASSEMBLIES')"
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:17377 Uncaught (in promise)
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:17377 Uncaught (in promise)
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:17377 Uncaught (in promise)
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:17377 Uncaught (in promise)
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:17377 Uncaught (in promise)
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:17377 Uncaught (in promise)
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:104195 Uncaught (in promise)
[15:18:50] info: Waiting to flush log messages with a timeout of 120 secs ..
[15:18:50] fail: Flushing log messages failed with: System.OperationCanceledException: The operation was canceled.
                    at System.Threading.Channels.AsyncOperation`1.GetResult(Int16 token)
                    at System.Threading.Channels.ChannelReader`1.ReadAllAsync(CancellationToken cancellationToken)+MoveNext()
                    at System.Threading.Channels.ChannelReader`1.ReadAllAsync(CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
                    at Microsoft.DotNet.XHarness.CLI.Commands.Wasm.WasmTestMessagesProcessor.RunAsync(CancellationToken token) in /_/src/Microsoft.DotNet.XHarness.CLI/Commands/WASM/WasmTestMessagesProcessor.cs:line 66
                    at Microsoft.DotNet.XHarness.CLI.Commands.Wasm.WasmTestMessagesProcessor.RunAsync(CancellationToken token) in /_/src/Microsoft.DotNet.XHarness.CLI/Commands/WASM/WasmTestMessagesProcessor.cs:line 66
                    at Microsoft.DotNet.XHarness.CLI.Commands.Wasm.WasmTestMessagesProcessor.CompleteAndFlushAsync(Nullable`1 timeout) in /_/src/Microsoft.DotNet.XHarness.CLI/Commands/WASM/WasmTestMessagesProcessor.cs:line 106. Ignoring.
[15:18:50] fail: Tests timed out. Killing driver service pid 11169
[15:18:50] fail: Application has finished with exit code TIMED_OUT but 42 was expected

@pavelsavara Is this the one you were fixing?
The xharness exception can be ignored, I have a fix for it locally too.

@radical
Copy link
Member

radical commented Aug 4, 2022

AOT:

[15:35:02] info: Starting:    System.Collections.Tests.dll
[15:36:25] info: 219168800
[15:36:25] info: /datadisks/disk1/work/A2AC0902/w/A97C0939/e/wasm_build/AppBundle/dotnet.js:1881: 211542912
[15:36:25] info:       throw ptr;
[15:36:25] info:       ^
[15:36:25] info: 
[15:36:25] info: Process v8 exited with 1

Each call to begin_suspend_request_suspension_cordially may increment
the suspend count.  But in STW we only resume each thread once.

In 6726fae we added a second phase of
STW to full coop on WebAssembly in order to suspend the browser thread
after all the worker threads have been suspended in order to avoid
some deadlocks that rely on the main thread continuing to process
async work on behalf of the workers before they reach a safepoint.

The problem is that for worker threads we could end up calling
begin_suspend_request_suspension_cordially twice. If the thread
self-suspends after the first call, the second call will increment the
suspend count.  As a result, when we restart the world, the thread
will decrement its suspend count, but still stay suspended.  Worse, on
the _next_ stw, we will increment the suspend count two more times and
decrement once on the next restart, etc.

Eventually the thread will overflow the suspend counter and we will
assert `!(suspend_count > 0)`.

Also change `THREAD_SUSPEND_COUNT_MAX` to `0x7F` (from `0xFF`) - the
suspend count is signed, so the roll-over from 127 to -128 is where we
should assert

Fixes dotnet#72857
include thread states and thread ids where available
…twice"

This reverts commit 92f52ab7ed1cfaa1a4f66e869a8d9404e066f1b2.
Remove mono_threads_platform_stw_defer_initial_suspend

The motivation for it in 6726fae
was unfounded.  There is no need to suspend the main browser thread
after the other threads: suspension on wasm uses `sem_wait` which on
Emscripten on the main thread is implemented using a busy wait
`__timedwait_cp` which processes queued calls.  So even if we suspend
the main thread first, it will still allow other threads in GC Safe to
make progress if they're using syscalls.
…_GC flag

The diagnostic server worker spends most of its time in the JS event
loop waiting for messages.  After we attach to the runtime, we need to
switch to GC Safe mode because the diagnostic server may not ever
reach a safepoint (for example if no more DS events arrive).

Conversely, when we call from JS into the C diagnostic server, we need
to enter GC Unsafe mode (and potentially safepoint).

Also mark the diagnostic server threads with the NO_GC flag - this
thread does not manipulate managed objects so it doesn't need to stop
for GC STW.
@lambdageek
Copy link
Member Author

lambdageek commented Aug 4, 2022

AOT:

when I repro this locally, I get a stack overflow:

https://gist.github.com/lambdageek/8c07263665415982c3b793150a1626b4

(I ran the tests with dotnet.sh build /t:Test /p:TargetOS=Browser /p:TargetArchitecture=Wasm /p:RunAOTCompilation=true)

Checking main...


Update

main didn't crash. But neither did my PR when I re-ran it. it's some kind of non-deterministic issue

@lambdageek lambdageek force-pushed the wasm-mt-fix-two-phase-coop-suspend branch from 866504e to 8a8ab6c Compare August 4, 2022 20:08
Copy link
Member

@kg kg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the threading infra well enough to firmly approve the logic changes here, but the diffs all look clean to me and it makes sense

lambdageek and others added 2 commits August 5, 2022 07:52
Co-authored-by: Katelyn Gadd <kg@luminance.org>
Co-authored-by: Katelyn Gadd <kg@luminance.org>
@lambdageek
Copy link
Member Author

/azp run runtime-wasm

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@lambdageek
Copy link
Member Author

Build Browser wasm Linux Release LibraryTests_AOT throw ptr failure is known

@lambdageek lambdageek merged commit 2ad7b1a into dotnet:main Aug 5, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Sep 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-wasm WebAssembly architecture area-VM-threading-mono
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[wasm-mt] Sampling thread requests resume for a thread with !(suspend_count > 0)
3 participants