[wasm-mt] Remove two-phase suspend; fix state transitions in DS server; fix merge #73305

lambdageek · 2022-08-03T15:31:38Z

Grab bag of threading fixes:

Remove the coop two-phase transition (Partially revert 6726fae). This was based on a misunderstanding of how Emscripten works: when the main thread is blocked in a concurrency primitive like sem_wait, it is still processing queued calls from other threads. So there is no need to first suspend the worker threads and then suspend the main thread. The implementation of two-phase suspend had a bug where it would suspend worker threads twice, making the suspend increase by 2. Since resume only decremented the count by 1, this lead to a suspend count overflow. Fixes [wasm-mt] Sampling thread requests resume for a thread with !(suspend_count > 0) #72857
Once the diagnostic server attaches to the runtime, switch it to GC Safe mode when it returns to JavaScript. That is, while the diagnostic server is reacting to messages in the JS event loop, it is considered suspended by the runtime. When it calls into C, switch to GC Unsafe (which may block if there's a STW happening). Add thread state transitions when we come back to C, and when we wait.
Mark the wasm diagnostic server thread as "no sample; no gc" which means that we don't consider it for STW when there's a GC or a sample profiler active. This is how we treat utility threads (including the non-wasm diagnostic server thread) on other platforms.
Fix a bad signature for cwraps.mono_wasm_event_pipe_enable due to a mistake in a previous merge
Added a new browser-threads sample

ghost · 2022-08-03T15:48:12Z

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

Grab bag of threading fixes:

Remove the coop two-phase transition (Partially revert 6726fae). This was based on a misunderstanding of how Emscripten works: when the main thread is blocked in a concurrency primitive like sem_wait, it is still processing queued calls from other threads. So there is no need to first suspend the worker threads and then suspend the main thread. The implementation of two-phase suspend had a bug where it would suspend worker threads twice, making the suspend increase by 2. Since resume only decremented the count by 1, this lead to a suspend count overflow. Fixes [wasm-mt] Sampling thread requests resume for a thread with !(suspend_count > 0) #72857
Once the diagnostic server attaches to the runtime, switch it to GC Safe mode when it returns to JavaScript. That is, while the diagnostic server is reacting to messages in the JS event loop, it is considered suspended by the runtime. When it calls into C, switch to GC Unsafe (which may block if there's a STW happening). Add thread state transitions when we come back to C, and when we wait.
Mark the wasm diagnostic server thread as "no sample; no gc" which means that we don't consider it for STW when there's a GC or a sample profiler active. This is how we treat utility threads (including the non-wasm diagnostic server thread) on other platforms.
Fix a bad signature for cwraps.mono_wasm_event_pipe_enable due to a mistake in a previous merge

Author:	lambdageek
Assignees:	lambdageek
Labels:	`arch-wasm`, `area-Threading-mono`
Milestone:	-

lambdageek · 2022-08-03T20:11:16Z

Screen.Recording.2022-08-03.at.16.09.18.mov

lambdageek · 2022-08-03T20:14:18Z

/azp run runtime-wasm

azure-pipelines · 2022-08-03T20:14:34Z

Azure Pipelines successfully started running 1 pipeline(s).

lambdageek · 2022-08-03T23:58:10Z

/azp run runtime-wasm

azure-pipelines · 2022-08-03T23:58:19Z

Azure Pipelines successfully started running 1 pipeline(s).

lambdageek · 2022-08-04T14:17:41Z

/azp run runtime-wasm

azure-pipelines · 2022-08-04T14:17:50Z

Azure Pipelines successfully started running 1 pipeline(s).

radical · 2022-08-04T15:44:08Z

Wasm/library tests log:

ChromeDriver was started successfully.
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:107546 "MONO_WASM: onConfigLoaded() failed: {}"
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:107597 "MONO_WASM: Stacktrace: \n"
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:107637 "TypeError: Cannot set properties of undefined (setting 'DOTNET_MODIFIABLE_ASSEMBLIES')\n    at Object.onConfigLoaded (http://127.0.0.1:36645/main.js:16:78)\n    at vc (http://127.0.0.1:36645/dotnet.js:3:115444)\n    at async oc (http://127.0.0.1:36645/dotnet.js:3:106316)\n    at async http://127.0.0.1:36645/dotnet.js:3:104090"
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:15956 "Failed to load config file ./mono-config.json TypeError: Cannot set properties of undefined (setting 'DOTNET_MODIFIABLE_ASSEMBLIES')"
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:15956 "Failed to load config file ./mono-config.json TypeError: Cannot set properties of undefined (setting 'DOTNET_MODIFIABLE_ASSEMBLIES')"
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:17377 Uncaught (in promise)
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:17377 Uncaught (in promise)
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:17377 Uncaught (in promise)
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:17377 Uncaught (in promise)
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:17377 Uncaught (in promise)
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:17377 Uncaught (in promise)
[15:03:50] fail: [out of order message from the browser]: http://127.0.0.1:36645/dotnet.js 2:104195 Uncaught (in promise)
[15:18:50] info: Waiting to flush log messages with a timeout of 120 secs ..
[15:18:50] fail: Flushing log messages failed with: System.OperationCanceledException: The operation was canceled.
                    at System.Threading.Channels.AsyncOperation`1.GetResult(Int16 token)
                    at System.Threading.Channels.ChannelReader`1.ReadAllAsync(CancellationToken cancellationToken)+MoveNext()
                    at System.Threading.Channels.ChannelReader`1.ReadAllAsync(CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
                    at Microsoft.DotNet.XHarness.CLI.Commands.Wasm.WasmTestMessagesProcessor.RunAsync(CancellationToken token) in /_/src/Microsoft.DotNet.XHarness.CLI/Commands/WASM/WasmTestMessagesProcessor.cs:line 66
                    at Microsoft.DotNet.XHarness.CLI.Commands.Wasm.WasmTestMessagesProcessor.RunAsync(CancellationToken token) in /_/src/Microsoft.DotNet.XHarness.CLI/Commands/WASM/WasmTestMessagesProcessor.cs:line 66
                    at Microsoft.DotNet.XHarness.CLI.Commands.Wasm.WasmTestMessagesProcessor.CompleteAndFlushAsync(Nullable`1 timeout) in /_/src/Microsoft.DotNet.XHarness.CLI/Commands/WASM/WasmTestMessagesProcessor.cs:line 106. Ignoring.
[15:18:50] fail: Tests timed out. Killing driver service pid 11169
[15:18:50] fail: Application has finished with exit code TIMED_OUT but 42 was expected

@pavelsavara Is this the one you were fixing?
The xharness exception can be ignored, I have a fix for it locally too.

radical · 2022-08-04T15:56:22Z

AOT:

[15:35:02] info: Starting:    System.Collections.Tests.dll
[15:36:25] info: 219168800
[15:36:25] info: /datadisks/disk1/work/A2AC0902/w/A97C0939/e/wasm_build/AppBundle/dotnet.js:1881: 211542912
[15:36:25] info:       throw ptr;
[15:36:25] info:       ^
[15:36:25] info: 
[15:36:25] info: Process v8 exited with 1

Each call to begin_suspend_request_suspension_cordially may increment the suspend count. But in STW we only resume each thread once. In 6726fae we added a second phase of STW to full coop on WebAssembly in order to suspend the browser thread after all the worker threads have been suspended in order to avoid some deadlocks that rely on the main thread continuing to process async work on behalf of the workers before they reach a safepoint. The problem is that for worker threads we could end up calling begin_suspend_request_suspension_cordially twice. If the thread self-suspends after the first call, the second call will increment the suspend count. As a result, when we restart the world, the thread will decrement its suspend count, but still stay suspended. Worse, on the _next_ stw, we will increment the suspend count two more times and decrement once on the next restart, etc. Eventually the thread will overflow the suspend counter and we will assert `!(suspend_count > 0)`. Also change `THREAD_SUSPEND_COUNT_MAX` to `0x7F` (from `0xFF`) - the suspend count is signed, so the roll-over from 127 to -128 is where we should assert Fixes dotnet#72857

include thread states and thread ids where available

…twice" This reverts commit 92f52ab7ed1cfaa1a4f66e869a8d9404e066f1b2.

Remove mono_threads_platform_stw_defer_initial_suspend The motivation for it in 6726fae was unfounded. There is no need to suspend the main browser thread after the other threads: suspension on wasm uses `sem_wait` which on Emscripten on the main thread is implemented using a busy wait `__timedwait_cp` which processes queued calls. So even if we suspend the main thread first, it will still allow other threads in GC Safe to make progress if they're using syscalls.

…_GC flag The diagnostic server worker spends most of its time in the JS event loop waiting for messages. After we attach to the runtime, we need to switch to GC Safe mode because the diagnostic server may not ever reach a safepoint (for example if no more DS events arrive). Conversely, when we call from JS into the C diagnostic server, we need to enter GC Unsafe mode (and potentially safepoint). Also mark the diagnostic server threads with the NO_GC flag - this thread does not manipulate managed objects so it doesn't need to stop for GC STW.

Mistake from a previous merge

lambdageek · 2022-08-04T19:33:23Z

AOT:

when I repro this locally, I get a stack overflow:

https://gist.github.com/lambdageek/8c07263665415982c3b793150a1626b4

(I ran the tests with dotnet.sh build /t:Test /p:TargetOS=Browser /p:TargetArchitecture=Wasm /p:RunAOTCompilation=true)

Checking main...

Update

main didn't crash. But neither did my PR when I re-ran it. it's some kind of non-deterministic issue

src/mono/mono/component/diagnostics_server.c

src/mono/mono/utils/mono-threads-state-machine.c

kg

I don't know the threading infra well enough to firmly approve the logic changes here, but the diffs all look clean to me and it makes sense

Co-authored-by: Katelyn Gadd <kg@luminance.org>

lambdageek · 2022-08-05T11:54:46Z

/azp run runtime-wasm

azure-pipelines · 2022-08-05T11:54:56Z

Azure Pipelines successfully started running 1 pipeline(s).

lambdageek · 2022-08-05T19:50:23Z

Build Browser wasm Linux Release LibraryTests_AOT throw ptr failure is known

lambdageek requested review from lewing, pavelsavara, vargaz, BrzVlad, naricc and marek-safar as code owners August 3, 2022 15:31

dotnet-issue-labeler bot added the area-VM-threading-mono label Aug 3, 2022

ghost assigned lambdageek Aug 3, 2022

lambdageek added the arch-wasm WebAssembly architecture label Aug 3, 2022

lambdageek requested a review from kg August 3, 2022 15:48

lambdageek force-pushed the wasm-mt-fix-two-phase-coop-suspend branch from f7dc809 to 0c72fe8 Compare August 3, 2022 23:55

lambdageek force-pushed the wasm-mt-fix-two-phase-coop-suspend branch from 0c72fe8 to 866504e Compare August 4, 2022 13:50

lambdageek mentioned this pull request Aug 4, 2022

[wasm-ep] Use DOTNET_DiagnosticPorts to configure Diagnostic Server #73370

Merged

lambdageek added 7 commits August 4, 2022 12:50

improve thread state machine assertion messages

b4de524

include thread states and thread ids where available

fix typo

114fcf5

Revert "[coop] Don't call begin_suspend_request_suspension_cordially …

0714e24

…twice" This reverts commit 92f52ab7ed1cfaa1a4f66e869a8d9404e066f1b2.

cwraps: fix bad signature for mono_wasm_event_pipe_enable

f0b10bb

Mistake from a previous merge

lambdageek and others added 3 commits August 4, 2022 12:50

Add new browser-threads sample

754ca79

exclude the browser-threads sample unless wasm threads are enabled

13e5e1c

Update browser-threads sample

8a8ab6c

lambdageek force-pushed the wasm-mt-fix-two-phase-coop-suspend branch from 866504e to 8a8ab6c Compare August 4, 2022 20:08

kg reviewed Aug 5, 2022

View reviewed changes

src/mono/mono/component/diagnostics_server.c Outdated Show resolved Hide resolved

kg reviewed Aug 5, 2022

View reviewed changes

src/mono/mono/utils/mono-threads-state-machine.c Outdated Show resolved Hide resolved

kg approved these changes Aug 5, 2022

View reviewed changes

lambdageek and others added 2 commits August 5, 2022 07:52

Update src/mono/mono/component/diagnostics_server.c

d731c82

Co-authored-by: Katelyn Gadd <kg@luminance.org>

Update src/mono/mono/utils/mono-threads-state-machine.c

6f07849

Co-authored-by: Katelyn Gadd <kg@luminance.org>

lambdageek merged commit 2ad7b1a into dotnet:main Aug 5, 2022

CarnaViire mentioned this pull request Aug 24, 2022

[QUIC] Abort on cancellation throws QUIC_STATUS_INVALID_PARAMETER #73688

Closed

ghost locked as resolved and limited conversation to collaborators Sep 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[wasm-mt] Remove two-phase suspend; fix state transitions in DS server; fix merge #73305

[wasm-mt] Remove two-phase suspend; fix state transitions in DS server; fix merge #73305

lambdageek commented Aug 3, 2022 •

edited

Loading

ghost commented Aug 3, 2022

lambdageek commented Aug 3, 2022

lambdageek commented Aug 3, 2022

azure-pipelines bot commented Aug 3, 2022

lambdageek commented Aug 3, 2022

azure-pipelines bot commented Aug 3, 2022

lambdageek commented Aug 4, 2022

azure-pipelines bot commented Aug 4, 2022

radical commented Aug 4, 2022

radical commented Aug 4, 2022

lambdageek commented Aug 4, 2022 •

edited

Loading

kg left a comment

lambdageek commented Aug 5, 2022

azure-pipelines bot commented Aug 5, 2022

lambdageek commented Aug 5, 2022

[wasm-mt] Remove two-phase suspend; fix state transitions in DS server; fix merge #73305

[wasm-mt] Remove two-phase suspend; fix state transitions in DS server; fix merge #73305

Conversation

lambdageek commented Aug 3, 2022 • edited Loading

ghost commented Aug 3, 2022

lambdageek commented Aug 3, 2022

lambdageek commented Aug 3, 2022

azure-pipelines bot commented Aug 3, 2022

lambdageek commented Aug 3, 2022

azure-pipelines bot commented Aug 3, 2022

lambdageek commented Aug 4, 2022

azure-pipelines bot commented Aug 4, 2022

radical commented Aug 4, 2022

radical commented Aug 4, 2022

lambdageek commented Aug 4, 2022 • edited Loading

kg left a comment

Choose a reason for hiding this comment

lambdageek commented Aug 5, 2022

azure-pipelines bot commented Aug 5, 2022

lambdageek commented Aug 5, 2022

lambdageek commented Aug 3, 2022 •

edited

Loading

lambdageek commented Aug 4, 2022 •

edited

Loading