Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Known samples for AndroidEnableMarshalMethods=true #8253

Closed
jonathanpeppers opened this issue Aug 8, 2023 · 12 comments · Fixed by #9343
Closed

Known samples for AndroidEnableMarshalMethods=true #8253

jonathanpeppers opened this issue Aug 8, 2023 · 12 comments · Fixed by #9343
Assignees
Labels
Area: App+Library Build Issues when building Library projects or Application projects. Area: App Runtime Issues in `libmonodroid.so`. bug Component does not function as intended.
Milestone

Comments

@jonathanpeppers
Copy link
Member

jonathanpeppers commented Aug 8, 2023

Android application type

.NET Android (net7.0-android, etc.)

Affected platform version

.NET 8 Preview 6

Description

We have turned off AndroidEnableMarshalMethods for .NET 8 due to too many issues surfaced during public testing. We plan on resolving those issues so it can be turned on for .NET 9.

This is just an issue of known examples that fail at runtime when AndroidEnableMarshalMethods=true:

Just linking them here for us to test later in future releases.

Note that even though we have closed the "duplicate" issues, they still represent actual known samples that should be considered for testing any marshal method fixes.

Steps to Reproduce

See links above.

Did you find any workaround?

No response

Relevant log output

No response

@tranb3r
Copy link

tranb3r commented Jul 23, 2024

@grendello
Why did you enable marshal methods by default in net9-pre6? #8925
Did you test the known samples before releasing this feature again?
My sample #8147 is broken in net9-pre6, and once again, I have to manually disable marshal methods.

@grendello
Copy link
Contributor

@tranb3r This is a very useful feature which gives a noticeable performance increase for huge majority of apps. The only known bad scenario we had was with Blazor apps, and those disable marshal methods by default. We enabled them in preview precisely for the reason of finding what else breaks and add it to the "disable" list if we can identify the issue. Note that marshal methods themselves are fine, the bug doesn't lie in their implementation as far as we can tell (we've had them around for nearly two years, they essentially implement in native code what is implemented in managed code when they are disabled) - the sole fact that they are that much faster triggers a race condition (at least in Blazor) of some sort. We haven't been able to determine what exactly trips up Blazor. If your sample has a pattern we can identify, we'd add it to the "disabled" list.

For the gain in 90% (or more) of apps out there, it is worth enabling the feature by default and disable it explicitly (or automatically, if the pattern is identified as explained above) for the remaining 10% (or less) apps.

@grendello
Copy link
Contributor

@tranb3r as far as testing is concerned, we have hundreds of thousands of tests that run on CI and none of them fail with marshal methods. Without enabling the feature by default, we won't ever be able to gather information about what doesn't work "out there".

@grendello
Copy link
Contributor

@tranb3r have you seen the problem with anything other than Plugin.Fingerprint?

@tranb3r
Copy link

tranb3r commented Jul 23, 2024

For the gain in 90% (or more) of apps out there, it is worth enabling the feature by default and disable it explicitly (or automatically, if the pattern is identified as explained above) for the remaining 10% (or less) apps.

Sure.
However, considering I'm facing at least 2 different issues in my app, caused by this feature, I'm a bit concerned.
Race condition usually occur with complex scenarios, that are difficult to reproduce and integrate in CI.

As far as testing is concerned, we have hundreds of thousands of tests that run on CI and none of them fail with marshal methods.

This is why I'm asking if the samples in this issue have been tested.
My sample is failing, so I suspect this one at least hasn't been investigated/tested.

Have you seen the problem with anything other than Plugin.Fingerprint?

Yes. ANR when doing http requests. However I haven't been able to create a simple repro for it.

@bcaceiro
Copy link

This option should only be used in .net9 previews right? It is supposed to be somewhat broken on .net8? Because I have tried in the previous maui - sr release and also had some issues, and disabled it.

@grendello
Copy link
Contributor

@bcaceiro it was present in net8 but was off by default, correct.

@grendello
Copy link
Contributor

grendello commented Jul 23, 2024

For the gain in 90% (or more) of apps out there, it is worth enabling the feature by default and disable it explicitly (or automatically, if the pattern is identified as explained above) for the remaining 10% (or less) apps.

Sure. However, considering I'm facing at least 2 different issues in my app, caused by this feature, I'm a bit concerned. Race condition usually occur with complex scenarios, that are difficult to reproduce and integrate in CI.

Superficial simplicity is unfortunately par for the course in a lot of .NET. One example of that is Blazor apps, another is async/await or LINQ. There is an awful lot of asynchronous or simply complicated code in those features that can fail under some circumstances without any obvious signals as to what's going on. Your sample uses an external service which definitely can introduce threading and/or race condition scenarios. That said, it's much, much simpler than the Blazor "hello world" app that we were investigating. I need to bring some tools up to date and hopefully your sample will give a clue as to what's failing and why.

As far as testing is concerned, we have hundreds of thousands of tests that run on CI and none of them fail with marshal methods.

This is why I'm asking if the samples in this issue have been tested. My sample is failing, so I suspect this one at least hasn't been investigated/tested.

The sad truth is that we don't have much time for manual testing. However, product previews are exactly where we can hope people like you, early testers, will discover issues and let us know about them. Alas, there are very few people following your footsteps and sometimes we learn about issues very late in the game. If we discover that marshal methods cause more trouble than we thought, we're going to disable them by default before the release and hopefully re-enable for .NET 10 - the feature is too valuable to just drop it (and we have more optimizations in mind that build on top of it).

Have you seen the problem with anything other than Plugin.Fingerprint?

Yes. ANR when doing http requests. However I haven't been able to create a simple repro for it.

That's another scenario involving threads and synchronization, open for race conditions. Hopefully all of them are caused by the same issue and maybe your sample app will help us pin point the problem and swat it for good.

@filipnavara
Copy link
Member

filipnavara commented Sep 28, 2024

Here's a significantly reduced repro for the deadlock: https://github.com/filipnavara/mm-deadlock-repro. There's no MAUI or Blazor, just 50ish lines of .NET Android code in an empty template.

Essentially it boils down to the .NET GC being dead-locked / live-locked in the stop-the-world stage. I have seen several patterns of this.

Just run it in Release mode and it will immediately freeze before getting to display the "Hello" page.

Once the app is in frozen state, the main thread is waiting in Android code in MediaPlayer.Prepare(). Meanwhile, the media player loads the data in separate thread which calls back StreamMediaDataSource.ReadAt. This intentionally allocates empty arrays to trigger GCs to make the repro deterministic. In real world this allocation will happen at the marshalling level for byte[] arrays and trigger the same code path. The GC will deterministically freeze in the stop-the-world stage. It seems to wait for one of the threads, either the one executing MediaPlayer.Prepare() (less likely) or a random Java thread pool thread not running any .NET code at the moment (more likely).

In an unreduced repro, when running under debugger, there was a native exception in art_quick_generic_jni_trampoline prior to getting into the broken state. This may have caused some corruption that ultimately puts the Mono thread list in inconsistent state. I haven't been able to verify if this also happens in the reduced repro due to some struggle with LLDB.

UPD: Here’s a log with all the thread state transitions in MonoVM: https://gist.github.com/filipnavara/1f2e220063022cefe1375b7c87efa1ee. Seems like some code is initially run on one of the Java thread pool threads and that ends up with a wrong state. Next .NET GC from another Java thread pool thread fails to stop the world.

UPD2: Fixed trace with names in MONO_STACKDATA: https://gist.github.com/filipnavara/fd997e3e9879c072935bde866ef51aa8

@filipnavara
Copy link
Member

I have narrowed down the thread state transition issue to this:

09-29 09:38:40.655  7970  7985 E THREAD  : attaching 0xb4000073ec62c9a0
09-29 09:38:40.655  7970  7985 E THREAD  : registering info 0xb4000073ec62c9a0 tid 0x726597f730 small id 43
09-29 09:38:40.655  7970  7985 E THREAD  : [ATTACH][0x726597f730] STARTING . -> RUNNING . (0 -> 0) 
09-29 09:38:40.655  7970  7985 E THREAD  : [DO_BLOCKING][0x726597f730] RUNNING . -> STATE_BLOCKING . (0 -> 0) mono_thread_info_suspend_lock_with_info
09-29 09:38:40.655  7970  7985 E THREAD  : [DONE_BLOCKING][0x726597f730] STATE_BLOCKING . -> RUNNING . (0 -> 0) mono_thread_info_suspend_lock_with_info
09-29 09:38:40.655  7970  7985 E THREAD  : [ABORT_BLOCKING][0x726597f730] RUNNING . -> RUNNING . (0 -> 0) mono_class_get
09-29 09:38:40.656  7970  7985 E THREAD  : [ABORT_BLOCKING][0x726597f730] RUNNING . -> RUNNING . (0 -> 0) mono_method_get_unmanaged_callers_only_ftnptr
09-29 09:38:40.656  7970  7985 E THREAD  : [ABORT_BLOCKING][0x726597f730] RUNNING . -> RUNNING . (0 -> 0) mono_threads_attach_coop
09-29 09:38:40.656  7970  7985 E THREAD  : XX: mono_threads_attach_coop_internal external 0 cookie 0x0

At that point the thread pool thread is attached to the runtime. Yet the very next call to mono_threads_attach_coop_internal, ie. entering the UnmanagedCallersOnly method, fails to identify the thread as external. This causes the transition cookie to be NULL, and exiting the call with mono_threads_detach_coop becomes no-op which leaves the thread in a wrong state.

@filipnavara
Copy link
Member

The sequence above is result of this:

// We need to do that, as Mono APIs cannot be invoked from threads that aren't attached to the runtime.
mono_thread_attach (mono_get_root_domain ());
MonoImage *image = MonoImageLoader::get_from_index (mono_image_index);
MarshalMethodsManagedClass &klass = marshal_methods_class_cache[class_index];
if (klass.klass == nullptr) {
klass.klass = image != nullptr ? mono_class_get (image, klass.token) : nullptr;
}
MonoMethod *method = klass.klass != nullptr ? mono_get_method (image, method_token, klass.klass) : nullptr;
MonoError error;
void *ret = method != nullptr ? mono_method_get_unmanaged_callers_only_ftnptr (method, &error) : nullptr;

get_function_pointer knowingly calls mono_thread_attach which attaches the thread pool thread to Mono runtime. It never detaches it, which leads to an incorrect state of the thread pool threads.

@filipnavara
Copy link
Member

So, the simple fix is to replace mono_thread_attach with the deprecated mono_jit_thread_attach API. Unfortunately, mono_threads_attach_coop/mono_threads_detach_coop was never added to the public API headers and neither was mono_threads_enter_gc_unsafe_region/mono_threads_exit_gc_unsafe_region. These would be the proper way to fix it, I think.

Notably, mono_jit_thread_attach is still the way used by xamarin-macios.

cc @lambdageek

jonpryor pushed a commit that referenced this issue Sep 30, 2024
Context: #8253
Context: #9343

We've had a number of reports about *hangs* when LLVM Marshal Methods
are enabled (see also #8253), but we had been willing
to accept having LLVM Marshal Methods enabled by default as it helps
improve app startup time.

However, [a customer reported a native crash][0] when LLVM Marshal
Methods are enabled:

> ![Thread Stack Trace][1]

	syscall at line 28 within libc
	__futex_wait_ex(void volatile*, bool, int, bool, timespec const*) at line 144 within libc
	sem_wait at line 108 within libc
	within libmonosgen-2.0.so (Build Id: …)
	within <anonymous:…>

This native crash suggests "something going wrong" within MonoVM.
We find this "scary" enough that we feel it is once again prudent
to disable LLVM Marshal Methods by default in .NET 9, by setting the
default value of `$(AndroidEnableMarshalMethods)`=False.

Even though #9343 has a potential fix for the hang, we feel that
we're too close to .NET 9 GA to validate in the wild.  This feature
will need to wait for .NET 10.  🙁

To explicitly enable LLVM Marshal Methods, set
`$(AndroidEnableMarshalMethods)`=True in your App `.csproj`.

[0]: https://discord.com/channels/732297728826277939/732297837953679412/1288758900627345463
[1]: https://cdn.discordapp.com/attachments/732297837953679412/1288758900522483712/05F6ED48-2BBA-4FC0-A54F-81BB57606500.png?ex=66f659c1&is=66f50841&hm=8f6f78f283f8c03e9fb37a452ad5d0babaaad3dd4bcc1b4887f4dcf60f651c12&
jonathanpeppers pushed a commit to jonathanpeppers/xamarin-android that referenced this issue Sep 30, 2024
Context: dotnet#8253
Context: dotnet#9343

We've had a number of reports about *hangs* when LLVM Marshal Methods
are enabled (see also dotnet#8253), but we had been willing
to accept having LLVM Marshal Methods enabled by default as it helps
improve app startup time.

However, [a customer reported a native crash][0] when LLVM Marshal
Methods are enabled:

> ![Thread Stack Trace][1]

	syscall at line 28 within libc
	__futex_wait_ex(void volatile*, bool, int, bool, timespec const*) at line 144 within libc
	sem_wait at line 108 within libc
	within libmonosgen-2.0.so (Build Id: …)
	within <anonymous:…>

This native crash suggests "something going wrong" within MonoVM.
We find this "scary" enough that we feel it is once again prudent
to disable LLVM Marshal Methods by default in .NET 9, by setting the
default value of `$(AndroidEnableMarshalMethods)`=False.

Even though dotnet#9343 has a potential fix for the hang, we feel that
we're too close to .NET 9 GA to validate in the wild.  This feature
will need to wait for .NET 10.  🙁

To explicitly enable LLVM Marshal Methods, set
`$(AndroidEnableMarshalMethods)`=True in your App `.csproj`.

[0]: https://discord.com/channels/732297728826277939/732297837953679412/1288758900627345463
[1]: https://cdn.discordapp.com/attachments/732297837953679412/1288758900522483712/05F6ED48-2BBA-4FC0-A54F-81BB57606500.png?ex=66f659c1&is=66f50841&hm=8f6f78f283f8c03e9fb37a452ad5d0babaaad3dd4bcc1b4887f4dcf60f651c12&
jonathanpeppers added a commit that referenced this issue Sep 30, 2024
…9346)

Context: #8253
Context: #9343

We've had a number of reports about *hangs* when LLVM Marshal Methods
are enabled (see also #8253), but we had been willing
to accept having LLVM Marshal Methods enabled by default as it helps
improve app startup time.

However, [a customer reported a native crash][0] when LLVM Marshal
Methods are enabled:

> ![Thread Stack Trace][1]

	syscall at line 28 within libc
	__futex_wait_ex(void volatile*, bool, int, bool, timespec const*) at line 144 within libc
	sem_wait at line 108 within libc
	within libmonosgen-2.0.so (Build Id: …)
	within <anonymous:…>

This native crash suggests "something going wrong" within MonoVM.
We find this "scary" enough that we feel it is once again prudent
to disable LLVM Marshal Methods by default in .NET 9, by setting the
default value of `$(AndroidEnableMarshalMethods)`=False.

Even though #9343 has a potential fix for the hang, we feel that
we're too close to .NET 9 GA to validate in the wild.  This feature
will need to wait for .NET 10.  🙁

To explicitly enable LLVM Marshal Methods, set
`$(AndroidEnableMarshalMethods)`=True in your App `.csproj`.

[0]: https://discord.com/channels/732297728826277939/732297837953679412/1288758900627345463
[1]: https://cdn.discordapp.com/attachments/732297837953679412/1288758900522483712/05F6ED48-2BBA-4FC0-A54F-81BB57606500.png?ex=66f659c1&is=66f50841&hm=8f6f78f283f8c03e9fb37a452ad5d0babaaad3dd4bcc1b4887f4dcf60f651c12&

Co-authored-by: Marek Habersack <grendel@twistedcode.net>
jonathanpeppers pushed a commit that referenced this issue Sep 30, 2024
)

Fixes: #8253

`get_function_pointer` can be called from any thread. In particular,
it can be called from Android thread pool thread when method
marshaling is used. We need to ensure those threads stay in the GC
safe mode from the runtime perspective once we stop executing user
code on those threads. `mono_thread_attach` switches the thread to GC
unsafe mode and it stays there after we stop executing user code. The
next GC will try to suspend the thread which cannot succeed since
there's no cooperative runtime code running on that thread anymore.

The fix is to use `mono_jit_thread_attach` which mimics what the
iOS/macOS interop does. It creates the necessary MonoVM thread
structures for JIT to work but doesn't transition the thread state.
jonathanpeppers pushed a commit that referenced this issue Sep 30, 2024
)

Fixes: #8253

`get_function_pointer` can be called from any thread. In particular,
it can be called from Android thread pool thread when method
marshaling is used. We need to ensure those threads stay in the GC
safe mode from the runtime perspective once we stop executing user
code on those threads. `mono_thread_attach` switches the thread to GC
unsafe mode and it stays there after we stop executing user code. The
next GC will try to suspend the thread which cannot succeed since
there's no cooperative runtime code running on that thread anymore.

The fix is to use `mono_jit_thread_attach` which mimics what the
iOS/macOS interop does. It creates the necessary MonoVM thread
structures for JIT to work but doesn't transition the thread state.
grendello added a commit that referenced this issue Oct 1, 2024
dellis1972 pushed a commit that referenced this issue Oct 2, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Nov 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Area: App+Library Build Issues when building Library projects or Application projects. Area: App Runtime Issues in `libmonodroid.so`. bug Component does not function as intended.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants