Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix init race in mono_class_try_get_[shortname]_class. #112282

Merged

Conversation

lateralusX
Copy link
Member

@lateralusX lateralusX commented Feb 7, 2025

We observed several crash reports in dotnet Android coming from google play store, like this one:

#109921 (comment)

and it happens at the following callstack:

mono_class_setup_vtable_full at /__w/1/s/src/mono/mono/metadata/class-setup-vtable.c:900
init_io_stream_slots at /__w/1/s/src/mono/mono/metadata/icall.c:3378
ves_icall_System_IO_Stream_HasOverriddenBeginEndRead at /__w/1/s/src/mono/mono/metadata/icall.c:3437
 (inlined by) ves_icall_System_IO_Stream_HasOverriddenBeginEndRead_raw at /__w/1/s/src/mono/mono/metadata/../metadata/icall-def.h:228

Looking a little deeper at that that code path expects call to mono_class_try_get_stream_class to succeed since it calls mono_class_setup_vtable on returned class pointer. There are other places where code expectes mono_class_try_get_[shortname]_class to always succeed or it will assert. Other locations handles the case where it returns NULL meaning that the class has been linked out.

After looking into the macro implementation generating the mono_class_try_get_[shortname]_class functions it appears that it has a race where one thread could return NULL even if the class successfully gets loaded by another racing thread. Initially that might have been intentionally and callers would need to redo the call, but then there is no way for callers to distinct between hitting the race and class not available. Adding to the fact that code would also expect that some critical classes never fail loading, like what init_io_stream_slots does, this race ended up in crashes.

In the past, this particular race in init_io_stream_slots was hidden due to multiple calls to mono_class_try_get_stream_class minimizing the risk of a race to cause a crash. That implementation was however changed by e74da7c ending up in just one call to mono_class_try_get_stream_class in the critical paths.

Since code relies on mono_class_try_get_[shortname]_class to succeed for some critical classes, code asserts if it returns NULL or crashes. The fix will use acquire/release semantics on the static variable guarding the init of the cached class and make sure memory order is preserved on both read and write. Previous implementation adopted a similar approach but it took a copy of the static tmp_class into a local variable meaning that memory order would not be guaranteed, even if it used memory barriers and volatile keywords on the static variables.

Fix also adds an assert into init_io_stream_slots since it expects failures in calls to mono_class_try_get_stream_class as fatal.

We observed several crash reports in dotnet Android coming from
google play store, like this one:

dotnet#109921 (comment)

and it happens at the following callstack:

mono_class_setup_vtable_full at /__w/1/s/src/mono/mono/metadata/class-setup-vtable.c:900
init_io_stream_slots at /__w/1/s/src/mono/mono/metadata/icall.c:3378
ves_icall_System_IO_Stream_HasOverriddenBeginEndRead at /__w/1/s/src/mono/mono/metadata/icall.c:3437
 (inlined by) ves_icall_System_IO_Stream_HasOverriddenBeginEndRead_raw at /__w/1/s/src/mono/mono/metadata/../metadata/icall-def.h:228

Looking a little deeper at that that code path expects call to
mono_class_try_get_stream_class to succeed since it calls
mono_class_setup_vtable on returned class pointer. There are other
places where code expectes mono_class_try_get_[shortname]_class to
always succeed or it will assert. Other locations handles the case
where it returns NULL meaning that the class has been linked out.

After looking into the macro implementation generating the
mono_class_try_get_[shortname]_class functions it appears that it has
a race where one thread could return NULL even if the class successfully
gets loaded by another racing thread. Initially that might have
been intentionally and callers would need to redo the call, but then
there is no way for callers to distinct between hitting the race and
class not available. Adding to the fact that code would also expect that
some critical classes never fail loading, like what
init_io_stream_slots does, this race ended up in race conditions.

In the past, this particular race in init_io_stream_slots was hidden
due to multiple calls to mono_class_try_get_stream_class minimzing the
risk of a race to cause a crash. That implementation was however
changed by dotnet@e74da7c
ending up in just one call to mono_class_try_get_stream_class in the
critical paths.

Since code relies on mono_class_try_get_[shortname]_class to succeed
for some critical classes, code asserts if it returns NULL or crashes.
The fix will use acquire/release semantics on the static variable
guarding the cached type and make sure memory order is preserved
on both read and write. Previous implementation adopted a similar approach
but it took a copy of the static tmp_class into a local variable meaning
that memory order would not be guaranteed, even if it applied memory
barriers and volatile keywords on the static variables.

Fix also adds an assert into init_io_stream_slots since it expects
failures in calls to mono_class_try_get_stream_class as fatal.
@steveisok steveisok requested a review from kg February 7, 2025 20:05
Copy link
Member

@kg kg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find

@srxqds
Copy link
Contributor

srxqds commented Feb 8, 2025

backport to release/9.x?

@steveisok
Copy link
Member

backport to release/9.x?

Yes

@steveisok
Copy link
Member

/backport to release/9.0

Copy link
Contributor

github-actions bot commented Feb 8, 2025

Started backporting to release/9.0: https://github.com/dotnet/runtime/actions/runs/13211416115

@steveisok
Copy link
Member

/backport to release/8.0

Copy link
Contributor

github-actions bot commented Feb 8, 2025

Started backporting to release/8.0: https://github.com/dotnet/runtime/actions/runs/13211423655

grendello added a commit to grendello/runtime that referenced this pull request Feb 10, 2025
* main: (41 commits)
  Automated bump of chrome version (dotnet#112309)
  Add `GetDeclaringType` to `PropertyDefinition` and `EventDefinition`. (dotnet#111646)
  Update the System.ComponentModel.Annotations solution to build in VS (dotnet#112313)
  JIT: initial support for stack allocating arrays of GC type (dotnet#112250)
  [main] Update dependencies from dotnet/roslyn (dotnet#112260)
  Update Xcode casing (dotnet#112307)
  update the location of assert for REG_ZR check (dotnet#112294)
  Enable `SA1206`: Keyword ordering (dotnet#112303)
  Address feedback on dense FrozenDictionary optimization (dotnet#112298)
  Start regular pri-1 tests runs with native AOT (dotnet#111391)
  Observe exceptions from _connectionCloseTcs (dotnet#112190)
  Test failure - SendAsync_RequestVersion20_ResponseVersion20 (dotnet#112232)
  Fix init race in mono_class_try_get_[shortname]_class. (dotnet#112282)
  Remove repeated call to DllMain (dotnet#112285)
  Replace bitvector.h/cpp with ptrArgTP type in gc_unwind_x86.h/inl (dotnet#112268)
  JIT: Limit 3-opt to 1000 swaps per run (dotnet#112259)
  [main] Update dependencies from dotnet/icu, dotnet/runtime-assets (dotnet#112120)
  Update dependencies from https://github.com/dotnet/emsdk build 20250205.3 (dotnet#112223)
  Fix EventPipe on Android CoreClr. (dotnet#112270)
  Fix exception handling in the prestub worker (dotnet#111937)
  ...
@alight-rajeshrathod
Copy link

alight-rajeshrathod commented Feb 11, 2025

Can someone please confirm when this fix will be available to use in the project for dotnet 8.x release?

@steveisok
Copy link
Member

Can someone please confirm when this fix will be available to use in the project for dotnet 8.x release?

Seeing it hasn't been taken into .NET 8 yet I cannot project when. Once I have more info I can let you know.

steveisok pushed a commit that referenced this pull request Feb 12, 2025
Backport of #112282 to release/9.0

We observed several crash reports in dotnet Android coming from
google play store, like this one:

#109921 (comment)

and it happens at the following callstack:

mono_class_setup_vtable_full at /__w/1/s/src/mono/mono/metadata/class-setup-vtable.c:900
init_io_stream_slots at /__w/1/s/src/mono/mono/metadata/icall.c:3378
ves_icall_System_IO_Stream_HasOverriddenBeginEndRead at /__w/1/s/src/mono/mono/metadata/icall.c:3437
 (inlined by) ves_icall_System_IO_Stream_HasOverriddenBeginEndRead_raw at /__w/1/s/src/mono/mono/metadata/../metadata/icall-def.h:228

Looking a little deeper at that that code path expects call to
mono_class_try_get_stream_class to succeed since it calls
mono_class_setup_vtable on returned class pointer. There are other
places where code expectes mono_class_try_get_[shortname]_class to
always succeed or it will assert. Other locations handles the case
where it returns NULL meaning that the class has been linked out.

After looking into the macro implementation generating the
mono_class_try_get_[shortname]_class functions it appears that it has
a race where one thread could return NULL even if the class successfully
gets loaded by another racing thread. Initially that might have
been intentionally and callers would need to redo the call, but then
there is no way for callers to distinct between hitting the race and
class not available. Adding to the fact that code would also expect that
some critical classes never fail loading, like what
init_io_stream_slots does, this race ended up in race conditions.

In the past, this particular race in init_io_stream_slots was hidden
due to multiple calls to mono_class_try_get_stream_class minimzing the
risk of a race to cause a crash. That implementation was however
changed by e74da7c
ending up in just one call to mono_class_try_get_stream_class in the
critical paths.

Since code relies on mono_class_try_get_[shortname]_class to succeed
for some critical classes, code asserts if it returns NULL or crashes.
The fix will use acquire/release semantics on the static variable
guarding the cached type and make sure memory order is preserved
on both read and write. Previous implementation adopted a similar approach
but it took a copy of the static tmp_class into a local variable meaning
that memory order would not be guaranteed, even if it applied memory
barriers and volatile keywords on the static variables.

Fix also adds an assert into init_io_stream_slots since it expects
failures in calls to mono_class_try_get_stream_class as fatal.
steveisok pushed a commit that referenced this pull request Feb 12, 2025
Backport of #112282 to release/8.0

We observed several crash reports in dotnet Android coming from
google play store, like this one:

#109921 (comment)

and it happens at the following callstack:

mono_class_setup_vtable_full at /__w/1/s/src/mono/mono/metadata/class-setup-vtable.c:900
init_io_stream_slots at /__w/1/s/src/mono/mono/metadata/icall.c:3378
ves_icall_System_IO_Stream_HasOverriddenBeginEndRead at /__w/1/s/src/mono/mono/metadata/icall.c:3437
 (inlined by) ves_icall_System_IO_Stream_HasOverriddenBeginEndRead_raw at /__w/1/s/src/mono/mono/metadata/../metadata/icall-def.h:228

Looking a little deeper at that that code path expects call to
mono_class_try_get_stream_class to succeed since it calls
mono_class_setup_vtable on returned class pointer. There are other
places where code expectes mono_class_try_get_[shortname]_class to
always succeed or it will assert. Other locations handles the case
where it returns NULL meaning that the class has been linked out.

After looking into the macro implementation generating the
mono_class_try_get_[shortname]_class functions it appears that it has
a race where one thread could return NULL even if the class successfully
gets loaded by another racing thread. Initially that might have
been intentionally and callers would need to redo the call, but then
there is no way for callers to distinct between hitting the race and
class not available. Adding to the fact that code would also expect that
some critical classes never fail loading, like what
init_io_stream_slots does, this race ended up in race conditions.

In the past, this particular race in init_io_stream_slots was hidden
due to multiple calls to mono_class_try_get_stream_class minimzing the
risk of a race to cause a crash. That implementation was however
changed by e74da7c
ending up in just one call to mono_class_try_get_stream_class in the
critical paths.

Since code relies on mono_class_try_get_[shortname]_class to succeed
for some critical classes, code asserts if it returns NULL or crashes.
The fix will use acquire/release semantics on the static variable
guarding the cached type and make sure memory order is preserved
on both read and write. Previous implementation adopted a similar approach
but it took a copy of the static tmp_class into a local variable meaning
that memory order would not be guaranteed, even if it applied memory
barriers and volatile keywords on the static variables.

Fix also adds an assert into init_io_stream_slots since it expects
failures in calls to mono_class_try_get_stream_class as fatal.

Co-authored-by: lateralusX <lateralusx.github@gmail.com>
@alight-rajeshrathod
Copy link

Thanks @steveisok. Our production app continue to report crash because of this issue so would appreciate any timeline to release this hotfix in dotnet 8.0 release.

@steveisok
Copy link
Member

Thanks @steveisok. Our production app continue to report crash because of this issue so would appreciate any timeline to release this hotfix in dotnet 8.0 release.

The timing was unfortunately right after the 8.0.14 cutoff (March release), so it won't be in until April.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants