
AssemblyLoadContext crash on linux during exception processing #81108

Closed
jaredpar opened this issue Jan 24, 2023 · 17 comments

Comments

@jaredpar
Member

Description

The compiler team is seeing a runtime crash when running our analyzer loading tests on Linux. The same tests pass on Windows and macOS. The basic setup is that our tests do the following:

  • Create a separate AssemblyLoadContext for the purpose of isolating the individual test. This loads the compiler binaries and many test assets
  • Create an AssemblyLoadContext in order to load our analyzer dependencies.

On Linux this is resulting in a runtime crash.
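
A minimal sketch of that two-context setup, with illustrative names and paths rather than the actual Roslyn test code:

using System.Reflection;
using System.Runtime.Loader;

// Outer context: isolates the individual test and loads the compiler binaries plus test assets.
// "TestIsolation" and the paths below are placeholders, not the real test values.
var testContext = new AssemblyLoadContext("TestIsolation");
Assembly compiler = testContext.LoadFromAssemblyPath("/path/to/Microsoft.CodeAnalysis.dll");

// Inner context: loads an analyzer and its dependencies from a separate directory.
var analyzerContext = new AssemblyLoadContext("AnalyzerLoader");
Assembly analyzer = analyzerContext.LoadFromAssemblyPath("/path/to/analyzers/MyAnalyzer.dll");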

Reproduction Steps

Run the test Microsoft.CodeAnalysis.UnitTests.DefaultAnalyzerAssemblyLoaderTests.AssemblyLoading_DependencyInDifferentDirectory on Linux. A couple of notes:

  1. Clone https://github.com/jaredpar/roslyn
  2. Sync to SHA e107da471048f014c8ddf3517c9c79a24a1a5adb
  3. This is a Theory test and we're unsure if the crash happens when the bool is true or false

Expected behavior

The test should pass. Based on the stack, it may be throwing an exception that indicates an error on our part. Even in the case of an exception, though, we would expect it to exit with an exception that has a stack trace.

Note: if an exception is being thrown, then it's quite possible that it's an exception type whose assembly is rooted in one of the child AssemblyLoadContexts. The same assembly is also going to be loaded in the default AssemblyLoadContext. Wanted to call that out in case it helps with the investigation.
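
To illustrate that concern with a simplified, hypothetical sketch (the assembly path and type names are placeholders): when the same assembly is loaded into both the default context and a child context, the exception type coming from the child context is a distinct runtime type from the one in the default context, so a catch clause compiled against the default copy would not match it.

using System;
using System.Reflection;
using System.Runtime.Loader;

// Illustrative only: loading the same assembly into two contexts produces two distinct Type identities.
var childContext = new AssemblyLoadContext("Child");
Assembly childCopy = childContext.LoadFromAssemblyPath("/path/to/MyExceptions.dll");
Assembly defaultCopy = AssemblyLoadContext.Default.LoadFromAssemblyPath("/path/to/MyExceptions.dll");

Type childType = childCopy.GetType("MyExceptions.CustomException");
Type defaultType = defaultCopy.GetType("MyExceptions.CustomException");

Console.WriteLine(childType == defaultType);                    // False: different identities
Console.WriteLine(childType.FullName == defaultType.FullName);  // True: same name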

Actual behavior

The runtime crashes. The stack of the crash is as follows:

0:019> k
 # Child-SP          RetAddr               Call Site
00 00007f9b`7c7ef470 00007f9b`812238b3     libpthread_2_27!_waitpid+0x4a
01 00007f9b`7c7ef4a0 00007f9b`8122431f     libcoreclr!PROCCreateCrashDump+0x223 [/__w/1/s/src/coreclr/pal/src/thread/process.cpp @ 2465]
02 00007f9b`7c7ef500 00007f9b`811ec0b0     libcoreclr!PROCCreateCrashDumpIfEnabled+0x2ff [/__w/1/s/src/coreclr/pal/src/thread/process.cpp @ 2654]
03 (Inline Function) --------`--------     libcoreclr!invoke_previous_action+0xd7 [/__w/1/s/src/coreclr/pal/src/exception/signal.cpp @ 15732480]
04 00007f9b`7c7ef580 00007f9b`8223f980     libcoreclr!sigsegv_handler+0x260 [/__w/1/s/src/coreclr/pal/src/exception/signal.cpp @ 15732480]
05 00007f9b`7c7efac0 00007f9b`80e62a94     libpthread_2_27!_restore_rt
06 (Inline Function) --------`--------     libcoreclr!FixupPrecode::Init+0xd [/__w/1/s/src/coreclr/vm/precode.cpp @ 676]
07 (Inline Function) --------`--------     libcoreclr!Precode::Init+0x67 [/__w/1/s/src/coreclr/vm/precode.cpp @ 284]
08 00007f5a`2d1e5a50 00007f9b`80e3b106     libcoreclr!Precode::AllocateTemporaryEntryPoints+0x124 [/__w/1/s/src/coreclr/vm/precode.cpp @ 482]
09 (Inline Function) --------`--------     libcoreclr!MethodDescChunk::CreateTemporaryEntryPoints+0x8 [/__w/1/s/src/coreclr/vm/method.cpp @ 15732480]
0a (Inline Function) --------`--------     libcoreclr!MethodDescChunk::EnsureTemporaryEntryPointsCreated+0x65 [/__w/1/s/src/coreclr/vm/method.hpp @ 2209]
0b 00007f5a`2d1e5af0 00007f9b`80da5647     libcoreclr!MethodDesc::SetTemporaryEntryPoint+0x86 [/__w/1/s/src/coreclr/vm/method.cpp @ 2919]
0c (Inline Function) --------`--------     libcoreclr!ArrayClass::InitArrayMethodDesc+0x7a [/__w/1/s/src/coreclr/vm/methodtable.inl @ 594]
0d 00007f5a`2d1e5b40 00007f9b`80dcebb1     libcoreclr!Module::CreateArrayMethodTable+0xce7 [/__w/1/s/src/coreclr/vm/array.cpp @ 540]
0e 00007f5a`2d1e5eb0 00007f9b`80dce7f8     libcoreclr!ClassLoader::CreateTypeHandleForTypeKey+0x2f1 [/__w/1/s/src/coreclr/vm/clsload.cpp @ 15732480]
0f 00007f5a`2d1e5fc0 00007f9b`80dcf6b8     libcoreclr!ClassLoader::DoIncrementalLoad+0x48 [/__w/1/s/src/coreclr/vm/clsload.cpp @ 2883]
10 00007f5a`2d1e61f0 00007f9b`80dcbc6e     libcoreclr!ClassLoader::LoadTypeHandleForTypeKey_Body+0x418 [/__w/1/s/src/coreclr/vm/clsload.cpp @ 3559]
11 00007f5a`2d1e62c0 00007f9b`80dcb9ff     libcoreclr!ClassLoader::LoadTypeHandleForTypeKey+0xae [/__w/1/s/src/coreclr/vm/clsload.cpp @ 3278]
12 00007f5a`2d1e6340 00007f9b`80dd0024     libcoreclr!ClassLoader::LoadConstructedTypeThrowing+0x22f [/__w/1/s/src/coreclr/vm/clsload.cpp @ 15732480]
13 00007f5a`2d1e63d0 00007f9b`80e72ed3     libcoreclr!ClassLoader::LoadArrayTypeThrowing+0xa4 [/__w/1/s/src/coreclr/vm/clsload.cpp @ 3685]
14 00007f5a`2d1e6440 00007f9b`80e053e0     libcoreclr!SigPointer::GetTypeHandleThrowing+0xda3 [/__w/1/s/src/coreclr/vm/siginfo.cpp @ 1549]
15 00007f5a`2d1e6500 00007f9b`80f31e3b     libcoreclr!Dictionary::PopulateEntry+0x6d0 [/__w/1/s/src/coreclr/vm/genericdict.cpp @ 771]
16 00007f5a`2d1e66b0 00007f9b`80f31fe9     libcoreclr!JIT_GenericHandleWorker+0x1eb [/__w/1/s/src/coreclr/vm/jithelpers.cpp @ 3130]
17 00007f5a`2d1e6750 00007f9b`020fce3d     libcoreclr!JIT_GenericHandle_Framed+0x109 [/__w/1/s/src/coreclr/vm/jithelpers.cpp @ 15732480]
18 00007f5a`2d1e68d0 00007f9b`020fc24f     System_Private_CoreLib!System.Runtime.CompilerServices.ConditionalWeakTable`2+Container[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]]..ctor(System.Runtime.CompilerServices.ConditionalWeakTable`2<System.__Canon,System.__Canon>)+0x5d [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/ConditionalWeakTable.cs @ 473]
19 00007f5a`2d1e6900 00007f9b`01fe21e1     System_Private_CoreLib!System.Runtime.CompilerServices.ConditionalWeakTable`2[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]]..ctor()+0x3f [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/ConditionalWeakTable.cs @ 40]
1a 00007f5a`2d1e6980 00007f9b`01fe26b8     System_Private_CoreLib!System.Diagnostics.StackFrameHelper.InitializeSourceInfo(Int32, Boolean, System.Exception)+0x251 [/_/src/coreclr/System.Private.CoreLib/src/System/Diagnostics/StackFrameHelper.cs @ 129]
1b 00007f5a`2d1e6a60 00007f9b`01ddeb34     System_Private_CoreLib!System.Diagnostics.StackTrace.CaptureStackTrace(Int32, Boolean, System.Exception)+0x48 [/_/src/coreclr/System.Private.CoreLib/src/System/Diagnostics/StackTrace.CoreCLR.cs @ 64]
1c 00007f5a`2d1e6ab0 00000000`00000000     System_Private_CoreLib!System.Exception.get_StackTrace()+0x64 [/_/src/libraries/System.Private.CoreLib/src/System/Exception.cs @ 219]

Regression?

Unknown

Known Workarounds

None

Configuration

Here is the SDK / runtime info

+ dotnet --info
.NET SDK:
 Version:   7.0.100
 Commit:    e12b7af219

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  18.04
 OS Platform: Linux
 RID:         ubuntu.18.04-x64
 Base Path:   /datadisks/disk1/work/A9840934/p/dotnet-cli/sdk/7.0.100/

Host:
  Version:      7.0.0
  Architecture: x64
  Commit:       d099f075e4

.NET SDKs installed:
  7.0.100 [/datadisks/disk1/work/A9840934/p/dotnet-cli/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 7.0.0 [/datadisks/disk1/work/A9840934/p/dotnet-cli/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 7.0.0 [/datadisks/disk1/work/A9840934/p/dotnet-cli/shared/Microsoft.NETCore.App]

Other architectures found:
  None

Environment variables:
  DOTNET_ROOT       [/datadisks/disk1/work/A9840934/p/dotnet-cli]

Other information

@AaronRobinsonMSFT helped a lot in getting this info

@ghost added the untriaged (New issue has not been triaged by the area owner) label on Jan 24, 2023
@jaredpar
Member Author

@elinor-fung are you aware of this bug? It's getting to the point where we're having to disable most assembly loading tests on Linux because it's blocking our PRs. That's a significant gap I'd like to avoid if possible, but no one seems to be looking at this.

jaredpar added a commit to jaredpar/roslyn that referenced this issue Jan 31, 2023
This also has the added benefit that the entire .NET Core set of tests
runs in less than one second. Previously it was close to half a second per
test.

This will hopefully alleviate the pain for the runtime issue we're
hitting running `AssemblyLoadContext` tests in CI. This reduces the
number of compilations on this code path, which seems to be the trigger
for the failure.

- dotnet#66621
- dotnet/runtime#81108
@elinor-fung
Member

Thanks @jaredpar - I had not seen this. Taking a look.

@elinor-fung
Member

For the core dump that has the stack in the description:

precode.cpp@676 is where we are trying to assign to pData->PrecodeFixupThunk in FixupPrecode::Init:

FixupPrecodeData *pData = GetData();
pData->MethodDesc = pMD;
_ASSERTE(GetMethodDesc() == (TADDR)pMD);
pData->Target = (PCODE)pPrecodeRX + FixupPrecode::FixupCodeOffset;
pData->PrecodeFixupThunk = GetPreStubEntryPoint();

where the data should be one page after this = 0x00007f9b3ef04ff0. That maps to

(lldb) di -f
->  0x7f9b80e62a94: movq   %r12, 0x10(%rbx,%rax)
    0x7f9b80e62a99: addq   $0x18, %rbx
    0x7f9b80e62a9d: movq   %r13, %rdi
    0x7f9b80e62aa0: callq  0x7f9b80e35dd0
    0x7f9b80e62aa5: addq   %rax, %r13
    0x7f9b80e62aa8: addl   $-0x1, %r14d
(lldb) register read rbx rax
     rbx = 0x00007f9b3ef04ff0
     rax = 0x0000000000001000

If I'm looking at this correctly, it seems like part of the memory for that FixupPrecodeData is not writeable. I'm not sure how we got to this state though.

struct FixupPrecodeData
{
    PCODE Target;               // offset 0x0  - falls in the rw- region shown below
    MethodDesc *MethodDesc;     // offset 0x8  - falls in the rw- region shown below
    PCODE PrecodeFixupThunk;    // offset 0x10 - lands on the page with no access shown below
};

(lldb) memory region $rbx+$rax
[0x00007f9b3ef05000-0x00007f9b3ef06000) rw-
(lldb) memory region $rbx+$rax+0x8
[0x00007f9b3ef05000-0x00007f9b3ef06000) rw-
(lldb) memory region $rbx+$rax+0x10
[0x00007f9b3ef06000-0x00007f9b3ef10000) ---

I'm not actually sure if !vprot in windbg works properly for linux core dumps, but that showed:

> !vprot 00007f9b3ef04ff0+0x1000+0x10
BaseAddress:       00007f9b3ef06000
AllocationBase:    00007f9b3ef06000
AllocationProtect: 00000000  ***Invalid page protection***
RegionSize:        000000000000a000
State:             00001000  MEM_COMMIT
Protect:           00000000  ***Invalid page protection***
Type:              00020000  MEM_PRIVATE

@janvorli any ideas / does this look familiar at all?

@janvorli
Member

janvorli commented Feb 1, 2023

I have recently fixed a bug that is likely what is happening here. It went in on Jan 26 (#81192). The top of the stack in issue #81103, which I was fixing, looks the same.

jaredpar added a commit to jaredpar/roslyn that referenced this issue Feb 1, 2023
@janvorli
Member

janvorli commented Feb 1, 2023

I will try the repro @jaredpar shared and patch the runtime with the ported fix to see if it is the same thing.

@janvorli
Member

janvorli commented Feb 1, 2023

@jaredpar which branch is your commit in? Just cloning your repo and running git checkout e107da471048f014c8ddf3517c9c79a24a1a5adb doesn't find the commit.

@janvorli
Member

janvorli commented Feb 1, 2023

Looking at the dump in detail, it actually looks like a slightly different issue than the one fixed by #81192, although at the same place. It looks like the memory region allocated from the new stub heap crossed the page boundary, which should never happen as the heap uses interleaved code / data pages.

jaredpar added a commit to jaredpar/roslyn that referenced this issue Feb 1, 2023
@jaredpar
Member Author

jaredpar commented Feb 1, 2023

@janvorli did you clone roslyn or my fork? My fork is where the commit is:

https://github.com/jaredpar/roslyn/commits/e107da471048f014c8ddf3517c9c79a24a1a5adb

@janvorli
Member

janvorli commented Feb 1, 2023

I've cloned your fork (using the URL you shared before):

git remote -v
origin  https://github.com/jaredpar/roslyn (fetch)
origin  https://github.com/jaredpar/roslyn (push)

By default it cloned the master branch, and its state was very old (top commit from July 2016). That's why I assumed the commit must be in some other branch.

@janvorli
Member

janvorli commented Feb 1, 2023

When I open the commit on GitHub, it says:
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

@janvorli
Member

janvorli commented Feb 2, 2023

I was able to repro it locally after @jaredpar helped me get the right state of the repo. The primary cause of the issue is OOM, but a special kind of it. Linux has a maximum number of memory mappings per process, which is by default set to about 64k mappings. This test ends up creating more than that, so further mmap calls fail.
The issue here has several failure modes that are caused by a couple of issues that were already fixed in the main branch (#81192, #78069). It seems that there is also one more failure mode that was not covered by either of those fixes.
However, even with the fixes, the tests would still fail due to the OOM; they would just not crash at various seemingly unrelated places. I actually wonder if we have some leak in unloadability that results in the large number of mappings. There were some changes in assembly loading on Linux that I think could possibly cause such a thing. I am starting to investigate that.

There is a workaround though. The max number of mappings per process can be enlarged. For example, raising the number to 2048000 can be done as follows:

sudo bash -c "sysctl -w 'vm.max_map_count=2048000'"

I have verified that with the setting raised to 128000, the failing test consistently succeeds. 96000 was not sufficient. I didn't try to bisect it further. Setting the limit higher has no impact on memory consumption or performance.
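
As a quick, illustrative way to see how close a process is to that limit (not part of the repro): the kernel exposes one line per mapping in /proc/self/maps, so the current count can be read directly and compared against vm.max_map_count.

using System;
using System.IO;
using System.Linq;

// Illustrative only: count this process's memory mappings on Linux.
// /proc/self/maps has one line per mapping; the default vm.max_map_count is roughly 64k.
int mappingCount = File.ReadLines("/proc/self/maps").Count();
Console.WriteLine($"Current memory mappings: {mappingCount}");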

@jaredpar
Member Author

jaredpar commented Feb 2, 2023

This change should also alleviate the underlying OOM situation, as it dramatically reduces the number of allocations that happen, reduces the number of assemblies that get loaded into AssemblyLoadContexts, and takes advantage of unloading:

dotnet/roslyn#66612

Still, I'm a bit surprised that we were hitting an OOM in the original code. Yes, we create a lot of compilations, but in the grand scheme of our unit tests it's a drop in the bucket. The item that is significantly different from the rest of our tests is the number of non-collectible AssemblyLoadContexts that get created. I'm puzzled as to why that pushed us to OOM so fast.
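
For reference, a minimal sketch of the unload pattern being adopted (illustrative names, not the Roslyn implementation): the context is created as collectible, unloaded explicitly, and a WeakReference is used to observe when it has actually been reclaimed.

using System;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Runtime.Loader;

static class UnloadSketch
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static WeakReference LoadAndUnload(string assemblyPath)
    {
        // isCollectible: true allows the context and everything it loaded to be reclaimed.
        var alc = new AssemblyLoadContext("TestIsolation", isCollectible: true);
        Assembly assembly = alc.LoadFromAssemblyPath(assemblyPath);
        // ... exercise the assembly here ...
        alc.Unload();
        return new WeakReference(alc);
    }

    static void Main(string[] args)
    {
        WeakReference weakRef = LoadAndUnload(args[0]);
        // Unload is cooperative: the context goes away only once nothing references it and the GC runs.
        for (int i = 0; weakRef.IsAlive && i < 10; i++)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
        Console.WriteLine($"Context collected: {!weakRef.IsAlive}");
    }
}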

@janvorli
Member

janvorli commented Feb 2, 2023

For every 170 jitted functions, we need 2 mappings for precode stubs. For each assembly, we need one mapping per section. @jaredpar do you have an estimate on how many assemblies get loaded when running the test and how many methods got jitted? From what you've said, it seems that the version that I was testing was using non-collectible AssemblyLoadContexts. So each assembly that got loaded into one and all of its dependencies that were loaded in there too would contribute to that.

@jaredpar
Member Author

jaredpar commented Feb 2, 2023

For each assembly, we need one mapping per section. @jaredpar do you have an estimate on how many assemblies get loaded when running the test and how many methods got jitted

I think there were roughly 20-30 assemblies that got loaded into every AssemblyLoadContext that was created. A full test run probably created ~60 AssemblyLoadContexts. As for how many methods got jitted ... a lot. Even simple programs like "Hello World" tend to JIT significant portions of the compiler code base.

If the suspicion is that JIT'd methods are the driver, then that lines up with what I'm seeing. What makes this group of tests different from our others is the number of AssemblyLoadContexts created plus the sheer number of DLLs that we loaded and JIT'd. That will all be fixed shortly, so this should cease to be a problem for us.
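
As a rough, hedged back-of-envelope combining the figures in this thread (the sections-per-assembly value is purely an assumption for illustration; JIT'd code then adds roughly 2 more mappings per 170 jitted methods on top of this):

using System;

// Illustration only; sectionsPerAssembly is an assumed value, not a measurement.
const int loadContexts = 60;          // ~60 AssemblyLoadContexts per full test run
const int assembliesPerContext = 25;  // ~20-30 assemblies loaded into each context
const int sectionsPerAssembly = 5;    // assumed: one mapping per section per loaded assembly

int assemblyMappings = loadContexts * assembliesPerContext * sectionsPerAssembly;
Console.WriteLine($"Assembly mappings alone: {assemblyMappings}"); // 7,500 before counting any JIT'd code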

jaredpar added a commit to jaredpar/roslyn that referenced this issue Feb 2, 2023
jaredpar pushed a commit to dotnet/roslyn that referenced this issue Feb 2, 2023
* Continued analyzer assembly loading work

This change responds to several concerns raised in #66492:

- Collapse the `AnalyzerAssemblyLoader` hierarchy by one level and move
  more logic into the base type.
- Fix `AssemblyLoadContext` in tests to not replicate types. That means
  we can safely pass `ITestOutputHelper` into the custom
  `AssemblyLoadContext`
- Further strengthen the test separation

I have deliberately not changed the file names to reflect the new type
names because I wanted the review process to be smooth. Once I have two
sign offs for the change I will send another commit which does the
following:

- Rename
    - DefaultAnalyzerAssemblyLoaderTests.cs -> AnalyzerAssemblyLoaderTests.cs
    - DefaultAnalyzerAssemblyLoader.Core.cs -> AnalyzerAssemblyLoader.Core.cs
    - DefaultAnalyzerAssemblyLoader.Desktop.cs -> AnalyzerAssemblyLoader.Desktop.cs
    - DefaultAnalyzerAssemblyLoader.cs -> AnalyzerAssemblyLoader.cs
- Move
    - `DefaultAnalyzerAssemblyLoader` into DefaultAnalyzerAssemblyLoader.cs
    - `InvokeUtil` to InvokeUtil.cs. As constructed every edit to this
      type causes Test Explorer to lose context on the theory branches.
      Doesn't come back until rebuild. Will be smoother editing in its own
      file.

That should get us into the style we expect here. After that I will
merge once tests are green.

* Unload AssemblyLoadContext after test completion

* Reuse AssemblyLoadTestFixture across test runs

This also has the added benefit that the entire .NET Core set of tests
runs in less than one second. Previously it was close to half a second per
test.

This will hopefully alleviate the pain for the runtime issue we're
hitting running `AssemblyLoadContext` tests in CI. This reduces the
number of compilations on this code path, which seems to be the trigger
for the failure.

- #66621
- dotnet/runtime#81108

* Apply suggestions from code review

Co-authored-by: Rikki Gibson <rikkigibson@gmail.com>

* PR feedback

* rename the files

* Fixup

---------

Co-authored-by: Rikki Gibson <rikkigibson@gmail.com>
@mangod9
Member

mangod9 commented Jul 24, 2023

Is this issue still actionable, or can it be closed?

@mangod9 removed the untriaged (New issue has not been triaged by the area owner) label on Jul 24, 2023
@mangod9 added this to the Future milestone on Jul 24, 2023
@jaredpar
Member Author

This is no longer a blocker for the compiler after we moved to collectible AssemblyLoadContexts. I don't know if @janvorli wanted to keep this open to note the OOM issues that happen when you push AssemblyLoadContext usage to the extreme we previously did.

@janvorli
Member

We can close it then.

@ghost locked as resolved and limited conversation to collaborators on Aug 31, 2023