
AssemblyLoadContext crash on linux during exception processing #81108

Closed
jaredpar opened this issue Jan 24, 2023 · 17 comments

Comments

@jaredpar
Member

Description

The compiler team is seeing a runtime crash when running our analyzer loading tests on Linux. The same tests pass on Windows and macOS. The basic setup is that our tests do the following:

  • Create a separate AssemblyLoadContext for the purpose of isolating the individual test. This loads the compiler binaries and many test assets
  • Create an AssemblyLoadContext in order to load our analyzer dependencies.

On Linux this is resulting in a runtime crash.
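
A minimal sketch of that two-context setup, with illustrative names and paths rather than the actual Roslyn test code:

using System.Reflection;
using System.Runtime.Loader;

// Outer context: isolates the individual test and loads the compiler binaries plus test assets.
// "TestIsolation" and the paths below are placeholders, not the real test values.
var testContext = new AssemblyLoadContext("TestIsolation");
Assembly compiler = testContext.LoadFromAssemblyPath("/path/to/Microsoft.CodeAnalysis.dll");

// Inner context: loads an analyzer and its dependencies from a separate directory.
var analyzerContext = new AssemblyLoadContext("AnalyzerLoader");
Assembly analyzer = analyzerContext.LoadFromAssemblyPath("/path/to/analyzers/MyAnalyzer.dll");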

Reproduction Steps

Run the test Microsoft.CodeAnalysis.UnitTests.DefaultAnalyzerAssemblyLoaderTests.AssemblyLoading_DependencyInDifferentDirectory on Linux. A couple of notes:

  1. Clone https://github.com/jaredpar/roslyn
  2. Sync to SHA e107da471048f014c8ddf3517c9c79a24a1a5adb
  3. This is a Theory test and we're unsure if the crash happens when the bool is true or false

Expected behavior

The test should pass. Based on the stack, it may be throwing an exception that indicates an error on our part. Even in the case of an exception, though, we would expect it to exit with an exception that has a stack trace.

Note: if an exception is being thrown, then it's quite possible that it's an exception type whose assembly is rooted in one of the child AssemblyLoadContexts. The same assembly is also going to be loaded in the default AssemblyLoadContext. Wanted to call that out in case it helps with the investigation.
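
To illustrate that concern with a simplified, hypothetical sketch (the assembly path and type names are placeholders): when the same assembly is loaded into both the default context and a child context, the exception type coming from the child context is a distinct runtime type from the one in the default context, so a catch clause compiled against the default copy would not match it.

using System;
using System.Reflection;
using System.Runtime.Loader;

// Illustrative only: loading the same assembly into two contexts produces two distinct Type identities.
var childContext = new AssemblyLoadContext("Child");
Assembly childCopy = childContext.LoadFromAssemblyPath("/path/to/MyExceptions.dll");
Assembly defaultCopy = AssemblyLoadContext.Default.LoadFromAssemblyPath("/path/to/MyExceptions.dll");

Type childType = childCopy.GetType("MyExceptions.CustomException");
Type defaultType = defaultCopy.GetType("MyExceptions.CustomException");

Console.WriteLine(childType == defaultType);                    // False: different identities
Console.WriteLine(childType.FullName == defaultType.FullName);  // True: same name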

Actual behavior

The runtime crashes. The stack of the crash is as follows:

0:019> k
 # Child-SP          RetAddr               Call Site
00 00007f9b`7c7ef470 00007f9b`812238b3     libpthread_2_27!_waitpid+0x4a
01 00007f9b`7c7ef4a0 00007f9b`8122431f     libcoreclr!PROCCreateCrashDump+0x223 [/__w/1/s/src/coreclr/pal/src/thread/process.cpp @ 2465]
02 00007f9b`7c7ef500 00007f9b`811ec0b0     libcoreclr!PROCCreateCrashDumpIfEnabled+0x2ff [/__w/1/s/src/coreclr/pal/src/thread/process.cpp @ 2654]
03 (Inline Function) --------`--------     libcoreclr!invoke_previous_action+0xd7 [/__w/1/s/src/coreclr/pal/src/exception/signal.cpp @ 15732480]
04 00007f9b`7c7ef580 00007f9b`8223f980     libcoreclr!sigsegv_handler+0x260 [/__w/1/s/src/coreclr/pal/src/exception/signal.cpp @ 15732480]
05 00007f9b`7c7efac0 00007f9b`80e62a94     libpthread_2_27!_restore_rt
06 (Inline Function) --------`--------     libcoreclr!FixupPrecode::Init+0xd [/__w/1/s/src/coreclr/vm/precode.cpp @ 676]
07 (Inline Function) --------`--------     libcoreclr!Precode::Init+0x67 [/__w/1/s/src/coreclr/vm/precode.cpp @ 284]
08 00007f5a`2d1e5a50 00007f9b`80e3b106     libcoreclr!Precode::AllocateTemporaryEntryPoints+0x124 [/__w/1/s/src/coreclr/vm/precode.cpp @ 482]
09 (Inline Function) --------`--------     libcoreclr!MethodDescChunk::CreateTemporaryEntryPoints+0x8 [/__w/1/s/src/coreclr/vm/method.cpp @ 15732480]
0a (Inline Function) --------`--------     libcoreclr!MethodDescChunk::EnsureTemporaryEntryPointsCreated+0x65 [/__w/1/s/src/coreclr/vm/method.hpp @ 2209]
0b 00007f5a`2d1e5af0 00007f9b`80da5647     libcoreclr!MethodDesc::SetTemporaryEntryPoint+0x86 [/__w/1/s/src/coreclr/vm/method.cpp @ 2919]
0c (Inline Function) --------`--------     libcoreclr!ArrayClass::InitArrayMethodDesc+0x7a [/__w/1/s/src/coreclr/vm/methodtable.inl @ 594]
0d 00007f5a`2d1e5b40 00007f9b`80dcebb1     libcoreclr!Module::CreateArrayMethodTable+0xce7 [/__w/1/s/src/coreclr/vm/array.cpp @ 540]
0e 00007f5a`2d1e5eb0 00007f9b`80dce7f8     libcoreclr!ClassLoader::CreateTypeHandleForTypeKey+0x2f1 [/__w/1/s/src/coreclr/vm/clsload.cpp @ 15732480]
0f 00007f5a`2d1e5fc0 00007f9b`80dcf6b8     libcoreclr!ClassLoader::DoIncrementalLoad+0x48 [/__w/1/s/src/coreclr/vm/clsload.cpp @ 2883]
10 00007f5a`2d1e61f0 00007f9b`80dcbc6e     libcoreclr!ClassLoader::LoadTypeHandleForTypeKey_Body+0x418 [/__w/1/s/src/coreclr/vm/clsload.cpp @ 3559]
11 00007f5a`2d1e62c0 00007f9b`80dcb9ff     libcoreclr!ClassLoader::LoadTypeHandleForTypeKey+0xae [/__w/1/s/src/coreclr/vm/clsload.cpp @ 3278]
12 00007f5a`2d1e6340 00007f9b`80dd0024     libcoreclr!ClassLoader::LoadConstructedTypeThrowing+0x22f [/__w/1/s/src/coreclr/vm/clsload.cpp @ 15732480]
13 00007f5a`2d1e63d0 00007f9b`80e72ed3     libcoreclr!ClassLoader::LoadArrayTypeThrowing+0xa4 [/__w/1/s/src/coreclr/vm/clsload.cpp @ 3685]
14 00007f5a`2d1e6440 00007f9b`80e053e0     libcoreclr!SigPointer::GetTypeHandleThrowing+0xda3 [/__w/1/s/src/coreclr/vm/siginfo.cpp @ 1549]
15 00007f5a`2d1e6500 00007f9b`80f31e3b     libcoreclr!Dictionary::PopulateEntry+0x6d0 [/__w/1/s/src/coreclr/vm/genericdict.cpp @ 771]
16 00007f5a`2d1e66b0 00007f9b`80f31fe9     libcoreclr!JIT_GenericHandleWorker+0x1eb [/__w/1/s/src/coreclr/vm/jithelpers.cpp @ 3130]
17 00007f5a`2d1e6750 00007f9b`020fce3d     libcoreclr!JIT_GenericHandle_Framed+0x109 [/__w/1/s/src/coreclr/vm/jithelpers.cpp @ 15732480]
18 00007f5a`2d1e68d0 00007f9b`020fc24f     System_Private_CoreLib!System.Runtime.CompilerServices.ConditionalWeakTable`2+Container[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]]..ctor(System.Runtime.CompilerServices.ConditionalWeakTable`2<System.__Canon,System.__Canon>)+0x5d [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/ConditionalWeakTable.cs @ 473]
19 00007f5a`2d1e6900 00007f9b`01fe21e1     System_Private_CoreLib!System.Runtime.CompilerServices.ConditionalWeakTable`2[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]]..ctor()+0x3f [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/ConditionalWeakTable.cs @ 40]
1a 00007f5a`2d1e6980 00007f9b`01fe26b8     System_Private_CoreLib!System.Diagnostics.StackFrameHelper.InitializeSourceInfo(Int32, Boolean, System.Exception)+0x251 [/_/src/coreclr/System.Private.CoreLib/src/System/Diagnostics/StackFrameHelper.cs @ 129]
1b 00007f5a`2d1e6a60 00007f9b`01ddeb34     System_Private_CoreLib!System.Diagnostics.StackTrace.CaptureStackTrace(Int32, Boolean, System.Exception)+0x48 [/_/src/coreclr/System.Private.CoreLib/src/System/Diagnostics/StackTrace.CoreCLR.cs @ 64]
1c 00007f5a`2d1e6ab0 00000000`00000000     System_Private_CoreLib!System.Exception.get_StackTrace()+0x64 [/_/src/libraries/System.Private.CoreLib/src/System/Exception.cs @ 219]

Regression?

Unknown

Known Workarounds

None

Configuration

Here is the SDK / runtime info

+ dotnet --info
.NET SDK:
 Version:   7.0.100
 Commit:    e12b7af219

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  18.04
 OS Platform: Linux
 RID:         ubuntu.18.04-x64
 Base Path:   /datadisks/disk1/work/A9840934/p/dotnet-cli/sdk/7.0.100/

Host:
  Version:      7.0.0
  Architecture: x64
  Commit:       d099f075e4

.NET SDKs installed:
  7.0.100 [/datadisks/disk1/work/A9840934/p/dotnet-cli/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 7.0.0 [/datadisks/disk1/work/A9840934/p/dotnet-cli/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 7.0.0 [/datadisks/disk1/work/A9840934/p/dotnet-cli/shared/Microsoft.NETCore.App]

Other architectures found:
  None

Environment variables:
  DOTNET_ROOT       [/datadisks/disk1/work/A9840934/p/dotnet-cli]

Other information

@AaronRobinsonMSFT helped a lot in getting this info

@ghost added the untriaged (New issue has not been triaged by the area owner) label on Jan 24, 2023
@jaredpar
Member Author

@elinor-fung are you aware of this bug? It's getting to the point where we're having to disable most assembly loading tests on Linux because it's blocking our PRs. That's a significant gap I'd like to avoid if possible, but no one seems to be looking at this.

jaredpar added a commit to jaredpar/roslyn that referenced this issue Jan 31, 2023
This also has the added benefit that the entire .NET Core set of tests
runs in less than one second. Previously it was close to half a second per
test.

This will hopefully alleviate the pain for the runtime issue we're
hitting running `AssemblyLoadContext` tests in CI. This reduces the
number of compilations on this code path, which seems to be the trigger
for the failure.

- dotnet#66621
- dotnet/runtime#81108
@elinor-fung
Member

Thanks @jaredpar - I had not seen this. Taking a look.

@elinor-fung
Member

For the core dump that has the stack in the description:

precode.cpp@676 is where we are trying to assign to pData->PrecodeFixupThunk in FixupPrecode::Init:

FixupPrecodeData *pData = GetData();
pData->MethodDesc = pMD;
_ASSERTE(GetMethodDesc() == (TADDR)pMD);
pData->Target = (PCODE)pPrecodeRX + FixupPrecode::FixupCodeOffset;
pData->PrecodeFixupThunk = GetPreStubEntryPoint();

where the data should be one page after this = 0x00007f9b3ef04ff0. That maps to

(lldb) di -f
->  0x7f9b80e62a94: movq   %r12, 0x10(%rbx,%rax)
    0x7f9b80e62a99: addq   $0x18, %rbx
    0x7f9b80e62a9d: movq   %r13, %rdi
    0x7f9b80e62aa0: callq  0x7f9b80e35dd0
    0x7f9b80e62aa5: addq   %rax, %r13
    0x7f9b80e62aa8: addl   $-0x1, %r14d
(lldb) register read rbx rax
     rbx = 0x00007f9b3ef04ff0
     rax = 0x0000000000001000

If I'm looking at this correctly, it seems like part of the memory for that FixupPrecodeData is not writeable. I'm not sure how we got to this state though.

struct FixupPrecodeData
{
    PCODE Target;               // offset 0x0  - falls in the rw- region shown below
    MethodDesc *MethodDesc;     // offset 0x8  - falls in the rw- region shown below
    PCODE PrecodeFixupThunk;    // offset 0x10 - lands on the page with no access shown below
};

(lldb) memory region $rbx+$rax
[0x00007f9b3ef05000-0x00007f9b3ef06000) rw-
(lldb) memory region $rbx+$rax+0x8
[0x00007f9b3ef05000-0x00007f9b3ef06000) rw-
(lldb) memory region $rbx+$rax+0x10
[0x00007f9b3ef06000-0x00007f9b3ef10000) ---

I'm not actually sure if !vprot in windbg works properly for linux core dumps, but that showed:

> !vprot 00007f9b3ef04ff0+0x1000+0x10
BaseAddress:       00007f9b3ef06000
AllocationBase:    00007f9b3ef06000
AllocationProtect: 00000000  ***Invalid page protection***
RegionSize:        000000000000a000
State:             00001000  MEM_COMMIT
Protect:           00000000  ***Invalid page protection***
Type:              00020000  MEM_PRIVATE

@janvorli any ideas / does this look familiar at all?

@janvorli
Member

janvorli commented Feb 1, 2023

I have recently fixed a bug that is likely what is happening here. It went in on Jan 26 (#81192). The top of the stack in issue #81103, which I was fixing, looks the same.

jaredpar added a commit to jaredpar/roslyn that referenced this issue Feb 1, 2023
@janvorli
Member

janvorli commented Feb 1, 2023

I will try the repro @jaredpar shared and patch the runtime with the ported fix to see if it is the same thing.

@janvorli
Member

janvorli commented Feb 1, 2023

@jaredpar which branch is your commit in? Just cloning your repo and running git checkout e107da471048f014c8ddf3517c9c79a24a1a5adb doesn't find the commit.

@janvorli
Member

janvorli commented Feb 1, 2023

Looking at the dump in detail, it actually looks like a slightly different issue than the one fixed by #81192, although at the same place. It looks like the memory region allocated from the new stub heap crossed the page boundary, which should never happen as the heap uses interleaved code / data pages.

jaredpar added a commit to jaredpar/roslyn that referenced this issue Feb 1, 2023
@jaredpar
Member Author

jaredpar commented Feb 1, 2023

@janvorli did you clone roslyn or my fork? My fork is where the commit is:

https://github.com/jaredpar/roslyn/commits/e107da471048f014c8ddf3517c9c79a24a1a5adb

@janvorli
Member

janvorli commented Feb 1, 2023

I've cloned your fork (using the URL you shared before):

git remote -v
origin  https://github.com/jaredpar/roslyn (fetch)
origin  https://github.com/jaredpar/roslyn (push)

By default it cloned the master branch, and its state was very old (top commit from July 2016). That's why I assumed the commit must be in some other branch.

@janvorli
Member

janvorli commented Feb 1, 2023

When I open the commit on GitHub, it says:
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

@janvorli
Member

janvorli commented Feb 2, 2023

I was able to repro it locally after @jaredpar helped me get the right state of the repo. The primary cause of the issue is OOM, but a special kind of it. Linux has a maximum number of memory mappings per process, which is by default set to about 64k mappings. This test ends up creating more than that, so further mmap calls fail.
The issue here has several failure modes that are caused by a couple of issues that were already fixed in the main branch (#81192, #78069). It seems that there is also one more failure mode that was not covered by either of those fixes.
However, even with the fixes, the tests would still fail due to the OOM; they would just not crash at various seemingly unrelated places. I actually wonder if we have some leak in unloadability that results in the large number of mappings. There were some changes in assembly loading on Linux that I think could possibly cause such a thing. I am starting to investigate that.

There is a workaround though. The max number of mappings per process can be enlarged. For example, raising the number to 2048000 can be done as follows:

sudo bash -c "sysctl -w 'vm.max_map_count=2048000'"

I have verified that with the setting raised to 128000, the failing test consistently succeeds. 96000 was not sufficient. I didn't try to bisect it further. Setting the limit higher has no impact on memory consumption or performance.
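
As a quick, illustrative way to see how close a process is to that limit (not part of the repro): the kernel exposes one line per mapping in /proc/self/maps, so the current count can be read directly and compared against vm.max_map_count.

using System;
using System.IO;
using System.Linq;

// Illustrative only: count this process's memory mappings on Linux.
// /proc/self/maps has one line per mapping; the default vm.max_map_count is roughly 64k.
int mappingCount = File.ReadLines("/proc/self/maps").Count();
Console.WriteLine($"Current memory mappings: {mappingCount}");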

@jaredpar
Member Author

jaredpar commented Feb 2, 2023

This change should also alleviate the underlying OOM situation, as it dramatically reduces the number of allocations that happen, reduces the number of assemblies that get loaded into AssemblyLoadContexts, and takes advantage of unloading:

dotnet/roslyn#66612

Still, I'm a bit surprised that we were hitting an OOM in the original code. Yes, we create a lot of compilations, but in the grand scheme of our unit tests it's a drop in the bucket. The item that is significantly different from the rest of our tests is the number of non-collectible AssemblyLoadContexts that get created. I'm puzzled as to why that pushed us to OOM so fast.
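
For reference, a minimal sketch of the unload pattern being adopted (illustrative names, not the Roslyn implementation): the context is created as collectible, unloaded explicitly, and a WeakReference is used to observe when it has actually been reclaimed.

using System;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Runtime.Loader;

static class UnloadSketch
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static WeakReference LoadAndUnload(string assemblyPath)
    {
        // isCollectible: true allows the context and everything it loaded to be reclaimed.
        var alc = new AssemblyLoadContext("TestIsolation", isCollectible: true);
        Assembly assembly = alc.LoadFromAssemblyPath(assemblyPath);
        // ... exercise the assembly here ...
        alc.Unload();
        return new WeakReference(alc);
    }

    static void Main(string[] args)
    {
        WeakReference weakRef = LoadAndUnload(args[0]);
        // Unload is cooperative: the context goes away only once nothing references it and the GC runs.
        for (int i = 0; weakRef.IsAlive && i < 10; i++)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
        Console.WriteLine($"Context collected: {!weakRef.IsAlive}");
    }
}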

@janvorli
Member

janvorli commented Feb 2, 2023

For every 170 jitted functions, we need 2 mappings for precode stubs. For each assembly, we need one mapping per section. @jaredpar do you have an estimate on how many assemblies get loaded when running the test and how many methods got jitted? From what you've said, it seems that the version that I was testing was using non-collectible AssemblyLoadContexts. So each assembly that got loaded into one and all of its dependencies that were loaded in there too would contribute to that.

@jaredpar
Member Author

jaredpar commented Feb 2, 2023

For each assembly, we need one mapping per section. @jaredpar do you have an estimate on how many assemblies get loaded when running the test and how many methods got jitted

I think there were roughly 20-30 assemblies that got loaded into every AssemblyLoadContext that was created. A full test run probably created ~60 AssemblyLoadContexts. As for how many methods got jitted ... a lot. Even simple programs like "Hello World" tend to JIT significant portions of the compiler code base.

If the suspicion is that JIT'd methods are the driver, then that lines up with what I'm seeing. What makes this group of tests different from our others is the number of AssemblyLoadContexts created plus the sheer number of DLLs that we loaded and JIT'd. That will all be fixed shortly, so this should cease to be a problem for us.
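
As a rough, hedged back-of-envelope combining the figures in this thread (the sections-per-assembly value is purely an assumption for illustration; JIT'd code then adds roughly 2 more mappings per 170 jitted methods on top of this):

using System;

// Illustration only; sectionsPerAssembly is an assumed value, not a measurement.
const int loadContexts = 60;          // ~60 AssemblyLoadContexts per full test run
const int assembliesPerContext = 25;  // ~20-30 assemblies loaded into each context
const int sectionsPerAssembly = 5;    // assumed: one mapping per section per loaded assembly

int assemblyMappings = loadContexts * assembliesPerContext * sectionsPerAssembly;
Console.WriteLine($"Assembly mappings alone: {assemblyMappings}"); // 7,500 before counting any JIT'd code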

jaredpar added a commit to jaredpar/roslyn that referenced this issue Feb 2, 2023
jaredpar pushed a commit to dotnet/roslyn that referenced this issue Feb 2, 2023
* Continued analyzer assembly loading work

This change responds to several concerns raised in #66492:

- Collapse the `AnalyzerAssemblyLoader` hierarchy by one level and move
  more logic into the base type.
- Fix `AssemblyLoadContext` in tests to not replicate types. That means
  we can safely pass `ITestOutputHelper` into the custom
  `AssemblyLoadContext`
- Further strengthen the test separation

I have deliberately not changed the file names to reflect the new type
names because I wanted the review process to be smooth. Once I have two
sign offs for the change I will send another commit which does the
following:

- Rename
    - DefaultAnalyzerAssemblyLoaderTests.cs -> AnalyzerAssemblyLoaderTests.cs
    - DefaultAnalyzerAssemblyLoader.Core.cs -> AnalyzerAssemblyLoader.Core.cs
    - DefaultAnalyzerAssemblyLoader.Desktop.cs -> AnalyzerAssemblyLoader.Desktop.cs
    - DefaultAnalyzerAssemblyLoader.cs -> AnalyzerAssemblyLoader.cs
- Move
    - `DefaultAnalyzerAssemblyLoader` into DefaultAnalyzerAssemblyLoader.cs
    - `InvokeUtil` to InvokeUtil.cs. As constructed every edit to this
      type causes Test Explorer to lose context on the theory branches.
      Doesn't come back until rebuild. Will be smoother editing in its own
      file.

That should get us into the style we expect here. After that I will
merge once tests are green.

* Unload AssemblyLoadContext after test completion

* Reuse AssemblyLoadTestFixture across test runs

This also has the added benefit that the entire .NET Core set of tests
runs in less than one second. Previously it was close to half a second per
test.

This will hopefully alleviate the pain for the runtime issue we're
hitting running `AssemblyLoadContext` tests in CI. This reduces the
number of compilations on this code path, which seems to be the trigger
for the failure.

- #66621
- dotnet/runtime#81108

* Apply suggestions from code review

Co-authored-by: Rikki Gibson <rikkigibson@gmail.com>

* PR feedback

* rename the files

* Fixup

---------

Co-authored-by: Rikki Gibson <rikkigibson@gmail.com>
@mangod9
Member

mangod9 commented Jul 24, 2023

Is this issue still actionable, or can it be closed?

@mangod9 removed the untriaged (New issue has not been triaged by the area owner) label on Jul 24, 2023
@mangod9 added this to the Future milestone on Jul 24, 2023
@jaredpar
Member Author

This is no longer a blocker for the compiler after we moved to collectible AssemblyLoadContexts. I don't know if @janvorli wanted to keep this open to note the OOM issues that happen when you push AssemblyLoadContext usage to the extreme we previously did.

@janvorli
Member

We can close it then.

@ghost locked as resolved and limited conversation to collaborators on Aug 31, 2023