Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BinderTracingTest.ResolutionFlow times out #104670

Open
jakobbotsch opened this issue Jul 10, 2024 · 9 comments
Open

BinderTracingTest.ResolutionFlow times out #104670

jakobbotsch opened this issue Jul 10, 2024 · 9 comments
Assignees
Milestone

Comments

@jakobbotsch
Copy link
Member

jakobbotsch commented Jul 10, 2024

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=735589
Build error leg or test failing: Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd
Pull request: #104603

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "",
  "ErrorPattern": "BinderTracingTest.ResolutionFlow.* Timed Out",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=735589
Error message validated: [BinderTracingTest.ResolutionFlow.* Timed Out]
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 7/10/2024 11:13:33 AM UTC

Report

Build Definition Test Pull Request
879584 dotnet/runtime Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd #110117
875008 dotnet/runtime Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd #109998
869615 dotnet/runtime Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 2 3
@jakobbotsch jakobbotsch added blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' Known Build Error Use this to report build issues in the .NET Helix tab labels Jul 10, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jul 10, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-infrastructure-libraries
See info in area-owners.md if you want to be subscribed.

@jakobbotsch
Copy link
Member Author

cc @davmason

@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Jul 12, 2024
@tommcdon tommcdon added this to the 9.0.0 milestone Jul 12, 2024
@mdh1418
Copy link
Member

mdh1418 commented Jul 15, 2024

This seems to be a continuation of #97735 and #94390. Given that the BinderTracingTests are already skipped for Jitstress and GCStress, there is probably another culprit for these tests timing out.

It seems flakey given the low hit count, and that previous counterpart #97735 at one point had 0 hit counts in 30 days.

I haven't been able to repro the timeout locally, and given how flakey it is on CI, I'm not sure if I'd be able to reliably repro in CI. From this build instance it seems like this is causing the hang


based off of

0:000> !clrstack -f -all
OS Thread Id: 0x23ec
Child SP       IP Call Site
0097E0EC 77B3F3EC ntdll!NtWaitForMultipleObjects + 12
0097E284 72246A85 coreclr!Thread::DoAppropriateAptStateWait + 199
0097E308 72246FB3 coreclr!Thread::DoAppropriateWaitWorker + 998
0097E464 7224D4AD coreclr!`Thread::DoAppropriateWait'::`9'::__Body::Run + 90
0097E4B8 72246B3C coreclr!Thread::DoAppropriateWait + 149
0097E51C 722BEC16 coreclr!WaitHandleNative::CorWaitOneNative + 294
0097E560          [HelperMethodFrame: 0097e560] System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOneCore(IntPtr, Int32, Boolean)
0097E5F4 71022A5C System_Private_CoreLib!System.Boolean System.Threading.WaitHandle::WaitOneNoCheck(System.Int32, System.Boolean, System.Object, System.Diagnostics.Tracing.NativeRuntimeEventSource+WaitHandleWaitSourceMap)$##6003CB2 + 188
0097E5F8 71022A5C System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOneNoCheck(Int32, Boolean, System.Object, WaitHandleWaitSourceMap) + 188
0097E630 71022984 System_Private_CoreLib!System.Boolean System.Threading.WaitHandle::WaitOne(System.Int32)$##6003CB1 + 20
0097E63C 71022984 System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOne(Int32) + 20
0097E644 094C5B7B system.diagnostics.process.dll!System.Diagnostics.Process.WaitForExitCore(Int32) + 123
0097E67C 094C0800 BinderTracingTest.ResolutionFlow.dll!BinderTracingTests.BinderTracingTest.RunTestInSeparateProcess(System.Reflection.MethodInfo) + 816
0097E6DC 0817237E BinderTracingTest.ResolutionFlow.dll!BinderTracingTests.BinderTracingTest.RunAllTests() + 446
0097E704 0817204F BinderTracingTest.ResolutionFlow.dll!BinderTracingTests.BinderTracingTest.Main(System.String[]) + 39

From looking at what BinderTracingTests does, I'm not quite sure what is causing the separate subprocess to hang. Given how this test is still hanging even without GCStress/Jitstress, I would've expected for the test to hit the BinderEventListener's 30s timeout at

int waitTimeoutInMs = Environment.GetEnvironmentVariable("DOTNET_GCStress") == null
? 30 * 1000
if the test made it to
ValidateSingleBind(listener, expected.AssemblyName, expected);

@elinor-fung / @davmason any other ideas on what might be causing the hang?

@tommcdon tommcdon removed the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Jul 29, 2024
@mdh1418
Copy link
Member

mdh1418 commented Aug 1, 2024

762118 was from a PR that caused a deadlock, so it isn't the same cause as the first singluar hit in 735589

@mdh1418 mdh1418 modified the milestones: 9.0.0, 10.0.0 Aug 2, 2024
@jakobbotsch
Copy link
Member Author

@mdh1418
Copy link
Member

mdh1418 commented Aug 5, 2024

Hit in https://dev.azure.com/dnceng-public/public/_build/results?buildId=765128&view=results

Isn't that a jitstress pipeline? Why did the test get run there if it was marked JitOptimizationSensitive=true in #102842?

In any case, the latest failure console log shows a hang presumable when waiting for this test

public static BindOperation ApplicationAssemblies()
to finish running as a separate process (since the tests seem to be ran sequentially + FindInLoadContext_DefaultALC_IncompatibleVersion finished + the stack trace has system.diagnostics.process.dll!System.Diagnostics.Process.WaitForExitCore).

I guess the console.writelines from a subprocess don't actually get written immediately, given that we don't see either of these

Console.WriteLine($"[{DateTime.Now:T}] Launching process for {method.Name}...");
using (Process p = Process.Start(startInfo))
{
Console.WriteLine($"Started subprocess {p.Id} for {method.Name}...");
given that the test hangs afterwards.

Also the subprocess dump doesn't get generated even though we are expecting to create a dump for all processes related to corerun.

@hoyosjs any ideas on what to tweak to be able to capture the dump for the hanging subprocess in this case?

@jakobbotsch
Copy link
Member Author

Hit in https://dev.azure.com/dnceng-public/public/_build/results?buildId=765128&view=results

Isn't that a jitstress pipeline? Why did the test get run there if it was marked JitOptimizationSensitive=true in #102842?

The jitstress pipelines runs tests under many different configurations. This particular configuration does not set any of the "jitstress" environment variables, it only sets the following:

set DOTNET_TieredCompilation=0
set DOTNET_EnableAVX=0

@mdh1418
Copy link
Member

mdh1418 commented Nov 19, 2024

From the latest failure's console log, the test spins up a subprocess with PID 920

09:26:42.243 Running test: Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd
Attempting to collect crash dump: C:\cores\crashdump_3068.dmp
Collected crash dump: C:\cores\crashdump_3068.dmp





cmdLine:C:\h\w\A46C08CB\w\BB700A39\e\Loader\Loader\../binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd Timed Out (timeout in milliseconds: 600000 from variable __TestTimeout, start: 11/14/2024 9:26:42 AM, end: 11/14/2024 9:36:42 AM)

Return code:      -100
Raw output file:      C:\h\w\A46C08CB\w\BB700A39\uploads\binding\tracing\BinderTracingTest.ResolutionFlow\output.txt
Raw output:
BEGIN EXECUTION
 "C:\h\w\A46C08CB\p\corerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false" -p "System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization=true"  BinderTracingTest.ResolutionFlow.dll 
[9:26:43 AM] Running AssemblyLoadContextResolving_ReturnNull...
[9:26:44 AM] Invoking AssemblyLoadContextResolving_ReturnNull...
[9:26:44 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
Waiting for bind events for AssemblyToLoad_Subdirectory (50ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test AssemblyLoadContextResolving_ReturnNull finished.
[9:26:45 AM] Running AssemblyLoadContextResolving_LoadAssembly...
[9:26:45 AM] Invoking AssemblyLoadContextResolving_LoadAssembly...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Started
ValidateBindOperation Finished
ValidateBindOperation Finished
Test AssemblyLoadContextResolving_LoadAssembly finished.
[9:26:45 AM] Running AssemblyLoadContextResolving_NameMismatch...
[9:26:45 AM] Invoking AssemblyLoadContextResolving_NameMismatch...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Started
ValidateBindOperation Finished
ValidateBindOperation Finished
Test AssemblyLoadContextResolving_NameMismatch finished.
[9:26:45 AM] Running AssemblyLoadContextResolving_MultipleHandlers...
[9:26:45 AM] Invoking AssemblyLoadContextResolving_MultipleHandlers...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
ValidateBindOperation Started
ValidateBindOperation Started
ValidateBindOperation Finished
ValidateBindOperation Finished
Test AssemblyLoadContextResolving_MultipleHandlers finished.
[9:26:45 AM] Running AppDomainAssemblyResolve_ReturnNull...
[9:26:45 AM] Invoking AppDomainAssemblyResolve_ReturnNull...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test AppDomainAssemblyResolve_ReturnNull finished.
[9:26:45 AM] Running AppDomainAssemblyResolve_LoadAssembly...
[9:26:45 AM] Invoking AppDomainAssemblyResolve_LoadAssembly...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Started
ValidateBindOperation Finished
ValidateBindOperation Finished
Test AppDomainAssemblyResolve_LoadAssembly finished.
[9:26:45 AM] Running AppDomainAssemblyResolve_NameMismatch...
[9:26:45 AM] Invoking AppDomainAssemblyResolve_NameMismatch...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Started
ValidateBindOperation Finished
ValidateBindOperation Finished
Test AppDomainAssemblyResolve_NameMismatch finished.
[9:26:45 AM] Running AppDomainAssemblyResolve_MultipleHandlers...
[9:26:45 AM] Invoking AppDomainAssemblyResolve_MultipleHandlers...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Started
ValidateBindOperation Finished
ValidateBindOperation Finished
Test AppDomainAssemblyResolve_MultipleHandlers finished.
[9:26:45 AM] Launching process for AssemblyLoadFromResolveHandler_LoadDependency...
Started subprocess 4652 for AssemblyLoadFromResolveHandler_LoadDependency...
[9:26:47 AM] Running AssemblyLoadFromResolveHandler_LoadDependency...
[9:26:47 AM] Invoking AssemblyLoadFromResolveHandler_LoadDependency...
[9:26:48 AM] Validating bind operation for AssemblyToLoadDependency, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null...
Waiting for bind events for AssemblyToLoadDependency (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test AssemblyLoadFromResolveHandler_LoadDependency finished.

[9:26:48 AM] Launching process for AssemblyLoadFromResolveHandler_NotTracked...
Started subprocess 3304 for AssemblyLoadFromResolveHandler_NotTracked...
[9:26:49 AM] Running AssemblyLoadFromResolveHandler_NotTracked...
[9:26:50 AM] Invoking AssemblyLoadFromResolveHandler_NotTracked...
[9:26:50 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
Waiting for bind events for AssemblyToLoad_Subdirectory (50ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test AssemblyLoadFromResolveHandler_NotTracked finished.

[9:26:50 AM] Launching process for FindInLoadContext_CustomALC...
Started subprocess 1776 for FindInLoadContext_CustomALC...
[9:26:51 AM] Running FindInLoadContext_CustomALC...
[9:26:52 AM] Invoking FindInLoadContext_CustomALC...
[9:26:52 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test FindInLoadContext_CustomALC finished.

[9:26:53 AM] Launching process for FindInLoadContext_DefaultALC...
Started subprocess 5352 for FindInLoadContext_DefaultALC...
[9:26:54 AM] Running FindInLoadContext_DefaultALC...
[9:26:55 AM] Invoking FindInLoadContext_DefaultALC...
[9:26:55 AM] Validating bind operation for AssemblyToLoad...
Waiting for bind events for AssemblyToLoad (0ms waited) Sleeping for 50ms
Waiting for bind events for AssemblyToLoad (50ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test FindInLoadContext_DefaultALC finished.

[9:26:56 AM] Launching process for FindInLoadContext_DefaultALC_IncompatibleVersion...
Started subprocess 920 for FindInLoadContext_DefaultALC_IncompatibleVersion...
[9:26:57 AM] Running FindInLoadContext_DefaultALC_IncompatibleVersion...
[9:26:58 AM] Invoking FindInLoadContext_DefaultALC_IncompatibleVersion...
[9:26:58 AM] Validating bind operation for AssemblyToLoad, Version=4.3.2.1...
Waiting for bind events for AssemblyToLoad (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test FindInLoadContext_DefaultALC_IncompatibleVersion finished.

cmdLine:C:\h\w\A46C08CB\w\BB700A39\e\Loader\Loader\../binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd Timed Out (timeout in milliseconds: 600000 from variable __TestTimeout, start: 11/14/2024 9:26:42 AM, end: 11/14/2024 9:36:42 AM)
Invoking: C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\cdb.exe -c "$<C:\h\w\A46C08CB\t\tmppytb12.tmp" -z "C:\cores\crashdump_3068.dmp"

and that subprocess appears to complete RunSingleTest based on Test FindInLoadContext_DefaultALC_IncompatibleVersion finished.

Console.WriteLine($"Test {method.Name} finished.");

Moreover, at the end of the console log, there is

[XUnitLogChecker]: 09:39:44.85: The full run will be done.
[XUnitLogChecker]: 09:39:44.85: Item 'Loader' did complete successfully!
ERROR: The process "corerun.exe" with PID 920 could not be terminated.
Reason: There is no running instance of the task.

Could this be a bug with Process.WaitForExit() where the subprocess fails to signal to the parent process that it finished? In any case, its not apparent to me that this is a tracing issue. Its odd that our logic to collect dumps from all childprocesses isn't generating a dump for the PID 920 that the test is supposedly stuck on. Until we collect a crashdump from the child process that the test is stuck on, it'll be hard to figure out what is causing the test to hang.

Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-diagnostics-process
See info in area-owners.md if you want to be subscribed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

No branches or pull requests

3 participants