Possible deadlock in System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister #48342

Closed
n777ty opened this issue Feb 16, 2021 · 5 comments · Fixed by #56453
@n777ty

n777ty commented Feb 16, 2021

We are observing a possible deadlock (at minimum it's an infinite hang) as follows:

  • The process is shutting down
0:000> k
 # Child-SP          RetAddr           Call Site
00 000000ed`af77e808 00007ffa`acc98c1f ntdll!ZwWaitForMultipleObjects+0x14
01 000000ed`af77e810 00007ffa`829f6fb6 KERNELBASE!WaitForMultipleObjectsEx+0xef
02 (Inline Function) --------`-------- coreclr!Thread::DoAppropriateAptStateWait+0x44
03 000000ed`af77eb10 00007ffa`829f6d35 coreclr!Thread::DoAppropriateWaitWorker+0x242
04 000000ed`af77ec10 00007ffa`829f6355 coreclr!Thread::DoAppropriateWait+0x89
05 (Inline Function) --------`-------- coreclr!CLREventBase::WaitEx+0x53
06 000000ed`af77ec90 00007ffa`829f33f9 coreclr!CLREventBase::Wait+0x59
07 000000ed`af77ece0 00007ffa`829f4487 coreclr!FinalizerThread::RaiseShutdownEvents+0x59
08 000000ed`af77ed10 00007ffa`829f4213 coreclr!EEShutDownHelper+0x23b
09 000000ed`af77ef20 00007ffa`829f3070 coreclr!EEShutDown+0x73
0a 000000ed`af77ef70 00007ffa`82a8b051 coreclr!CorHost2::UnloadAppDomain2+0x40
0b 000000ed`af77efa0 00007ffa`924b5181 coreclr!coreclr_shutdown_2+0x61
0c 000000ed`af77f000 00007ffa`924c2e61 hostpolicy!coreclr_t::shutdown+0x61
0d 000000ed`af77f040 00007ffa`924c2a36 hostpolicy!run_app_for_context+0x3c1
0e 000000ed`af77f1a0 00007ffa`924c4262 hostpolicy!run_app+0x46
0f 000000ed`af77f1f0 00007ffa`90550653 hostpolicy!corehost_main+0x132
10 000000ed`af77f3a0 00007ffa`905537c8 hostfxr!execute_app+0x2e3
11 000000ed`af77f490 00007ffa`905527cd hostfxr!`anonymous namespace'::read_config_and_execute+0xa8
12 000000ed`af77f5a0 00007ffa`9055027c hostfxr!fx_muxer_t::handle_exec_host_command+0x15d
13 000000ed`af77f630 00007ffa`9054bd0c hostfxr!fx_muxer_t::execute+0x1fc
14 000000ed`af77f780 00007ff6`3f222361 hostfxr!hostfxr_main_startupinfo+0x9c
  • During shutdown, cleanup is taking place on the Finalizer thread
   2  Id: 2be8.50a8 Suspend: 0 Teb: 000000ed`af4dd000 Unfrozen
 # Child-SP          RetAddr           Call Site
00 000000ed`afbff3d8 00007ffa`afb809a1 ntdll!ZwWaitForAlertByThreadId+0x14
01 000000ed`afbff3e0 00007ffa`afbb9175 ntdll!RtlAcquireSRWLockExclusive+0x121
02 (Inline Function) --------`-------- ntdll!EtwpWaitForRegistrationRundownRelease+0xc
03 000000ed`afbff440 00007ffa`823d767c ntdll!EtwNotificationUnregister+0x55
04 000000ed`afbff470 00007ffa`823d75f8 System_Private_CoreLib!System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister(Int64)+0x3c
05 000000ed`afbff540 00007ffa`823d74c1 System_Private_CoreLib!System.Diagnostics.Tracing.EventProvider.Dispose(Boolean)+0x88
06 000000ed`afbff5a0 00007ffa`823d1c25 System_Private_CoreLib!System.Diagnostics.Tracing.EventSource.Dispose(Boolean)+0x51
07 000000ed`afbff5e0 00007ffa`823d19cf System_Private_CoreLib!System.Diagnostics.Tracing.EventListener.DisposeOnShutdown(System.Object, System.EventArgs)+0xb5
08 000000ed`afbff650 00007ffa`82a96ba3 System_Private_CoreLib!System.AppContext.OnProcessExit()+0x3f
09 000000ed`afbff690 00007ffa`82a2cb82 coreclr!CallDescrWorkerInternal+0x83
  • Finalizer thread is blocked trying to acquire SRWLock

  • Finalizer thread owns another lock itself

0:002> !syncblk
Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner
   43 000002311EEE6C88            3         1 000002311EE3EC60 50a8   2   000002311f7e37e8 System.Collections.Generic.List`1[[System.WeakReference, System.Private.CoreLib]]
  • Another thread is waiting on Finalizer's lock
   6  Id: 2be8.5150 Suspend: 0 Teb: 000000ed`af4f5000 Unfrozen
 # Child-SP          RetAddr           Call Site
00 000000ed`b0afe618 00007ffa`acc98c1f ntdll!ZwWaitForMultipleObjects+0x14
01 000000ed`b0afe620 00007ffa`829f6fb6 KERNELBASE!WaitForMultipleObjectsEx+0xef
02 (Inline Function) --------`-------- coreclr!Thread::DoAppropriateAptStateWait+0x44
03 000000ed`b0afe920 00007ffa`829f6d35 coreclr!Thread::DoAppropriateWaitWorker+0x242
04 000000ed`b0afea20 00007ffa`829f6355 coreclr!Thread::DoAppropriateWait+0x89
05 (Inline Function) --------`-------- coreclr!CLREventBase::WaitEx+0x53
06 000000ed`b0afeaa0 00007ffa`829f60e9 coreclr!CLREventBase::Wait+0x59
07 000000ed`b0afeaf0 00007ffa`829f5efa coreclr!AwareLock::EnterEpilogHelper+0x1d5
08 000000ed`b0afec50 00007ffa`829f5eaa coreclr!AwareLock::EnterEpilog+0x4a
09 000000ed`b0afecb0 00007ffa`829f8851 coreclr!AwareLock::Enter+0xb2
0a (Inline Function) --------`-------- coreclr!SyncBlock::EnterMonitor+0x8
0b (Inline Function) --------`-------- coreclr!ObjHeader::EnterObjMonitor+0xd
0c (Inline Function) --------`-------- coreclr!Object::EnterObjMonitor+0x16
0d 000000ed`b0afece0 00007ffa`829f89a0 coreclr!JIT_MonEnter_Helper+0x145
0e 000000ed`b0afee90 00007ffa`823c3414 coreclr!JIT_MonReliableEnter_Portable+0xe0
0f 000000ed`b0afeed0 00007ffa`826114d0 System_Private_CoreLib!System.Diagnostics.Tracing.EventSource.SendCommand(System.Diagnostics.Tracing.EventListener, System.Diagnostics.Tracing.EventProviderType, Int32, Int32, System.Diagnostics.Tracing.EventCommand, Boolean, System.Diagnostics.Tracing.EventLevel, System.Diagnostics.Tracing.EventKeywords, System.Collections.Generic.IDictionary`2<System.String,System.String>)+0xa4
10 000000ed`b0afef50 00007ffa`82609cc3 System_Private_CoreLib!System.Diagnostics.Tracing.EventSource+OverideEventProvider.OnControllerCommand(System.Diagnostics.Tracing.ControllerCommand, System.Collections.Generic.IDictionary`2<System.String,System.String>, Int32, Int32)+0x50
11 000000ed`b0afefc0 00007ffa`237e49d6 System_Private_CoreLib!System.Diagnostics.Tracing.EventProvider.EtwEnableCallBack(System.Guid ByRef, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*, Void*)+0x2f3
12 000000ed`b0aff0b0 00007ffa`82a96cbb System_Private_CoreLib!ILStubClass.IL_STUB_ReversePInvoke(IntPtr, Int32, Byte, Int64, Int64, Int64, Int64)+0x56
13 000000ed`b0aff110 00007ffa`afb8aa1d coreclr!UMThunkStub+0x6b
14 000000ed`b0aff1c0 00007ffa`afb8a6a5 ntdll!EtwpEventApiCallback+0xbd
15 000000ed`b0aff250 00007ffa`afbb3cfb ntdll!EtwpUpdateEnableInfoAndCallback+0xb9
16 000000ed`b0aff290 00007ffa`afbb3abb ntdll!EtwpProcessNotification+0x43
17 000000ed`b0aff2c0 00007ffa`afbb37ec ntdll!EtwDeliverDataBlock+0x8b
18 000000ed`b0aff320 00007ffa`afb9a800 ntdll!EtwpNotificationThread+0x6c
19 000000ed`b0aff490 00007ffa`afb7ed03 ntdll!TppExecuteWaitCallback+0x8c
1a 000000ed`b0aff4d0 00007ffa`ae7784d4 ntdll!TppWorkerThread+0x923
1b 000000ed`b0aff8d0 00007ffa`afbb1791 kernel32!BaseThreadInitThunk+0x14
1c 000000ed`b0aff900 00000000`00000000 ntdll!RtlUserThreadStart+0x21
  • Both of these threads are dealing with the System.Threading.Tasks.TplEventSource event provider.

  • Shutdown never completes due to this lock-up (a sequence of dumps confirms it). The sketch below shows the lock-order inversion in miniature.
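
To make the lock-order inversion concrete, here is a minimal, hypothetical C# sketch of the same ABBA pattern. Everything in it (the class name, the two lock objects, the sleeps) is an illustrative stand-in rather than runtime code: one thread models the Finalizer path that holds the listener lock and then needs ETW's registration lock, the other models the ETW notification callback that holds ETW's lock and then needs the listener lock. Running it hangs forever, just like the shutdown above.

```csharp
using System;
using System.Threading;

class LockInversionDemo
{
    // Stand-in for EventListener.EventListenersLock (managed lock).
    static readonly object ListenerLock = new object();
    // Stand-in for ETW's native registration/rundown lock.
    static readonly object EtwRundownLock = new object();

    static void Main()
    {
        // "Finalizer" path: listener lock first, then ETW lock (the EventUnregister call).
        var finalizerLike = new Thread(() =>
        {
            lock (ListenerLock)
            {
                Thread.Sleep(100);            // widen the race window
                lock (EtwRundownLock)         // blocks: the other thread owns it
                {
                    Console.WriteLine("EventUnregister finished");
                }
            }
        });

        // "ETW callback" path: ETW lock first, then listener lock (EtwEnableCallBack -> SendCommand).
        var etwCallbackLike = new Thread(() =>
        {
            lock (EtwRundownLock)
            {
                Thread.Sleep(100);
                lock (ListenerLock)           // blocks: the other thread owns it
                {
                    Console.WriteLine("SendCommand finished");
                }
            }
        });

        finalizerLike.Start();
        etwCallbackLike.Start();

        // Neither Join ever returns: a classic ABBA deadlock, mirroring the hung shutdown.
        finalizerLike.Join();
        etwCallbackLike.Join();
    }
}
```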

@dotnet-issue-labeler bot added the area-Tracing-coreclr and untriaged labels Feb 16, 2021
@sywhang self-assigned this Feb 16, 2021
@sywhang added the bug label and removed the untriaged label Feb 16, 2021
@sywhang
Contributor

sywhang commented Feb 16, 2021

Thanks for reporting this @n777ty. I've identified the issue to be with the way we hold the EventListener lock inside EventSource. I'll follow up with you on additional details offline.

@gasridha

gasridha commented Jul 7, 2021

@sywhang I have a similar intermittent deadlock in an Azure WebJob process: during shutdown, the Finalizer is blocked in System_Private_CoreLib!System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister, which seems to be waiting on another thread that is itself waiting on the Finalizer. It's using .NET Core 3.1. Is there any workaround (other than killing the process)? Also, is there a possibility of fixing it in .NET Core 3.1?

@sywhang
Contributor

sywhang commented Jul 7, 2021

@gasridha There isn't any workaround. If you're the one starting the ETW session, then you might be able to replace it with other tracing alternatives like dotnet-trace as long as you don't need native stacks, but if the session is being started by something else, I don't think there's anything you can do.

Re: a backport to 3.1, I'm no longer on the .NET team so I can't answer that, but @josalem should be able to.

@noahfalk self-assigned this Jul 28, 2021
@noahfalk
Member

The bug appears to be that we are holding the lock while calling Dispose() here. We've already resolved a similar issue in the past and went out of our way in EventProvider.Dispose() to avoid having the lock held. However, because the lock is grabbed outside of Dispose(), we still end up calling into ETW with the lock held, defeating that intent.

noahfalk added a commit to noahfalk/runtime that referenced this issue Jul 28, 2021
Fixes dotnet#48342

A deadlock was occurring because we held the EventListenersLock
while calling into EventUnregister which will take ETW's own native
lock. In the case of ETW sending an enable/disable notification
these locks are taken in reverse order which triggers a deadlock.

The fix is to ensure that we always order the locks so that any code
path taking both locks always takes the ETW lock first. In this case
it meant accumulating the list of event sources to dispose under
the lock but then exiting the lock prior to calling Dispose() which
will eventually call EventUnregister.
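
As an illustration only, here is a hedged C# sketch of the pattern the commit message describes, not the actual change in the runtime; ShutdownCleanupSketch, FakeSource, ListenerLock, and Sources are hypothetical stand-ins for EventListener.DisposeOnShutdown, the tracked EventSource list, and EventListenersLock. The key move is snapshotting under the lock and disposing outside it, so Dispose() (and hence EventUnregister and ETW's own lock) never runs with the managed lock held.

```csharp
using System;
using System.Collections.Generic;

class ShutdownCleanupSketch
{
    // Stand-ins for EventListener.EventListenersLock and the tracked EventSource list.
    static readonly object ListenerLock = new object();
    static readonly List<IDisposable> Sources = new List<IDisposable>();

    sealed class FakeSource : IDisposable
    {
        // In the real code this is where EventUnregister (and ETW's native lock) comes in.
        public void Dispose() => Console.WriteLine("unregistering provider (no managed lock held)");
    }

    static void DisposeOnShutdown()
    {
        // 1. Snapshot the list while holding the lock, so list mutation stays protected.
        List<IDisposable> toDispose;
        lock (ListenerLock)
        {
            toDispose = new List<IDisposable>(Sources);
            Sources.Clear();
        }

        // 2. Dispose after releasing the lock. Any ETW lock taken inside Dispose() is
        //    then acquired without the listener lock held, so the only combined ordering
        //    left is "ETW lock, then listener lock" (the callback path), and the ABBA
        //    inversion disappears.
        foreach (var source in toDispose)
        {
            source.Dispose();
        }
    }

    static void Main()
    {
        lock (ListenerLock) { Sources.Add(new FakeSource()); }
        DisposeOnShutdown();
    }
}
```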
@ghost added the in-pr label Jul 28, 2021
noahfalk added a commit that referenced this issue Jul 29, 2021
Fixes #48342

@ghost removed the in-pr label Jul 29, 2021
@noahfalk
Member

Fixed in #56453

@ghost locked as resolved and limited conversation to collaborators Aug 28, 2021