Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AsyncLock dispose deadlock #57

Closed
MattRyan opened this issue Jun 30, 2016 · 42 comments
Closed

AsyncLock dispose deadlock #57

MattRyan opened this issue Jun 30, 2016 · 42 comments
Assignees
Milestone

Comments

@MattRyan
Copy link

MattRyan commented Jun 30, 2016

I am running into a dead lock when using the AsyncLock. The situation as best as I can determine occurs when an AsyncLock.Lock() is being disposed. The task.WaitAndUnwrapException() call uses an awaiter that releases the current thread. Another tasks then grabs this thread. Until the second task complete or awaits the AsyncLock cannot complete because it requires the original calling context.

I noticed that the overload of WaitAndUnwrapException does not use the GetAwaiter method so it never gives up control of the thread. Would a simple fix be to use this overload in the AsyncLock?

@StephenCleary
Copy link
Owner

I'm not following you. Can you post a unit test or gist code sample that demonstrates the deadlock?

@MattRyan
Copy link
Author

MattRyan commented Jun 30, 2016

I will try but its proving difficult to reproduce outside the system the behavior is observed in.

I was incorrect about where the problem is. It is the when disposing of the result of
this._queue.Enqueue<IDisposable>(this._mutex, cancellationToken).

Looks like the deadlock happens when disposing of the CompletionDisposable. It tries to set the result of the task completion source.

Stack trace to the block.

Nito.AsyncEx.dll!Nito.AsyncEx.DefaultAsyncWaitQueue<System.IDisposable>.CompleteDisposable.Dispose() Line 213 C#
Nito.AsyncEx.dll!Nito.AsyncEx.AsyncLock.ReleaseLock() Line 153 + 0x26 bytes C#
Nito.AsyncEx.dll!Nito.AsyncEx.AsyncLock.Key.Dispose() Line 180 + 0x1c bytes C#

@StephenCleary StephenCleary self-assigned this Jul 1, 2016
@IanYates
Copy link

IanYates commented Jul 3, 2016

Interesting. I have a deadlock (or at least lock build-ups that last a looong time) issue that occurs sporadically at one client site and have been adding a lot of logging to try to track down the source but so far haven't had a lot of success, and they're upending their IT right now so I'm hoping it's more their environment and bad database server causing delays.
Without having the source handy at the moment, I wonder if it's a good idea for me to wrap AsyncLock in an adapter (or perhaps subclass it if appropriate virtuals are available) to intercept calls to .Lock and log the start and completion of lock disposal. That's one area I haven't done any logging around.

@StephenCleary
Copy link
Owner

@IanYates Excellent idea. I actually had a build of AsyncEx with exhaustive ETW tracing for all synchronization primitive operations. Unfortunately, it proved too difficult to make portable in a reasonable fashion.

FYI, there is an Id on every primitive designed for tracing, which is still in the codebase.

@chanti-github
Copy link

Hi Stephen,
I am also facing the same problem "lock build-ups that last a looong time" can you provide the solution please

@StephenCleary
Copy link
Owner

@chanti-github Can you provide code that reproduces the problem?

In the meantime, I suspect that upgrading to the AsyncLock in the AsyncEx.Coordination package would solve the issue.

@chanti-github
Copy link

Sorry this issue is difficult to reproduce i am using the follwoing way

using (await Constants._lock.LockAsync ()) {
try{
var localNotes = default(NotesModel);
if(_notes.LocalEntityID != 0){
localNotes =await _asyncConnection.Table ().Where(p=>p.LocalEntityID == _notes.LocalEntityID && p.Code == _notes.Code).FirstOrDefaultAsync();
}else{
localNotes =await _asyncConnection.Table ().Where(p=>p.entityId == _notes.entityId && p.Code == _notes.Code).FirstOrDefaultAsync();
}
if(localNotes != null){
await _asyncConnection.UpdateAsync(_notes).ContinueWith((t)=>{
_messenger.Publish(new UpdateRecordsCountMessage(this){UpdatedCount = 1});
});
}else{
await _asyncConnection.InsertAsync (_notes).ContinueWith((t)=>{
_messenger.Publish(new UpdateRecordsCountMessage(this){UpdatedCount = 1});
});
}
}catch(Exception ex){
System.Diagnostics.Debug.WriteLine ("Exception While insert/update in categories" + ex.Message);
}
}

@chanti-github
Copy link

@StephenCleary can you please update on this ??

@StephenCleary
Copy link
Owner

@chanti-github Have you tried upgrading to the new version, as I suggested?

Also, if you could develop a minimal, complete, verifiable example, that would enable me to track down the root cause.

@chanti-github
Copy link

Do you want me update Nito.AsyncEx Pacakage ??

@StephenCleary
Copy link
Owner

@chanti-github Try removing Nito.AsyncEx and installing the prerelease version of Nito.AsyncEx.Coordination.

@chanti-github
Copy link

you are saying it is pre release version will i get any issues ??

@IanYates
Copy link

IanYates commented Jul 5, 2016

@chanti-github - well you have deadlocks (or at least lock build-up) at the moment. It can't really get much worse than that.
@StephenCleary is suggesting you try the pre-release to see if it helps. You can easily back out the new package and restore the old one if things don't work out. That's what I plan to do when I literally get back on my feet tomorrow (ankle injury stole the last couple of days of productivity) 👣

@MattRyan
Copy link
Author

MattRyan commented Jul 5, 2016

I have an example that I believe will reproduce the issue. It is occurring when one task uses Lock and another uses LockAsync. The task using the Lock method does not terminate the dispose. If both tasks use Lock or LockAsync exclusively the problem does not occur.

It is important that first task not await the task delay. If task1 awaits thus releasing the thread then task2 will complete.

var asyncLock = new AsyncLock();
var cancellationTokenSource = new CancellationTokenSource();
var cancellationToken = cancellationTokenSource.Token;
var task1 = Task.Run(
    async () =>
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            using (await asyncLock.LockAsync())
            {
                Task.Delay(TimeSpan.FromMilliseconds(10),
                           cancellationToken)
                    .Wait();
            }
        }
    },
    cancellationToken);

var task2 = Task.Run(
    () =>
    {
        using (asyncLock.Lock())
        {
            Task.Delay(TimeSpan.FromSeconds(1))
                .Wait();
        }
    },
    cancellationToken);
Task.WaitAny(task2);
cancellationTokenSource.Cancel();
Task.WaitAny(task1);

@chanti-github
Copy link

Any Updates Please ??

@StephenCleary
Copy link
Owner

@chanti-github Have you tried upgrading to the new version, as I suggested?

@chanti-github
Copy link

No, I will try today and let you know but i am able to see the issue several times

@chanti-github
Copy link

I am not able to add package it is showing this error "Could not install package 'Nito.AsyncEx.Coordination 1.0.0-delta-7'. You are trying to install this package into a project that targets 'portable-net45+win+wp80+MonoAndroid10+xamarinios10+MonoTouch10', but the package does not contain any assembly references or content files that are compatible with that framework. For more information, contact the package author."

@chanti-github
Copy link

i am not able to add package to my solution (mvvmcross)

@StephenCleary
Copy link
Owner

AsyncEx.Coordination does not support Windows 8 store or Windows Phone Silverlight, by design. It targets netstandard1.3, which is supported by .NET 4.6, Universal Windows Platform, Mono/Xamarin platforms, and .NET Core apps.

@StephenCleary
Copy link
Owner

I think I've tracked this down; thanks @MattRyan for the repro. This line should be calling TrySetResultWithBackgroundContinuations instead of TrySetResult.

Not sure when I'll have a fix in the official NuGet package. I'll leave this issue open until I do.

@chanti-github
Copy link

thanks for updates on package can you please let us know when you will move
to nuget this is little needed
On 11-Jul-2016 9:44 PM, "Stephen Cleary" notifications@github.com wrote:

I think I've tracked this down; thanks @MattRyan
https://github.com/MattRyan for the repro. This line


should be calling TrySetResultWithBackgroundContinuations instead of
TrySetResult.

Not sure when I'll have a fix in the official NuGet package. I'll leave
this issue open until I do.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#57 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/ARKqeSi7OwPvQAHDEl7ZVLdKgu6mCN0Tks5qUmvegaJpZM4JCqEI
.

@sarathdev
Copy link

@StephenCleary
I am also facing severe dead lock situation especially in ios can you please help on this i am using nito.async.ex package

@IanYates
Copy link

I've pulled down the source, made the change and, as of yesterday, am using the one-line change in production. So far so good.
The segment of our software where I think we may have run in to this will be exercised in anger tomorrow and over the coming days so I'll be able to declare a win by next week if all goes well :)

(this was less of an upheaval, and I was happier to do the quick production test, than if I switched over to the new AsyncEx package but I still plan to jump ship to it in the coming weeks)

@StephenCleary
Copy link
Owner

@sarathdev @chanti-github I will update AsyncEx as soon as possible. Unfortunately, I do not have the capability to build it at the moment, and it requires multiple specific versions of VS and other tools. Building the build machine will take a couple days of work, and I'm not sure when I'll have that time.\

(My old build machine was my old laptop, which has died. My new build machine will be a VM, so this will hopefully be the last time I'll have to build a build machine).

@IanYates
Copy link

That explains the build.ps1 file - I can see why you're having trouble :) I couldn't run it either as I only have VS 2015 on my current machine.
I ended up switching VS to release mode, building the .NET 4.5 projects, and grabbing the assemblies from the Concurrent project's bin folder.

@ejsmith
Copy link

ejsmith commented Jul 20, 2016

Have you considered just using AppVeyor?

@StephenCleary
Copy link
Owner

@ejsmith I'm not sure if AppVeyor has a build machine that would suffice. My new (netcore) AsyncEx libraries all use AppVeyor, but they only target new platforms.

The legacy AsyncEx requires some extremely legacy tooling. In particular, a modern system (VS2013+) in addition to VS2012 with the SL4 and WP71 SDKs. I made three separate attempts last week to try to set up a build VM. All of them failed, most notably due to the WP71 SDK. I am planning to make two additional attempts - I think it might work if I start with Win7, uninstall all the .NET Framework components, and build it up starting with VS2010 and the legacy SDKs, and then add VS2012 and VS2013 from there. If that doesn't work, I'll try the same thing using Vista (!) as a starting point.

If that doesn't work, then we're screwed. The only real option forward at that point would be to drop support for SL4 and WP71. Which would considerably reduce the build machine setup complexity.

@ejsmith
Copy link

ejsmith commented Jul 20, 2016

That seems like a pretty reasonable thing to me. :-) Both of those platforms are no longer supported. This is an OSS project you do in your spare time. I think it's pretty reasonable to not want to spend your spare time trying to support legacy platforms especially if they add a lot of effort.

Also, people are free to continue using the older packages if they want.

@StephenCleary
Copy link
Owner

@ejsmith Well, I know a good number of AsyncEx consumers are on unsupported platforms - most notably .NET 4. I also know there are SL and WP consumers, though I'm not sure if SL4 and WP71 are necessary - unfortunately, there is no way to track this.

I think dropping SL4/WP71 makes sense, but we'd need to keep net4/net45/sl5/wp80/wpa81/win8 targets. The new AsyncEx does not support any of them. :/

@ejsmith
Copy link

ejsmith commented Jul 20, 2016

Ahh... I didn't realize that this was the legacy repo. :-)

@chanti-github
Copy link

@StephenCleary

I am using httpclient single instance approach but i am getting the task was canclled exception in times can you please help me on this ??

@StephenCleary
Copy link
Owner

@chanti-github I don't believe that has anything to do with this issue; please post a question to Stack Overflow.

@chanti-github
Copy link

@StephenCleary i saw so many questions related to this so i did not get any answers with curiosity i asked.

@IanYates
Copy link

FYI, the suggested fix is still working well for me.
It turns out I also had some blocking issues arising from SignalR's internals (Groups.Add tends to freeze if you add too many at a time - from the end user's perspective it's the same "nothing works" end result) but I am still quite sure this fix has been a net positive.

@StephenCleary
Copy link
Owner

Should be fixed in 4.0.0-pre.

@StephenCleary StephenCleary added this to the 4.0.0 milestone Dec 14, 2016
@Allon-Guralnek
Copy link

This problem has hit us in production, with Nito.AsyncEx 4.0.1.0. We were using only .Lock() (and not .LockAsync()) in an AspNetCore application running on .NET 4.5.1. The reason we were using AsyncLock instead of just the lock keyword is because there once was another place where LockAsync() was used, but it was removed, leaving only the sync version.

When load suddenly spiked, the number of threads started rising and rising, and certain requests weren't being returned. It eventually reached thousands of threads. We took a dump, from it we could see the problem.

Most of the threads (thousands) were waiting on AsyncLock.Lock() with the following stack trace:

ntdll!NtWaitForMultipleObjects+a 
KERNELBASE!WaitForMultipleObjectsEx+e8 
kernel32!WaitForMultipleObjectsExImplementation+b3 
clr!WaitForMultipleObjectsEx_SO_TOLERANT+62 
clr!Thread::DoAppropriateAptStateWait+53 
clr!Thread::DoAppropriateWaitWorker+186 
clr!Thread::DoAppropriateWait+7d 
clr!CLREventBase::WaitEx+c0 
clr!Thread::Block+1e 
clr!SyncBlock::Wait+150 
[[GCFrame]] 
clr!ObjectNative::WaitTimeout+c7 
[[HelperMethodFrame_1OBJ] (System.Threading.Monitor.ObjWait)] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object) 
mscorlib_ni!System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)+3ec 
mscorlib_ni!System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)+db 
mscorlib_ni!System.Threading.Tasks.Task.InternalWait(Int32, System.Threading.CancellationToken)+24a 
mscorlib_ni!System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)+ddf400 
mscorlib_ni!System.Runtime.CompilerServices.TaskAwaiter`1[[System.__Canon, mscorlib]].GetResult()+31 
Nito.AsyncEx.Synchronous.TaskExtensions.WaitAndUnwrapException[[System.__Canon, mscorlib]](System.Threading.Tasks.Task`1<System.__Canon>)+64 
Nito.AsyncEx.AsyncLock.Lock(System.Threading.CancellationToken)+e5 
My.AspNetCoreService.DoSomething(...)+46 

And a single thread was stuck here:

ntdll!NtWaitForMultipleObjects+a 
KERNELBASE!WaitForMultipleObjectsEx+e8 
kernel32!WaitForMultipleObjectsExImplementation+b3 
clr!WaitForMultipleObjectsEx_SO_TOLERANT+62 
clr!Thread::DoAppropriateAptStateWait+53 
clr!Thread::DoAppropriateWaitWorker+186 
clr!Thread::DoAppropriateWait+7d 
clr!CLREventBase::WaitEx+c0 
clr!Thread::Block+1e 
clr!SyncBlock::Wait+150 
[[GCFrame]] 
clr!ObjectNative::WaitTimeout+c7 
[[HelperMethodFrame_1OBJ] (System.Threading.Monitor.ObjWait)] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object) 
mscorlib_ni!System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)+3ec 
mscorlib_ni!System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)+db 
mscorlib_ni!System.Threading.Tasks.Task.InternalWait(Int32, System.Threading.CancellationToken)+24a 
mscorlib_ni!System.Threading.Tasks.Task.Wait(Int32, System.Threading.CancellationToken)+44 
mscorlib_ni!System.Threading.Tasks.Task.Wait()+11 
Nito.AsyncEx.DefaultAsyncWaitQueue`1+CompleteDisposable[[System.__Canon, mscorlib]].Dispose()+69 
Nito.AsyncEx.AsyncLock.ReleaseLock()+fa 
My.AspNetCoreService.DoSomething(...)+132 

After we switch to the regular lock keyword, the problem there no longer reproduced, but then another location where we used LockAsync started show the same issue. This time there were mixed Lock() and LockAsync(). In this case it's not possible to switch to the regular lock because there are awaits inside the lock. We're rewriting these sections to so that the awaits are outside the lock.

Unfortunately we don't have a repro to share.

Is Nito.AsyncEx.Coordination still relevant? It seems Nito.AsyncEx has a newer version on nuget.org.

@StephenCleary
Copy link
Owner

Short answer: yes, Nito.AsyncEx.Coordination is still relevant; it will be part of Nito.AsyncEx v5. Is it possible for you to upgrade to Nito.AsyncEx.Coordination?

@Allon-Guralnek
Copy link

Allon-Guralnek commented Mar 23, 2017

We have a theory about why this happened. It seems due to a spike in traffic or a hot start of a server, many requests come in all at once, each represented by a delegate task enqueued to the thread pool. Unfortunately, they all require a single dependency (or "shared resource" to use your term) that needs to be initialized first. This initialization takes about a second, during which tens of thousands of requests have queued up in the thread pool. Near the end of the initialization, we have a small, trivial piece of code (nothing blocking) that is protected by an AsyncLock. Only the first thread that reaches it enters it, the rest are waiting in a queue. If the initialization of the shared resource has already been performed, it is skipped, but that part with AsyncLock is never skipped and is executed by all threads, regardless.

When the lock is released via .ReleaseLock(), the notification for the releasing the next awaiter in the queue is posted to the thread-pool (via the DefaultAsyncWaitQueue<T>.CompleteDisposable class) . But the thread pool is full of tens of thousands of threads blocked on .Lock() and it only has, lets say 16 threads on a specific machine, so the continuation isn't executed. The thread pool increases the thread count when needed, but at an excruciatingly slow pace (I think 1 thread per second), and won't reach that unlocking task at the end of a ten-thousand long queue within a reasonable time.

Granted, we shouldn't have used .Lock() in an async context, but we were constrained by existing interfaces. We didn't think it would cause such a disaster, especially since the code inside the AsyncLock was non-blocking, non-awaiting, and should have completed in microseconds. When replacing it with a regular lock, the issue disappeared because lock doesn't depend on the thread-pool to release a lock the next thread in the queue.

Does the AsyncLock from Nito.AsyncEx.Coordination mitigate this issue in any way? This could affect anyone who uses the sync parts of AsyncLock.

@Allon-Guralnek
Copy link

Here's a simple repro:

static AsyncLock l = new AsyncLock();

void Main()
{
	Console.WriteLine("Starting tasks...");
	var tasks = Enumerable.Range(0, 100).Select(i => Task.Run(() => Do1(i))).ToArray();
	
	Console.WriteLine("Tasks started. Waiting for all to complete...");
	Task.WaitAll(tasks);

	Console.WriteLine("All done!");
}

void Do1(int i)
{
	if (i == 0)
		Thread.Sleep(100);

	using (l.Lock())
	{
		// mutate shared state atomically
	}
		
	Console.WriteLine("Done: " + i);
}

void Do2(int i)
{
	if (i == 0)
		Thread.Sleep(100);

	lock (l)
	{
		// mutate shared state atomically
	}

	Console.WriteLine("Done: " + i);
}

If you have Main() use Do1() which uses AsyncLock it will block for a long time. If you use Do2() which uses lock, it will complete immediately.

@StephenCleary
Copy link
Owner

Thanks for the repro. I've confirmed that the problem does not occur when using AsyncEx.Coordination. This is true even if you use AsyncContext to more realistically emulate a non-core ASP.NET scenario:

var tasks = Enumerable.Range(0, 100).Select(i => Task.Run(() => AsyncContext.Run(() => Do1(i)))).ToArray();

However, asynchronous coordination primitives - by their nature - are sensitive to thread pool exhaustion in some scenarios. If you can use lock, you should.

@IanYates
Copy link

IanYates commented Sep 5, 2017

The repro provided by @Allon-Guralnek will cause the issue for me.

In LinqPad I've got this

AsyncLock l = new AsyncLock();

void Main()
{
	Task.Run(() =>
	{

		Console.WriteLine("Starting tasks...");
		var tasks = Enumerable.Range(0, 100).Select(i => Task.Run(() => Do1(i))).ToArray();

		Console.WriteLine("Tasks started. Waiting for all to complete...");
		Task.WaitAll(tasks);

		Console.WriteLine("All done!");
	}).Wait();
}

void Do1(int i)
{
	if (i == 0)
		Thread.Sleep(100);

	using (l.Lock())
	{
		// mutate shared state atomically
	}

	Console.WriteLine("Done: " + i);
}

void Do2(int i)
{
	if (i == 0)
		Thread.Sleep(100);

	lock (l)
	{
		// mutate shared state atomically
	}

//	Console.WriteLine("Done: " + i);
}

You'll note that I've got a Task.Run() in the main() method to get it started.

Sometimes it runs through just fine. Other times it'll stop.

Example output

Starting tasks...
Done: 1
Tasks started. Waiting for all to complete...
Done: 2
Done: 99

and 1 minute 13 seconds later I'm still nowhere :(
After 1 minute 28 seconds it suddenly finished

Starting tasks...
Done: 1
Tasks started. Waiting for all to complete...
Done: 2
Done: 99
Done: 4
Done: 3
Done: 6
Done: 7
Done: 8
Done: 5
Done: 9
Done: 98
Done: 0
Done: 10
Done: 11
Done: 12
Done: 13
Done: 14
{{snip}}All the rest were in order{{/snip}}

But I can hit stop & start and get it working again. 9 times out of 10 it works properly. Removing the Console.WriteLine inside the loop seems to make it fail more often.

With LinqPad I can quickly swap between
Nito.AsyncEx.Coordination NuGet package - this made the poor output above
Nito.AsyncEx NuGet package v3.x - works
Nito.AysyncEx NuGet package v4.x (strong named one was convenient to use) - failed first go with no output at all :(
Nito.AsyncEx v3.x with one line patched from earlier - works

So this appears to be a regression where v3.x - with the one-line patch from earlier in this issue - is good but AsyncEx v4.x and v5.x are both bad.

It's a tricky one as it's intermittent and does eventually come good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants