
Deadlock observed during network issues accessing Redis #2376

Closed
jmough opened this issue Feb 17, 2023 · 6 comments

@jmough

jmough commented Feb 17, 2023

Hi,

We recently observed a deadlock in one of our applications that reaches out to Redis. In both occurrences, an issue on one of the Redis replicas took that replica offline for several minutes. When it came back, the services did not recover gracefully, and after capturing a crash dump, DebugDiag highlighted the following deadlock (I added some notes on the lock locations in bold):

Thread 1:

[[GCFrame]]
[[GCFrame]]
[[HelperMethodFrame] (System.Threading.Monitor.Enter)] System.Threading.Monitor.Enter(System.Object) **Attempts to lock on _writtenAwaitingResponse; deadlocks here**
StackExchange.Redis.PhysicalConnection.GetHeadMessages(StackExchange.Redis.Message ByRef, StackExchange.Redis.Message ByRef)+47
StackExchange.Redis.ExceptionFactory.AddCommonDetail(System.Collections.Generic.List`1, System.Text.StringBuilder, StackExchange.Redis.Message, StackExchange.Redis.ConnectionMultiplexer, StackExchange.Redis.ServerEndPoint)+78
StackExchange.Redis.ExceptionFactory.Timeout(StackExchange.Redis.ConnectionMultiplexer, System.String, StackExchange.Redis.Message, StackExchange.Redis.ServerEndPoint, System.Nullable`1)+4b3
StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[[StackExchange.Redis.RedisValue, StackExchange.Redis]](StackExchange.Redis.Message, StackExchange.Redis.ResultProcessor`1, StackExchange.Redis.ServerEndPoint, StackExchange.Redis.RedisValue)+199 **Grabs the lock on the result box**
StackExchange.Redis.RedisBase.ExecuteSync[[StackExchange.Redis.RedisValue, StackExchange.Redis]](StackExchange.Redis.Message, StackExchange.Redis.ResultProcessor`1, StackExchange.Redis.ServerEndPoint, StackExchange.Redis.RedisValue)+af
StackExchange.Redis.RedisDatabase.StringGet(StackExchange.Redis.RedisKey, StackExchange.Redis.CommandFlags)+e5

Thread 2:

[[HelperMethodFrame] (System.Threading.Monitor.Enter)] System.Threading.Monitor.Enter(System.Object) **Trying to lock the result box**
StackExchange.Redis.SimpleResultBox.StackExchange.Redis.IResultBox.ActivateContinuations()+2f
StackExchange.Redis.Message.Complete()+4c
StackExchange.Redis.PhysicalConnection.RecordConnectionFailed(StackExchange.Redis.ConnectionFailureType, System.Exception, System.String, Boolean, System.IO.Pipelines.IDuplexPipe)+e57 **This frame grabs a lock on _writtenAwaitingResponse**
StackExchange.Redis.PhysicalConnection+d__115.MoveNext()+8c9
[[HelperMethodFrame]]
mscorlib_ni!System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()+20
System.IO.Pipelines.PipeCompletion.ThrowLatchedException()+e
System.IO.Pipelines.Pipe.GetReadResult(System.IO.Pipelines.ReadResult ByRef)+46
System.IO.Pipelines.Pipe.GetReadAsyncResult()+e0
StackExchange.Redis.PhysicalConnection+d__115.MoveNext()+558
mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+172
mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+15
mscorlib_ni!System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.Run()+6f
Pipelines.Sockets.Unofficial.DedicatedThreadPoolPipeScheduler.Execute(System.Action`1, System.Object)+1e
Pipelines.Sockets.Unofficial.DedicatedThreadPoolPipeScheduler.RunWorkLoop()+11a
mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+172
mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+15
mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+55
mscorlib_ni!System.Threading.ThreadHelper.ThreadStart(System.Object)+60

In the second thread, it looks like that thread grabs a lock on _writtenAwaitingResponse in PhysicalConnection.cs and then gets stuck on the lock in SimpleResultBox.ActivateContinuations. The first thread, meanwhile, has already grabbed the continuations lock and is trying to grab the _writtenAwaitingResponse lock in GetHeadMessages.
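
To illustrate the ordering, here is a minimal stand-alone sketch of the same lock-order inversion; the names are made up and this is not the library's code, it just reproduces the pattern: one thread takes the result-box lock then the queue lock, the other takes them in the opposite order, and each ends up waiting on the lock the other already holds.

```csharp
// Minimal stand-alone sketch of a lock-order inversion (illustrative names only).
using System;
using System.Threading;

class LockOrderInversionDemo
{
    static readonly object QueueLock = new object();      // stands in for _writtenAwaitingResponse
    static readonly object ResultBoxLock = new object();  // stands in for the result box / continuations lock

    static void Main()
    {
        var threadA = new Thread(() =>
        {
            lock (ResultBoxLock)            // like ExecuteSyncImpl holding the result box
            {
                Thread.Sleep(100);          // widen the race window
                lock (QueueLock)            // like GetHeadMessages taking the queue lock
                    Console.WriteLine("Thread A got both locks");
            }
        });

        var threadB = new Thread(() =>
        {
            lock (QueueLock)                // like RecordConnectionFailed taking the queue lock
            {
                Thread.Sleep(100);
                lock (ResultBoxLock)        // like Message.Complete -> ActivateContinuations
                    Console.WriteLine("Thread B got both locks");
            }
        });

        threadA.Start();
        threadB.Start();
        threadA.Join();                     // never returns: the two threads are deadlocked
        threadB.Join();
    }
}
```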

We are currently on library version 2.6.66, and the application is a .NET Framework 4.8 Windows Service.

Thank you!

@mgravell
Collaborator

mgravell commented Feb 17, 2023 via email

mgravell added a commit that referenced this issue Feb 20, 2023
1. To fix the immediate scenario: don't hold the queue lock when we abort things - only hold it when fetching the next message.
2. To avoid similar, not-yet-seen issues: in GetHeadMessages, don't blindly wait forever.

Also standardise on TryPeek/TryDequeue.
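
A rough sketch of what point 1 describes, with made-up types and members rather than the actual PR diff: drain the queue under its lock, then fail/complete the messages after releasing it, so completing a message can never need a second lock while the queue lock is held.

```csharp
// Illustrative sketch only (hypothetical PendingMessage/AbortPending, not StackExchange.Redis code).
using System;
using System.Collections.Generic;

class PendingMessage
{
    public void Fail(Exception reason)
    {
        // set the exception and activate continuations (this is where the result-box lock would be taken)
    }
}

class ConnectionSketch
{
    private readonly Queue<PendingMessage> _writtenAwaitingResponse = new Queue<PendingMessage>();

    public void AbortPending(Exception reason)
    {
        PendingMessage[] snapshot;
        lock (_writtenAwaitingResponse)
        {
            // hold the queue lock only long enough to drain it
            snapshot = _writtenAwaitingResponse.ToArray();
            _writtenAwaitingResponse.Clear();
        }

        // complete outside the queue lock, so this path never holds both locks at once
        foreach (var message in snapshot)
        {
            message.Fail(reason);
        }
    }
}
```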
@mgravell
Collaborator

PR: #2378

@mgravell
Collaborator

What is intriguing to me: I'm assuming that this error is rare, but that could be entirely wrong - it could be happening far more often than we ever realized, and until this ticket we were simply dismissing it as noise. Anyhow, the PR above addresses this in three different ways: it fixes each of the two paths above so that neither requires both locks simultaneously, and it changes a more general exception-detail scenario to use a lock-with-timeout, in case there are other as-yet-unidentified routes.
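
As a hedged sketch of the lock-with-timeout idea on the exception-detail path (hypothetical member names, not the shipped implementation), Monitor.TryEnter with a short timeout lets the diagnostic code give up and build the exception without the head-message detail rather than block forever when the queue lock is contended.

```csharp
// Sketch of a lock-with-timeout around peeking the head message; the class, method and
// field names here are illustrative stand-ins, not the library's real members.
using System.Collections.Generic;
using System.Threading;

class HeadMessageSketch
{
    private readonly Queue<string> _writtenAwaitingResponse = new Queue<string>();

    public bool TryGetHeadMessage(out string head, int timeoutMilliseconds = 10)
    {
        head = null;
        bool taken = false;
        try
        {
            // Don't blindly wait forever: if another thread holds the lock,
            // give up after a short timeout and report without this detail.
            Monitor.TryEnter(_writtenAwaitingResponse, timeoutMilliseconds, ref taken);
            if (!taken) return false;

            if (_writtenAwaitingResponse.Count > 0)
            {
                head = _writtenAwaitingResponse.Peek();
                return true;
            }
            return false;
        }
        finally
        {
            if (taken) Monitor.Exit(_writtenAwaitingResponse);
        }
    }
}
```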

Great report, thanks; I propose we deploy this one without too much delay.

@mgravell
Collaborator

2.6.96 is now available on MyGet: https://github.com/StackExchange/StackExchange.Redis/#package-status

@mgravell
Collaborator

2.6.96 is now available on NuGet; I don't know how often this problem was occurring for you, but it's probably worth updating!

@jmough
Author

jmough commented Feb 21, 2023

Thanks for looking into this so quickly! We'll go ahead and update soon.
