Persistent actors are still stuck after network failure #114

Closed
object opened this issue May 23, 2019 · 13 comments

@object
Contributor

object commented May 23, 2019

This might be the same as #104, which was supposed to be fixed in 1.3.13. However, we upgraded to 1.3.13 and this error seems to occur even more often than before. Here are the symptoms:

  1. A temporary network failure occurs and lasts for a short time; we see some errors in our logs (for example, an error from RabbitMQ).
  2. Around this time we see a number of "Failed to execute chunk for 1 requests" error messages from akka://Oddjob/system/akka.persistence.journal.sql-server.
  3. A few minutes after that (when the network is OK again) we see a few hundred "Circuit Breaker is open; calls are failing fast" errors from our persistent actors that are trying to recover their state.
  4. Then there is a continuous stream of "Recovery timed out, didn't get event within 60s, highest sequence number seen 0." messages that lasts until we restart the node.
  5. After restarting the node, things go back to normal.
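
For context, the knobs involved here seem to be the journal's circuit breaker and the recovery event timeout. Roughly, they live here in HOCON (the circuit-breaker values shown are the defaults as far as I know, not necessarily what we run in production; recovery-event-timeout is shown at 60s only to match the timeout in the error message above):

akka.persistence.journal.sql-server {
  # Circuit breaker guarding journal calls; while it is open, reads and writes
  # fail fast with OpenCircuitException until reset-timeout has elapsed.
  circuit-breaker {
    max-failures = 10
    call-timeout = 10s
    reset-timeout = 30s
  }
  # How long a recovering actor waits for the next replayed event before the
  # "Recovery timed out, didn't get event within ..." error is raised.
  recovery-event-timeout = 60s
}
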
@object
Contributor Author

object commented May 29, 2019

I ran more tests and can confirm that if I temporarily stop SQL Server, persistent actor recovery keeps failing with the error "Recovery timed out, didn't get event within 60s, highest sequence number seen 0." even after the server is restarted.

Since this issue is quite high priority for us, just let us know if you need more information or want us to test something.

@ismaelhamed
Member

@Horusiath Looks like the BatchingSqlJournal.FailChunkExecution method is not producing the corresponding WriteMessageRejected or WriteMessageFailure messages

@Aaronontheweb
Member

@ismaelhamed have we patched this again since 1.3.13? Working on getting a v1.4.0 beta out ASAP - I can tack a fix for this onto that release if it hasn't been done already.

@ismaelhamed
Member

@Aaronontheweb This is a new one. But when I looked into it, I didn't see an obvious way to fix it.

@object
Contributor Author

object commented Jun 25, 2019

@ismaelhamed You mentioned BatchingSqlJournal.FailChunkExecution. Does it also apply to the non-batching journal? I believe we hit the same problem when we temporarily switched from the batching to the non-batching journal.
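
For clarity, by batching vs non-batching I mean switching the journal class roughly like this (connection string and the rest of the config omitted; class names quoted from memory, so double-check them):

akka.persistence.journal.sql-server {
  # Non-batching journal:
  class = "Akka.Persistence.SqlServer.Journal.SqlServerJournal, Akka.Persistence.SqlServer"

  # Batching journal (what we normally run):
  # class = "Akka.Persistence.SqlServer.Journal.BatchingSqlServerJournal, Akka.Persistence.SqlServer"
}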

@ismaelhamed
Member

@object But then I guess it'd be a different edge case that you've found. Probably the best way to solve this would be to come up with a test case that reproduces it.

@object
Contributor Author

object commented Jun 25, 2019

@ismaelhamed I would like to extract some code that reproduces the problem. Unfortunately it's not that easy. Do you have any tips on how we can track the problem down, e.g. in which Akka/Akka.Persistence files we could add some tracing details that we could present to the Akka team when this happens again?
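
For instance, would enabling something like the following debug switches capture enough detail, or is there a better place to hook in? (These are the standard Akka.NET debug settings; just a sketch of what we could turn on.)

akka {
  loglevel = DEBUG
  actor.debug {
    receive = on      # log messages received by user actors (via LoggingReceive)
    autoreceive = on  # log auto-received messages such as Terminated
    lifecycle = on    # log actor starts, stops and restarts
    unhandled = on    # log unhandled messages
  }
}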

@Aaronontheweb
Member

If you guys could call out where the issue is, @izavala might be able to look into it

@object
Contributor Author

object commented Jul 31, 2019

@izavala @Aaronontheweb The issue still happens and we are willing to dedicate time to help resolve it. I will try to reproduce it again and record the sequence of steps and log events. If you have any tips on how to collect additional information that might be useful to you, please let me know.

@object
Contributor Author

object commented Aug 1, 2019

I ran some more tests using Akka 1.3.14 (most recent pre-1.4 version). Here's what happened.

  1. I started an Akka cluster; different nodes have different roles, and only one of them uses persistent actors. I restarted this node so it started fresh.
  2. Sent some messages that triggered persistent actor creation. So far so good.
  3. Stopped the SQL Server instance (the persistent actors' store).
  4. Sent a new message that was supposed to trigger a persistent actor with PersistenceId ps~msue14000709.

--- SQL Server stopped

  5. Shortly after that I could see the following entries in the log:
15:41:36.468	Resolving persistent actor ["fg:ps~msue14000709"]
15:41:36.468
fields.LogSource	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/msue14000709#2093000972]
fields.MachineName	NRKDT66337
fields.Message	Replay failed ((Total count of recovering actors = 0)). Message type [<null>] ([<null>]),
 - ,
System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) ---> System.ComponentModel.Win32Exception (0x80004005): The system cannot find the file specified

I waited for some time. Errors were being logged the whole time; here is an extract from the log around that point:

15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	now supervising akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Restarted (Akkling.Actors+FunActor`1[System.Object])
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Root: init
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Started (Akkling.Actors+FunActor`1[System.Object])
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Spawned root actor
15:41:36.613	Debug	akka://Oddjob/system/recoveryPermitter	received AutoReceiveMessage <Terminated>: [akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709#1839776680] - ExistenceConfirmed=False
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Stopped
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709	Stopped
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709	Started (Akkling.Persistence.FunPersistentActor`1[System.Object])
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Stopping
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709#1839776680]	Creating persistent actor (Total count of recovering actors = 69)
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor#1460648651]	Resolving persistent actor ["fg:ps~msue14000709"]
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Restarting
15:41:36.613	Error	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Circuit Breaker is open; calls are failing fast
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709	Stopped
15:41:36.613	Error	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#1962527943]	["Replay failed ((Total count of recovering actors = 68)). Message type [<null>] ([<null>])", null, "Akka.Pattern.OpenCircuitException: Circuit Breaker is open; calls are failing fast
   at Akka.Pattern.Open.<Invoke>d__2`1.MoveNext()"] {AckInfo} {Exception}
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Root: awaiting_group_actor (unresolved 2+2)
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Requested persistent actor resolution for GroupIdentity "ps~msue14000709"
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709	now watched by [akka://Oddjob/system/recoveryPermitter#857219885]
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#1962527943]	Creating persistent actor (Total count of recovering actors = 69)
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709	Started (Akkling.Persistence.FunPersistentActor`1[System.Object])
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	now supervising akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor#1460648651]	Resolving persistent actor ["asa:ps~msue14000709"]
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Unhandled message from NoSender : PreStart
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Requested persistent actor resolution for AkamaiStorageId "ps~msue14000709"
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Requesting resolution of 2 persistent IDs
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	["JobWithStorage", "Some({AckActor =
  [akka://Oddjob/user/akamaiupload.reader/queue_acknowledge#913954865];
 Timestamp = 01-Aug-19 3:41:21 PM +02:00;})"] {AckInfo}
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Unhandled message from akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709 : PostStop
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	now supervising akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Restarted (Akkling.Actors+FunActor`1[System.Object])
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Root: init
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Started (Akkling.Actors+FunActor`1[System.Object])
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Spawned root actor
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Stopped
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709	Stopped
15:41:36.612	Debug	akka://Oddjob/system/recoveryPermitter	received AutoReceiveMessage <Terminated>: [akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709#1353465467] - ExistenceConfirmed=False
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709	Started (Akkling.Persistence.FunPersistentActor`1[System.Object])
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709#1353465467]	Creating persistent actor (Total count of recovering actors = 68)
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Stopping
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Restarting
15:41:36.612	Error	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Circuit Breaker is open; calls are failing fast
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor#1305293783]	Resolving persistent actor ["fg:ps~msue14000709"]
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Root: awaiting_group_actor (unresolved 2+2)
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Requested persistent actor resolution for GroupIdentity "ps~msue14000709"
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709	Stopped
15:41:36.612	Error	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#2110918249]	["Replay failed ((Total count of recovering actors = 67)). Message type [<null>] ([<null>])", null, "Akka.Pattern.OpenCircuitException: Circuit Breaker is open; calls are failing fast
   at Akka.Pattern.Open.<Invoke>d__2`1.MoveNext()"] {AckInfo} {Exception}
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709	now watched by [akka://Oddjob/system/recoveryPermitter#857219885]
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709	Started (Akkling.Persistence.FunPersistentActor`1[System.Object])
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#2110918249]	Creating persistent actor (Total count of recovering actors = 68)
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	now supervising akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor#1305293783]	Resolving persistent actor ["asa:ps~msue14000709"]

You can see many attempts to recover the same persistent actors, and lots of errors about the Circuit Breaker being open and calls failing fast.

  6. Then I restarted SQL Server.

--- SQL Server started

  7. I still saw new attempts to recover the same actors, and the circuit breaker was still open. What I also saw was a huge total number of recovering actors (we compute this value ourselves and write it to the logs).
    Look at this:
15:41:41.692	Error	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Circuit Breaker is open; calls are failing fast
15:41:41.692	Error	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#1258832817]	["Replay failed ((Total count of recovering actors = 3515)). Message type [<null>] ([<null>])", null, "Akka.Pattern.OpenCircuitException: Circuit Breaker is open; calls are failing fast
   at Akka.Pattern.Open.<Invoke>d__2`1.MoveNext()"] {AckInfo} {Exception}
15:41:41.691	Error	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Circuit Breaker is open; calls are failing fast
15:41:41.690	Error	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#1347041716]	["Replay failed ((Total count of recovering actors = 3514)). Message type [<null>] ([<null>])", null, "Akka.Pattern.OpenCircuitException: Circuit Breaker is open; calls are failing fast
   at Akka.Pattern.Open.<Invoke>d__2`1.MoveNext()"] {AckInfo} {Exception}
15:41:41.689	Error	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Circuit Breaker is open; calls are failing fast
15:41:41.689	Error	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#102710293]	["Replay failed ((Total count of recovering actors = 3513)). Message type [<null>] ([<null>])", null, "Akka.Pattern.OpenCircuitException: Circuit Breaker is open; calls are failing fast
   at Akka.Pattern.Open.<Invoke>d__2`1.MoveNext()"] {AckInfo} {Exception}
15:41:41.688	Error	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Circuit Breaker is open; calls are failing fast
15:41:41.688	Error	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#96659651]	["Replay failed ((Total count of recovering actors = 3512)). Message type [<null>] ([<null>])", null, "Akka.Pattern.OpenCircuitException: Circuit Breaker is open; calls are failing fast
   at Akka.Pattern.Open.<Invoke>d__2`1.MoveNext()"] {AckInfo} {Exception}

It's all about the same actor Id (ps~msue14000709), but the number of recovering actors is more than 3500! It looks like new attempts are made all the time, but none of them succeed, so they all stay pending.

  8. After a few more minutes I stopped this cluster node.
  9. After restarting the node, things went back to normal.

@izavala @Aaronontheweb Does this description give you any clue? Is there anything else I can record to shed more light on this problem?

@object
Contributor Author

object commented Aug 2, 2019

I think I have good news, @Aaronontheweb @izavala

We spent some time today trying to write a small console app that reproduces the problem, and this app managed to recover its persistent actors after the restart of SQL Server. So we have a code example that does not fail in the way I reported earlier. This means there is a chance that our large app has a flaw in its persistent actor handling logic.

We will continue investigating this on Monday, trying to find out what might cause the difference in behavior. But we now have hope that the problem lies outside the Akka code base.
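
Roughly, the console repro boils down to something like this (a simplified sketch, not our actual code; the actor, messages and connection string are illustrative):

using System;
using Akka.Actor;
using Akka.Configuration;
using Akka.Persistence;

public sealed class CounterActor : ReceivePersistentActor
{
    public override string PersistenceId => "counter-1";

    private int _count;

    public CounterActor()
    {
        // Recovery: replay previously persisted events.
        Recover<int>(delta => _count += delta);

        // Commands: persist an increment event, then update state.
        Command<string>(msg =>
        {
            if (msg == "inc")
                Persist(1, delta =>
                {
                    _count += delta;
                    Console.WriteLine($"count = {_count}");
                });
        });
    }
}

public static class Program
{
    public static void Main()
    {
        // Point the journal at SQL Server (requires the Akka.Persistence.SqlServer package).
        var config = ConfigurationFactory.ParseString(@"
            akka.persistence.journal.plugin = ""akka.persistence.journal.sql-server""
            akka.persistence.journal.sql-server {
                class = ""Akka.Persistence.SqlServer.Journal.SqlServerJournal, Akka.Persistence.SqlServer""
                connection-string = ""Server=localhost;Database=AkkaJournal;Integrated Security=True""
                auto-initialize = on
            }");

        using (var system = ActorSystem.Create("Repro", config))
        {
            var counter = system.ActorOf(Props.Create(() => new CounterActor()), "counter");
            counter.Tell("inc");

            // Manual steps while this runs: stop SQL Server, wait for the circuit breaker
            // to open, restart SQL Server, then restart the actor (or send more commands)
            // and check whether recovery succeeds again. In this small app it did.
            Console.ReadLine();
        }
    }
}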

@object
Contributor Author

object commented Aug 2, 2019

More on this. The difference between the simple scenario where I wasn't able to reproduce the issue and our real app is that the real app uses cluster sharding. So when an actor fails to recover its persistent state, the error is propagated to its supervisor, which is a shard; the shard immediately retries creating the actor, which at some point fails fast with OpenCircuitException, and the exception is propagated again. I guess this is why we see thousands of attempts to create the same persistent actor, which fails again and again.
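
One pattern that might dampen this retry storm is to have the shard create a backoff supervisor that in turn owns the persistent entity, so a failed recovery is retried with an increasing delay instead of in a hot loop. A rough sketch using Akka.NET's Backoff/BackoffSupervisor (entity type and names are illustrative, not our real code, and I haven't verified yet how well this plays with sharding in our case):

using System;
using Akka.Actor;
using Akka.Pattern;
using Akka.Persistence;

public sealed class UploadEntity : ReceivePersistentActor
{
    public override string PersistenceId => "upload-" + Self.Path.Name;

    public UploadEntity()
    {
        Recover<string>(evt => { /* rebuild state from replayed events */ });
        Command<string>(cmd => Persist(cmd, evt => { /* update state */ }));
    }
}

public static class UploadEntityFactory
{
    // Wrap the persistent entity in a BackoffSupervisor: when recovery fails and the
    // child stops, it is restarted after 3s, 6s, 12s ... up to 60s (with some jitter)
    // instead of immediately, giving the journal's circuit breaker time to close.
    public static Props CreateProps(string entityName)
    {
        var childProps = Props.Create(() => new UploadEntity());

        var options = Backoff.OnStop(
            childProps,
            entityName,
            TimeSpan.FromSeconds(3),
            TimeSpan.FromSeconds(60),
            0.2);

        return BackoffSupervisor.Props(options);
    }
}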

@object
Contributor Author

object commented Aug 4, 2019

I investigated this further and am closing the issue as not related to the SQL Server persistence provider. I still have no explanation for why some actors instantiated via cluster sharding don't recover once OpenCircuitException is raised and the reset timeout has passed, but that is a completely different issue that I need to investigate more.
