Persistent actors are still stuck after network failure #114

Closed
object opened this issue May 23, 2019 · 13 comments

@object
Contributor

object commented May 23, 2019

This might be the same as #104, which was supposed to be fixed in 1.3.13. However, we upgraded to 1.3.13 and this error seems to occur even more often than before. Here are the symptoms:

  1. A temporary network failure occurs and lasts for a short time; we see some errors in our logs (for example, an error from RabbitMQ).
  2. Around this time we see a number of "Failed to execute chunk for 1 requests" error messages from akka://Oddjob/system/akka.persistence.journal.sql-server.
  3. A few minutes after that (when the network is OK again) we see a few hundred "Circuit Breaker is open; calls are failing fast" errors from our persistent actors that are trying to recover their state.
  4. Then there is a continuous stream of "Recovery timed out, didn't get event within 60s, highest sequence number seen 0." messages that lasts until we restart the node.
  5. After restarting the node, things go back to normal.
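
For context, the knobs involved here seem to be the journal's circuit breaker and the recovery event timeout. Roughly, they live here in HOCON (the circuit-breaker values shown are the defaults as far as I know, not necessarily what we run in production; recovery-event-timeout is shown at 60s only to match the timeout in the error message above):

akka.persistence.journal.sql-server {
  # Circuit breaker guarding journal calls; while it is open, reads and writes
  # fail fast with OpenCircuitException until reset-timeout has elapsed.
  circuit-breaker {
    max-failures = 10
    call-timeout = 10s
    reset-timeout = 30s
  }
  # How long a recovering actor waits for the next replayed event before the
  # "Recovery timed out, didn't get event within ..." error is raised.
  recovery-event-timeout = 60s
}
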
@object
Contributor Author

object commented May 29, 2019

I ran more tests and can confirm that if I temporarily stop SQL Server, persistent actor recovery keeps failing with the error "Recovery timed out, didn't get event within 60s, highest sequence number seen 0." even after the server is restarted.

Since this issue is quite high priority for us, just let us know if you need more information or want us to test something.

@ismaelhamed
Member

@Horusiath Looks like the BatchingSqlJournal.FailChunkExecution method is not producing the corresponding WriteMessageRejected or WriteMessageFailure messages

@Aaronontheweb
Member

@ismaelhamed have we patched this again since 1.3.13? Working on getting a v1.4.0 beta out ASAP - I can tack a fix for this onto that release if it hasn't been done already.

@ismaelhamed
Member

@Aaronontheweb This is a new one. But when I looked into it, I didn't see an obvious way to fix it.

@object
Contributor Author

object commented Jun 25, 2019

@ismaelhamed You mentioned BatchingSqlJournal.FailChunkExecution. Does it also apply to the non-batching journal? I believe we hit the same problem when we temporarily switched from the batching to the non-batching journal.
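
For clarity, by batching vs non-batching I mean switching the journal class roughly like this (connection string and the rest of the config omitted; class names quoted from memory, so double-check them):

akka.persistence.journal.sql-server {
  # Non-batching journal:
  class = "Akka.Persistence.SqlServer.Journal.SqlServerJournal, Akka.Persistence.SqlServer"

  # Batching journal (what we normally run):
  # class = "Akka.Persistence.SqlServer.Journal.BatchingSqlServerJournal, Akka.Persistence.SqlServer"
}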

@ismaelhamed
Member

@object But then I guess it'd be a different edge case that you've found. Probably the best way to solve this would be to come up with a test case that reproduces it.

@object
Contributor Author

object commented Jun 25, 2019

@ismaelhamed I would like to extract some code that reproduces the problem. Unfortunately it's not that easy. Do you have any tips on how we can track the problem down, e.g. in which Akka/Akka.Persistence files we could add some tracing details that we could present to the Akka team when this happens again?
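
For instance, would enabling something like the following debug switches capture enough detail, or is there a better place to hook in? (These are the standard Akka.NET debug settings; just a sketch of what we could turn on.)

akka {
  loglevel = DEBUG
  actor.debug {
    receive = on      # log messages received by user actors (via LoggingReceive)
    autoreceive = on  # log auto-received messages such as Terminated
    lifecycle = on    # log actor starts, stops and restarts
    unhandled = on    # log unhandled messages
  }
}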

@Aaronontheweb
Member

If you guys could call out where the issue is, @izavala might be able to look into it

@object
Contributor Author

object commented Jul 31, 2019

@izavala @Aaronontheweb The issue still happens and we are willing to dedicate time to help resolve it. I will try to reproduce it again and record the sequence of steps and log events. If you have any tips on how to collect additional information that might be useful to you, please let me know.

@object
Contributor Author

object commented Aug 1, 2019

I ran some more tests using Akka 1.3.14 (most recent pre-1.4 version). Here's what happened.

  1. I started an Akka cluster; different nodes have different roles, and only one of them uses persistent actors. I restarted this node so it started fresh.
  2. Sent some messages that triggered persistent actor creation. So far so good.
  3. Stopped the SQL Server instance (the persistent actors' store).
  4. Sent a new message that was supposed to trigger a persistent actor with PersistenceId ps~msue14000709.

--- SQL Server stopped

  5. Shortly after that I could see the following entries in the log:
15:41:36.468	Resolving persistent actor ["fg:ps~msue14000709"]
15:41:36.468
fields.LogSource	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/msue14000709#2093000972]
fields.MachineName	NRKDT66337
fields.Message	Replay failed ((Total count of recovering actors = 0)). Message type [<null>] ([<null>]),
 - ,
System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) ---> System.ComponentModel.Win32Exception (0x80004005): The system cannot find the file specified

I waited for some time. Errors were being logged the whole time; here is an extract from the log around that point:

15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	now supervising akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Restarted (Akkling.Actors+FunActor`1[System.Object])
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Root: init
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Started (Akkling.Actors+FunActor`1[System.Object])
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Spawned root actor
15:41:36.613	Debug	akka://Oddjob/system/recoveryPermitter	received AutoReceiveMessage <Terminated>: [akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709#1839776680] - ExistenceConfirmed=False
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Stopped
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709	Stopped
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709	Started (Akkling.Persistence.FunPersistentActor`1[System.Object])
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Stopping
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709#1839776680]	Creating persistent actor (Total count of recovering actors = 69)
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor#1460648651]	Resolving persistent actor ["fg:ps~msue14000709"]
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Restarting
15:41:36.613	Error	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Circuit Breaker is open; calls are failing fast
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709	Stopped
15:41:36.613	Error	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#1962527943]	["Replay failed ((Total count of recovering actors = 68)). Message type [<null>] ([<null>])", null, "Akka.Pattern.OpenCircuitException: Circuit Breaker is open; calls are failing fast
   at Akka.Pattern.Open.<Invoke>d__2`1.MoveNext()"] {AckInfo} {Exception}
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Root: awaiting_group_actor (unresolved 2+2)
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Requested persistent actor resolution for GroupIdentity "ps~msue14000709"
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709	now watched by [akka://Oddjob/system/recoveryPermitter#857219885]
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#1962527943]	Creating persistent actor (Total count of recovering actors = 69)
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709	Started (Akkling.Persistence.FunPersistentActor`1[System.Object])
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	now supervising akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor#1460648651]	Resolving persistent actor ["asa:ps~msue14000709"]
15:41:36.613	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Unhandled message from NoSender : PreStart
15:41:36.613	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Requested persistent actor resolution for AkamaiStorageId "ps~msue14000709"
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Requesting resolution of 2 persistent IDs
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	["JobWithStorage", "Some({AckActor =
  [akka://Oddjob/user/akamaiupload.reader/queue_acknowledge#913954865];
 Timestamp = 01-Aug-19 3:41:21 PM +02:00;})"] {AckInfo}
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Unhandled message from akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709 : PostStop
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	now supervising akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Restarted (Akkling.Actors+FunActor`1[System.Object])
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Root: init
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Started (Akkling.Actors+FunActor`1[System.Object])
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Spawned root actor
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Stopped
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709	Stopped
15:41:36.612	Debug	akka://Oddjob/system/recoveryPermitter	received AutoReceiveMessage <Terminated>: [akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709#1353465467] - ExistenceConfirmed=False
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709	Started (Akkling.Persistence.FunPersistentActor`1[System.Object])
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/fg:ps~msue14000709#1353465467]	Creating persistent actor (Total count of recovering actors = 68)
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	Stopping
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Restarting
15:41:36.612	Error	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Circuit Breaker is open; calls are failing fast
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor#1305293783]	Resolving persistent actor ["fg:ps~msue14000709"]
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Root: awaiting_group_actor (unresolved 2+2)
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709#1256947492]	Requested persistent actor resolution for GroupIdentity "ps~msue14000709"
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709	Stopped
15:41:36.612	Error	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#2110918249]	["Replay failed ((Total count of recovering actors = 67)). Message type [<null>] ([<null>])", null, "Akka.Pattern.OpenCircuitException: Circuit Breaker is open; calls are failing fast
   at Akka.Pattern.Open.<Invoke>d__2`1.MoveNext()"] {AckInfo} {Exception}
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709	now watched by [akka://Oddjob/system/recoveryPermitter#857219885]
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709	Started (Akkling.Persistence.FunPersistentActor`1[System.Object])
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#2110918249]	Creating persistent actor (Total count of recovering actors = 68)
15:41:36.612	Debug	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor	now supervising akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709
15:41:36.612	Debug	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor#1305293783]	Resolving persistent actor ["asa:ps~msue14000709"]

You can see many attempts to recover the same persistent actors, and lots of errors about the Circuit Breaker being open and calls failing fast.

  6. Then I restarted SQL Server.

--- SQL Server started

  7. I still saw new attempts to recover the same actors, and the circuit breaker was still open. What I also saw was a huge total number of recovering actors (we compute this value ourselves and write it to the logs).
    Look at this:
15:41:41.692	Error	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Circuit Breaker is open; calls are failing fast
15:41:41.692	Error	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#1258832817]	["Replay failed ((Total count of recovering actors = 3515)). Message type [<null>] ([<null>])", null, "Akka.Pattern.OpenCircuitException: Circuit Breaker is open; calls are failing fast
   at Akka.Pattern.Open.<Invoke>d__2`1.MoveNext()"] {AckInfo} {Exception}
15:41:41.691	Error	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Circuit Breaker is open; calls are failing fast
15:41:41.690	Error	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#1347041716]	["Replay failed ((Total count of recovering actors = 3514)). Message type [<null>] ([<null>])", null, "Akka.Pattern.OpenCircuitException: Circuit Breaker is open; calls are failing fast
   at Akka.Pattern.Open.<Invoke>d__2`1.MoveNext()"] {AckInfo} {Exception}
15:41:41.689	Error	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Circuit Breaker is open; calls are failing fast
15:41:41.689	Error	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#102710293]	["Replay failed ((Total count of recovering actors = 3513)). Message type [<null>] ([<null>])", null, "Akka.Pattern.OpenCircuitException: Circuit Breaker is open; calls are failing fast
   at Akka.Pattern.Open.<Invoke>d__2`1.MoveNext()"] {AckInfo} {Exception}
15:41:41.688	Error	akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709	Circuit Breaker is open; calls are failing fast
15:41:41.688	Error	[akka://Oddjob/system/sharding/upload/upload_1/ps~msue14000709/persistence-supervisor/asa:ps~msue14000709#96659651]	["Replay failed ((Total count of recovering actors = 3512)). Message type [<null>] ([<null>])", null, "Akka.Pattern.OpenCircuitException: Circuit Breaker is open; calls are failing fast
   at Akka.Pattern.Open.<Invoke>d__2`1.MoveNext()"] {AckInfo} {Exception}

It's all about the same actor Id (ps~msue14000709), but the number of recovering actors is more than 3500! It looks like new attempts are made all the time, but none of them succeed, so they all stay pending.

  8. After a few more minutes I stopped this cluster node.
  9. After restarting the node, things went back to normal.

@izavala @Aaronontheweb Does this description give you any clue? Is there anything else I can record to shed more light on this problem?

@object
Contributor Author

object commented Aug 2, 2019

I think I have good news, @Aaronontheweb @izavala

We spent some time today trying to write a small console app that reproduces the problem, and this app managed to recover its persistent actors after the restart of SQL Server. So we have a code example that does not fail in the way I reported earlier. This means there is a chance that our large app has a flaw in its persistent actor handling logic.

We will continue investigating this on Monday, trying to find out what might cause the difference in behavior. But we now have hope that the problem lies outside the Akka code base.
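
Roughly, the console repro boils down to something like this (a simplified sketch, not our actual code; the actor, messages and connection string are illustrative):

using System;
using Akka.Actor;
using Akka.Configuration;
using Akka.Persistence;

public sealed class CounterActor : ReceivePersistentActor
{
    public override string PersistenceId => "counter-1";

    private int _count;

    public CounterActor()
    {
        // Recovery: replay previously persisted events.
        Recover<int>(delta => _count += delta);

        // Commands: persist an increment event, then update state.
        Command<string>(msg =>
        {
            if (msg == "inc")
                Persist(1, delta =>
                {
                    _count += delta;
                    Console.WriteLine($"count = {_count}");
                });
        });
    }
}

public static class Program
{
    public static void Main()
    {
        // Point the journal at SQL Server (requires the Akka.Persistence.SqlServer package).
        var config = ConfigurationFactory.ParseString(@"
            akka.persistence.journal.plugin = ""akka.persistence.journal.sql-server""
            akka.persistence.journal.sql-server {
                class = ""Akka.Persistence.SqlServer.Journal.SqlServerJournal, Akka.Persistence.SqlServer""
                connection-string = ""Server=localhost;Database=AkkaJournal;Integrated Security=True""
                auto-initialize = on
            }");

        using (var system = ActorSystem.Create("Repro", config))
        {
            var counter = system.ActorOf(Props.Create(() => new CounterActor()), "counter");
            counter.Tell("inc");

            // Manual steps while this runs: stop SQL Server, wait for the circuit breaker
            // to open, restart SQL Server, then restart the actor (or send more commands)
            // and check whether recovery succeeds again. In this small app it did.
            Console.ReadLine();
        }
    }
}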

@object
Contributor Author

object commented Aug 2, 2019

More on this. The difference between the simple scenario where I wasn't able to reproduce the issue and our real app is that the real app uses cluster sharding. So when an actor fails to recover its persistent state, the error is propagated to its supervisor, which is a shard; the shard immediately retries creating the actor, which at some point fails fast with OpenCircuitException, and the exception is propagated again. I guess this is why we see thousands of attempts to create the same persistent actor, which fails again and again.
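
One pattern that might dampen this retry storm is to have the shard create a backoff supervisor that in turn owns the persistent entity, so a failed recovery is retried with an increasing delay instead of in a hot loop. A rough sketch using Akka.NET's Backoff/BackoffSupervisor (entity type and names are illustrative, not our real code, and I haven't verified yet how well this plays with sharding in our case):

using System;
using Akka.Actor;
using Akka.Pattern;
using Akka.Persistence;

public sealed class UploadEntity : ReceivePersistentActor
{
    public override string PersistenceId => "upload-" + Self.Path.Name;

    public UploadEntity()
    {
        Recover<string>(evt => { /* rebuild state from replayed events */ });
        Command<string>(cmd => Persist(cmd, evt => { /* update state */ }));
    }
}

public static class UploadEntityFactory
{
    // Wrap the persistent entity in a BackoffSupervisor: when recovery fails and the
    // child stops, it is restarted after 3s, 6s, 12s ... up to 60s (with some jitter)
    // instead of immediately, giving the journal's circuit breaker time to close.
    public static Props CreateProps(string entityName)
    {
        var childProps = Props.Create(() => new UploadEntity());

        var options = Backoff.OnStop(
            childProps,
            entityName,
            TimeSpan.FromSeconds(3),
            TimeSpan.FromSeconds(60),
            0.2);

        return BackoffSupervisor.Props(options);
    }
}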

@object
Contributor Author

object commented Aug 4, 2019

I investigated this further and am closing the issue as not related to the SQL Server persistence provider. I still have no explanation for why some actors instantiated via cluster sharding don't recover once OpenCircuitException is raised and the reset timeout has passed, but that is a completely different issue that I need to investigate more.
