Persistent actors are still stuck after network failure #114
Comments
I ran more tests and can confirm that if I temporarily stop SQL Server, persistent actor recovery keeps failing with the error "Recovery timed out, didn't get event within 60s, highest sequence number seen 0." even after the server is restarted. Since this issue has quite high priority for us, just let us know if you need more information or want us to test something.
@Horusiath Looks like the |
@ismaelhamed have we patched this again since 1.3.13? Working on getting a v1.4.0 beta out ASAP - I can tack a fix for this onto that release if it hasn't been done already.
@Aaronontheweb This is a new one. But when I looked into it, I didn't see an obvious way to fix it.
@ismaelhamed You mentioned BatchingSqlJournal.FailChunkExecution. Does it also apply to the non-batching journal? I believe we hit the same problem when we temporarily switched from the batching to the non-batching journal.
@object But then, I guess it'd be a different edge case that you've found. Probably the best way to solve this would be to come up with a test case that reproduces it.
@ismaelhamed I would like to extract some code that reproduces the problem. Unfortunately, it's not that easy. Do you have any tips on how we can track the problem, e.g. which Akka/Akka.Persistence files we could add tracing to, so that we have details to present to the Akka team when this happens again?
If you guys could call out where the issue is, @izavala might be able to look into it.
@izavala @Aaronontheweb The issue still happens and we are willing to dedicate time to help resolve it. I will try to reproduce it again and record the sequence of steps and log events. If you have any tips about how to collect additional information that might be useful for you, please let me know.
I ran some more tests using Akka 1.3.14 (most recent pre-1.4 version). Here's what happened.
--- SQL Server stopped
I waited for some time. Errors were logged all the time, an extract from the log around this time:
You can see many attempts to recover the same persistent actors, along with many errors and messages saying that the circuit breaker is open and calls are failing fast.
--- SQL Server started
The log is still about the same actor ID (ps~msue14000709), but the number of recovering actors is more than 3500! It looks like new attempts are made all the time, but none of them succeed, so they are all pending.
@izavala @Aaronontheweb Does this description give you any clue? Is there anything else I can record to shed more light on this problem?
I think I have good news, @Aaronontheweb @izavala. We spent some time today trying to write a small console app that reproduces the problem, and this app managed to recover its persistent actors after a restart of SQL Server. So we have a code example that doesn't fail the way I reported earlier. This means there is a chance that our large app has a flaw in its persistent actor handling logic. We will continue investigating this case on Monday, trying to find out what might cause the difference in behavior. But we now have hope that the problem lies outside the Akka code base.
More on this. The difference between the simple scenario, where I wasn't able to reproduce the issue, and our real app is that the real app uses cluster sharding. If an actor fails to recover its persistent state, the error is propagated to its supervisor, which is a shard. The shard immediately retries creating the actor, at some point the attempt fails fast with OpenCircuitException, and the exception is propagated again. I guess this is why we see thousands of attempts to create the same persistent actor, each of which fails again.
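To illustrate why immediate retries against an open circuit breaker produce thousands of attempts, here is a minimal, library-independent sketch. It is not Akka.NET code: `CircuitBreaker`, `CircuitOpenError`, `simulate`, and all parameter values (3 failures to open, 5 s reset timeout, 10 s outage) are hypothetical names and numbers chosen for the illustration, and time is simulated in milliseconds rather than measured.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Toy breaker (hypothetical, not the Akka.NET implementation)."""

    def __init__(self, max_failures, reset_timeout_ms):
        self.max_failures = max_failures
        self.reset_timeout_ms = reset_timeout_ms
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, now):
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_timeout_ms:
                # Open: fail fast without touching the database.
                raise CircuitOpenError("failing fast")
            # Reset timeout elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = now
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def simulate(retry_delay_ms, outage_ms, max_failures=3, reset_timeout_ms=5_000):
    """Retry recovery every retry_delay_ms until it succeeds.

    Returns (total attempts, attempts rejected by the open breaker,
    simulated time in ms at which recovery succeeded).
    """
    breaker = CircuitBreaker(max_failures, reset_timeout_ms)
    now = 0
    attempts = 0
    fast_failures = 0

    def recover():
        # Stand-in for journal replay: fails while the database is down.
        if now < outage_ms:
            raise ConnectionError("database unavailable")
        return "recovered"

    while True:
        attempts += 1
        try:
            breaker.call(recover, now)
            return attempts, fast_failures, now
        except CircuitOpenError:
            fast_failures += 1
        except ConnectionError:
            pass
        now += retry_delay_ms  # the supervisor restarts the actor after this delay


if __name__ == "__main__":
    print("retry every 1 ms:", simulate(retry_delay_ms=1, outage_ms=10_000))
    print("retry every 1 s: ", simulate(retry_delay_ms=1_000, outage_ms=10_000))
```

In this sketch both strategies recover shortly after the simulated outage ends, but the near-immediate retry makes orders of magnitude more attempts, almost all of them rejected by the open breaker. It does not explain why some sharded actors never recover; it only shows how a fast restart loop inflates the attempt count seen in the logs.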
I investigated more and am closing the issue as not related to the SQL Server persistence provider. I still have no explanation for why some actors instantiated via cluster sharding don't recover once OpenCircuitException has been raised and the reset timeout has passed, but this is a completely different issue which I need to investigate further.
This might be the same as #104, which is supposed to be fixed in 1.3.13. However, we upgraded to 1.3.13, and it looks like this error occurs more often than before. Here are the symptoms: