BatchingSqlServerJournal - stuck persistent actor #104
Comments
This sure sounds like a bug on our end - let me review the code on this.
Currently looking through the code. In theory, if we try to write to the database when it is unavailable, we should get an exception, which we would then handle appropriately. Currently trying to write a test for this.
This part looks safe; however, the problem is one level above: if the database is unavailable, opening the connection throws even sooner, which means that none of the persistent actors with requests in the executed chunk receive a reply. The same problem occurs with the circuit breaker: if it is open, the whole chunk produces no replies.
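To make the point about the level above concrete, here is a minimal, library-agnostic sketch in Python. It is not the actual journal code and not necessarily what any fix looks like; `execute_chunk`, `open_connection`, `write` and `reply` are hypothetical names introduced only for illustration. The idea it shows is that a failure raised before any per-request handling (such as opening the connection) should still be turned into one failure reply per request in the chunk.

```python
def execute_chunk(requests, open_connection, write, reply):
    """Toy model of processing one chunk of journal write requests.

    All parameters are hypothetical callables supplied by the caller:
    open_connection() -> connection, write(connection, request) -> None,
    reply(request, (status, cause)) -> None.
    """
    try:
        connection = open_connection()
    except Exception as cause:
        # The failure happened before any request was touched.
        # Without this loop, every requester in the chunk would wait forever.
        for request in requests:
            reply(request, ("failure", cause))
        return

    for request in requests:
        try:
            write(connection, request)
            reply(request, ("success", None))
        except Exception as cause:
            reply(request, ("failure", cause))
```

The same shape would apply to an open circuit breaker: whatever rejects the chunk as a whole has to be followed by a reply to each individual request.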
IMO this issue belongs to the akka.net project, and #3753 is most likely related to this one.
Closed via akkadotnet/akka.net#3754
During maintenance of our production SQL Server database hosting the event journal (active/passive setup), we encountered a problem where the whole application became unresponsive. Only restarting the actor systems on all nodes in the cluster via pbm helped. We are using Akka.Persistence.SqlServer 1.3.7 and BatchingSqlServerJournal.
I managed to reproduce the problem locally by running a load test and taking the local journal DB offline.
Even after bringing the DB back online, the actors remain stuck.
The problem seems to lie between the expectations of ReceivePersistentActor (via its Eventsourced parent class) and the implementation of BatchingSqlServerJournal. When Persist is called in a persistent actor, a WriteMessages message is sent to the journal actor and the persistent actor changes its behaviour: it waits for a sequence of WriteMessageSuccess/WriteMessageRejected/WriteMessageFailure messages and, finally, a WriteMessagesSuccessful message, stashing all other messages in the meantime. If the journal actor never replies, the persistent actor has no timeout mechanism that would treat the persist as failed, so it blocks indefinitely. This is the case with BatchingSqlServerJournal, which has some code paths that do not reply to journal requests (connection open error, open circuit breaker, and maybe more).
The consequences are quite severe: a persistent actor can be unblocked only by stopping it externally or by restarting the whole actor system.
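The stuck state described above can be reproduced in a few lines with a toy model. This is a Python sketch under stated assumptions, not Akka.NET code: `ToyPersistentActor`, `run_scenario` and the `journal_replies` flag are hypothetical names, and the real actor waits for several reply message types rather than a single callback.

```python
from collections import deque


class ToyPersistentActor:
    """Hypothetical model of an actor that stashes while awaiting the journal."""

    def __init__(self):
        self.awaiting_journal = False
        self.stash = deque()
        self.handled = []

    def persist(self, event, journal_inbox):
        # Send the write request and switch to the "waiting" behaviour.
        self.awaiting_journal = True
        journal_inbox.append(event)

    def tell(self, message):
        # While waiting for the journal, every other message is stashed.
        if self.awaiting_journal:
            self.stash.append(message)
        else:
            self.handled.append(message)

    def on_journal_reply(self):
        # Any reply (success or failure) ends the wait and unstashes.
        self.awaiting_journal = False
        while self.stash:
            self.handled.append(self.stash.popleft())


def run_scenario(journal_replies):
    actor = ToyPersistentActor()
    journal_inbox = []
    actor.persist("evt-1", journal_inbox)
    if journal_replies:
        actor.on_journal_reply()
    # With no reply and no timeout, nothing ever resets awaiting_journal,
    # even if the database comes back online later.
    actor.tell("cmd-1")
    return actor.awaiting_journal, list(actor.handled), list(actor.stash)
```

With `journal_replies=False` the actor stays in the waiting state and `cmd-1` remains in the stash; with `journal_replies=True` the same command is handled normally. In this model the only ways out of the stuck state are a reply from the journal or discarding the actor, which mirrors what we observed.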