BatchingSqlServerJournal - stuck persistent actor #104

Closed
balcko opened this issue Mar 14, 2019 · 5 comments · Fixed by #108
@balcko
balcko commented Mar 14, 2019

During maintenance of our production SQL Server database hosting the event journal (active/passive setup), we encountered a problem where the whole application became unresponsive. Only restarting the actor systems on all nodes in the cluster via pbm helped. We are using Akka.Persistence.SqlServer 1.3.7 and BatchingSqlServerJournal.

I managed to reproduce the problem locally by running a load test and taking the local journal DB offline.
Even after the DB is brought back online, the actors remain stuck.

The problem seems to lie between the expectations of ReceivePersistentActor (via its Eventsourced parent class) and the implementation of BatchingSqlServerJournal. When Persist is called in a persistent actor, a WriteMessages message is sent to the journal actor and the persistent actor changes its behaviour: it waits for a sequence of WriteMessageSuccess/WriteMessageRejected/WriteMessageFailure messages followed by a final WriteMessagesSuccessful message, stashing all other messages in the meantime. The problem arises when the journal actor never replies: the persistent actor has no timeout mechanism that would treat the persist as failed, so it blocks indefinitely. This is the case with BatchingSqlServerJournal, which has some code paths that do not reply to journal requests (connection open error, open circuit breaker, and possibly more).

The consequences are quite severe: a persistent actor can be unblocked only by stopping it externally or by restarting the whole actor system.
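To make the mechanism above concrete, here is a minimal toy model of the wait-and-stash behaviour. It uses no Akka.NET types; `ToyPersistentActor`, `Receive`, and `OnJournalReply` are hypothetical names invented for illustration, and the real Eventsourced logic is considerably more involved.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: hypothetical names, not Akka.NET APIs.
public sealed class ToyPersistentActor
{
    private readonly Queue<string> _stash = new Queue<string>();
    private bool _awaitingJournalReply;

    public List<string> Handled { get; } = new List<string>();
    public int StashedCount => _stash.Count;

    // A command arrives. While a write is in flight, every command is stashed.
    public void Receive(string command)
    {
        if (_awaitingJournalReply)
        {
            _stash.Enqueue(command);
            return;
        }

        Handled.Add(command);
        // Models Persist(): a write request goes to the journal and the
        // actor now waits for the journal's answer.
        _awaitingJournalReply = true;
    }

    // A journal reply (success or failure) is the only thing that ends the wait.
    public void OnJournalReply()
    {
        _awaitingJournalReply = false;
        // Replay stashed commands until one of them starts another write.
        while (!_awaitingJournalReply && _stash.Count > 0)
        {
            Receive(_stash.Dequeue());
        }
    }
}

public static class StuckActorDemo
{
    public static void Main()
    {
        var actor = new ToyPersistentActor();
        actor.Receive("cmd-1"); // handled, write now in flight
        actor.Receive("cmd-2"); // stashed
        actor.Receive("cmd-3"); // stashed

        // If the journal never replies, the actor stays in this state forever.
        Console.WriteLine($"handled={actor.Handled.Count}, stashed={actor.StashedCount}");

        actor.OnJournalReply(); // any reply, even a failure, unblocks the actor
        Console.WriteLine($"handled={actor.Handled.Count}, stashed={actor.StashedCount}");
    }
}
```

In this model nothing inside the actor can end the wait, because a self-sent timeout message would be stashed like any other command. That is why the journal has to answer every request, if only with a failure.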

@Aaronontheweb
Member

> The problem happens when journal actor will not reply, there is no timeout mechanism in persistent actor treating persist as failed and it blocks infinitely. This is the case of BatchingSqlServerJournal, which has some code paths not replying to journal requests (connection open error, circuit breaker opened and maybe more).

This sure sounds like a bug on our end - let me review the code on this BatchingSqlJournal real quick.

@Horusiath self-assigned this Mar 15, 2019
@izavala
Contributor

izavala commented Apr 4, 2019

I'm currently looking through the BatchingSqlJournal code, specifically the point where we wait for the SQL command to be executed:

https://github.com/akkadotnet/akka.net/blob/980a0096ca0aa50f4eb8a27a306df23ffe3a027b/src/contrib/persistence/Akka.Persistence.Sql.Common/Journal/BatchingSqlJournal.cs#L1089-L1114

In theory, if we try to write to the database while it is unavailable, we should get an exception, which we would then handle by replying with the appropriate WriteMessageFailure. This looks like the most likely place to get stuck if no response arrives. According to the SqlCommand documentation, if we do not get an exception we would wait here indefinitely, since the timeout property would be ignored.

Currently trying to write a test for this.

@balcko
Author

balcko commented Apr 5, 2019

This part looks safe; however, the problem is one level above:
https://github.com/akkadotnet/akka.net/blob/980a0096ca0aa50f4eb8a27a306df23ffe3a027b/src/contrib/persistence/Akka.Persistence.Sql.Common/Journal/BatchingSqlJournal.cs#L850

If the database is unavailable, opening the connection will throw even earlier, which means that none of the persistent actors with requests in the executed chunk will receive a reply.

The same problem is here:
https://github.com/akkadotnet/akka.net/blob/980a0096ca0aa50f4eb8a27a306df23ffe3a027b/src/contrib/persistence/Akka.Persistence.Sql.Common/Journal/BatchingSqlJournal.cs#L840

If the circuit breaker is open, the whole chunk will produce no replies.
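One way to picture a remedy for both code paths is to guarantee that every request in a chunk gets some reply, whatever fails before or during execution. The sketch below is a dependency-free toy model, not the BatchingSqlJournal code and not the change that eventually closed this issue; `ToyRequest`, `ToyJournal.ExecuteChunk`, and the `breakerOpen`/`openConnection` parameters are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for one journal request and its reply channel.
public sealed class ToyRequest
{
    public ToyRequest(string persistenceId) { PersistenceId = persistenceId; }

    public string PersistenceId { get; }
    public string Reply { get; private set; } // null until the journal answers

    public void Respond(string reply) { Reply = reply; }
}

public static class ToyJournal
{
    // Executes one chunk. If the breaker is open or the connection cannot be
    // opened, every request still receives a failure reply.
    public static void ExecuteChunk(
        IReadOnlyList<ToyRequest> chunk, bool breakerOpen, Action openConnection)
    {
        try
        {
            if (breakerOpen)
            {
                throw new InvalidOperationException("circuit breaker is open");
            }

            openConnection(); // may throw when the database is offline

            foreach (var request in chunk)
            {
                request.Respond("success");
            }
        }
        catch (Exception cause)
        {
            // Answer every request that has not been answered yet.
            foreach (var request in chunk)
            {
                if (request.Reply == null)
                {
                    request.Respond("failure: " + cause.Message);
                }
            }
        }
    }
}

public static class ChunkReplyDemo
{
    public static void Main()
    {
        var chunk = new List<ToyRequest> { new ToyRequest("a"), new ToyRequest("b") };

        // Simulate a database that is offline: opening the connection throws.
        ToyJournal.ExecuteChunk(
            chunk, false, () => throw new InvalidOperationException("database offline"));

        foreach (var request in chunk)
        {
            Console.WriteLine($"{request.PersistenceId}: {request.Reply}");
        }
    }
}
```

The point of the single catch block is that it sits at the chunk level, above both the breaker check and the connection open, so no early failure can skip the replies.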

@ismaelhamed
Member

IMO this issue belongs to the akka.net project, and #3753 is most likely related to this one.

@Aaronontheweb
Member

Closed via akkadotnet/akka.net#3754

5 participants