BatchingSqlServerJournal - stuck persistent actor #104

balcko · 2019-03-14T10:11:45Z

During maintenance of our production SQL Server database hosting the event journal (active/passive setup), we hit a problem where the whole application became unresponsive. Only restarting the actor systems on all cluster nodes via pbm helped. We are using Akka.Persistence.SqlServer 1.3.7 with BatchingSqlServerJournal.

I managed to reproduce the problem locally by running a load test and taking the local journal DB offline.
Even after bringing the DB back online, the actors remain stuck.

The problem seems to lie between the expectations of ReceivePersistentActor (its Eventsourced parent class) and the implementation of BatchingSqlServerJournal. When Persist is called in a persistent actor, a WriteMessages message is sent to the journal actor and the persistent actor changes its behaviour: it waits for a sequence of WriteMessageSuccess/WriteMessageRejected/WriteMessageFailure messages followed by a final WriteMessagesSuccessful message, stashing all other messages in the meantime. If the journal actor never replies, the persistent actor has no timeout mechanism that would treat the persist as failed, so it blocks indefinitely. This is the case with BatchingSqlServerJournal, which has some code paths that do not reply to journal requests (connection open error, open circuit breaker, and maybe more).

The consequences are quite severe: a stuck persistent actor can be unblocked only by stopping it from outside or by restarting the whole actor system.
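
To make the stuck state concrete, here is a minimal Python model of the handshake described above. It is only a sketch, not Akka.NET code: `PersistentActorModel`, `ReplyingJournal` and `SilentJournal` are hypothetical names, and the real protocol involves several reply messages rather than a single callback.

```python
from collections import deque


class PersistentActorModel:
    """Toy model of the persist handshake (hypothetical, not the Eventsourced class)."""

    def __init__(self, journal):
        self.journal = journal
        self.persisting = False  # True while a journal reply is outstanding
        self.current = None      # command whose write is in flight
        self.stash = deque()     # commands deferred during a pending write
        self.handled = []        # commands whose write was confirmed

    def receive(self, command):
        if self.persisting:
            # Mirrors the behaviour switch: everything else is stashed.
            self.stash.append(command)
            return
        self.persisting = True
        self.current = command
        self.journal.write(command, reply_to=self)

    def on_write_reply(self, success):
        # Any reply, success or failure, ends the wait and unstashes one command.
        if success:
            self.handled.append(self.current)
        self.persisting = False
        if self.stash:
            self.receive(self.stash.popleft())


class ReplyingJournal:
    """Journal that always answers the write request."""

    def write(self, event, reply_to):
        reply_to.on_write_reply(success=True)


class SilentJournal:
    """Models a code path that drops the request without any reply."""

    def write(self, event, reply_to):
        pass
```

With `SilentJournal`, the first command leaves the actor in the waiting state for good and every later command only grows the stash. Nothing inside the model can recover, which matches the behaviour reported here.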

Aaronontheweb · 2019-03-14T20:54:07Z

> The problem happens when journal actor will not reply, there is no timeout mechanism in persistent actor treating persist as failed and it blocks infinitely. This is the case of BatchingSqlServerJournal, which has some code paths not replying to journal requests (connection open error, circuit breaker opened and maybe more).

This sure sounds like a bug on our end - let me review the code on this BatchingSqlJournal real quick.

izavala · 2019-04-04T20:40:05Z

I am currently looking through the BatchingSqlJournal code, specifically the part where we wait for the SQL command to be executed:

https://github.com/akkadotnet/akka.net/blob/980a0096ca0aa50f4eb8a27a306df23ffe3a027b/src/contrib/persistence/Akka.Persistence.Sql.Common/Journal/BatchingSqlJournal.cs#L1089-L1114

In theory, if we try to write to the database while it is unavailable, we should get an exception, which we would then handle by replying with the appropriate WriteMessageFailure. This looks like the most likely place to get stuck if no response ever arrives. According to the SqlCommand documentation, if no exception is thrown we would wait here indefinitely, since the timeout property would be ignored.

I am currently trying to write a test for this.
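
One generic way to guard against a command that neither completes nor throws is to put an explicit deadline around the await and report a failure when it expires. The sketch below shows the idea in Python with `asyncio.wait_for`; it is not the journal's code and not necessarily how the issue was fixed. `execute_with_deadline`, `hanging_command` and `quick_command` are hypothetical names.

```python
import asyncio


async def execute_with_deadline(command, timeout_seconds):
    """Await `command` and turn a hang or an exception into an explicit failure."""
    try:
        await asyncio.wait_for(command, timeout=timeout_seconds)
        return ("success", None)
    except asyncio.TimeoutError:
        # The command neither finished nor raised in time.
        return ("failure", "timed out")
    except Exception as cause:
        return ("failure", repr(cause))


async def hanging_command():
    # Stands in for a database call that never returns.
    await asyncio.Event().wait()


async def quick_command():
    # Stands in for a database call that completes normally.
    await asyncio.sleep(0)
```

Either way the caller gets a result it can translate into a reply to the persistent actor, so the waiting side is never left without an answer.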

balcko · 2019-04-05T06:36:18Z

This part looks safe; however, the problem is one level above:
https://github.com/akkadotnet/akka.net/blob/980a0096ca0aa50f4eb8a27a306df23ffe3a027b/src/contrib/persistence/Akka.Persistence.Sql.Common/Journal/BatchingSqlJournal.cs#L850

If the database is unavailable, opening the connection will throw even sooner, which means that none of the persistent actors with requests in the chunk being executed will receive a reply.

The same problem is here:
https://github.com/akkadotnet/akka.net/blob/980a0096ca0aa50f4eb8a27a306df23ffe3a027b/src/contrib/persistence/Akka.Persistence.Sql.Common/Journal/BatchingSqlJournal.cs#L840

If the circuit breaker is open, the whole chunk will produce no replies.
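
The property that both code paths are missing can be sketched as follows: whatever happens before or during the write, every request in the chunk gets exactly one reply. This is a Python illustration under assumptions, not the BatchingSqlJournal implementation. `execute_chunk`, `CircuitOpenError`, `InMemoryConnection` and `offline` are hypothetical names, and the chunk is treated as all-or-nothing, as if it ran in a single transaction.

```python
class CircuitOpenError(Exception):
    """Raised in this sketch when the circuit breaker rejects the call."""


class InMemoryConnection:
    """Stand-in for a database connection; it only records written events."""

    def __init__(self):
        self.rows = []

    def write(self, event):
        self.rows.append(event)


def offline():
    """Stand-in for opening a connection while the database is down."""
    raise ConnectionError("database is offline")


def execute_chunk(requests, open_connection, breaker_is_open):
    """Return one (sender, status, cause) reply per (sender, event) request.

    The breaker check and the connection open sit inside the try block, so a
    failure at either step still yields a failure reply for every sender.
    """
    try:
        if breaker_is_open():
            raise CircuitOpenError("circuit breaker is open")
        connection = open_connection()  # may raise when the database is offline
        for _, event in requests:
            connection.write(event)
    except Exception as cause:
        return [(sender, "failure", str(cause)) for sender, _ in requests]
    return [(sender, "success", None) for sender, _ in requests]
```

For example, `execute_chunk([("actor-1", "e1")], offline, lambda: False)` yields a failure reply for `actor-1` instead of silence, so the waiting actor can leave its persisting state.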

ismaelhamed · 2019-04-06T07:38:23Z

IMO this issue belongs to the akka.net project, and #3753 is most likely related to this one.

Aaronontheweb · 2019-04-08T21:40:22Z

Closed via akkadotnet/akka.net#3754

Horusiath self-assigned this Mar 15, 2019

Aaronontheweb closed this as completed Apr 8, 2019

This was referenced Apr 30, 2019

v1.3.13 Production Release akkadotnet/akka.net#3772

Merged

added Akka.Persistence.SqlServer 1.3.13 release notes #108

Merged

Akka.Persistence.SqlServer 1.3.13 Release #109

Merged

object mentioned this issue May 23, 2019

Persistent actors are still stuck after network failure #114

Closed

balcko mentioned this issue Jul 29, 2019

SqlSnapshotStore with autoinitialization stops if DB is temporarily inaccessible akkadotnet/akka.net#3870

Closed
