Split brain resolver intermittently fails to recreate shard #3455

raypurchasett · 2018-05-17T14:20:46Z

Version: Akka.Cluster.Sharding 1.3.6-beta62

I have a three-node cluster with a very simple sharded entity. I'm testing the static-quorum split brain resolver strategy and have hit a bug.

The quorum size is set to 2, so when I bring down one node I expect the shard to migrate to one of the remaining UP nodes. I'm also using SQL Server persistence. 'Auto down' is not enabled.
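
A configuration matching this description might look roughly like the sketch below. It is not the reporter's actual configuration: the key names follow the Akka.NET split brain resolver documentation, and the `stable-after` value is an assumption chosen for illustration.

```hocon
akka.cluster {
  # Auto-down stays disabled, as stated in the report ("off" is the default)
  auto-down-unreachable-after = off

  # Use the split brain resolver as the downing provider
  downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"

  split-brain-resolver {
    active-strategy = static-quorum

    # Assumed value: how long membership must be stable before a decision is made
    stable-after = 20s

    static-quorum {
      # With three nodes, a quorum of 2 lets the surviving pair stay up
      quorum-size = 2
    }
  }
}
```

With a quorum size of 2 in a three-node cluster, the two remaining nodes still form a quorum after one node goes down, so the downed node's shards are expected to be reallocated to them.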

Sometimes this works, but around 50% of the time I get the following exception:

[ERROR][17/05/2018 13:37:53][Thread 0020][[akka://akka-cluster-server/system/sharding/my-actorCoordinator/singleton/coordinator#1726064772]] Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [9] for persistenceId [/system/sharding/my-actorCoordinator/singleton/coordinator]
Cause: System.ArgumentException: Shard 1 is already allocated
Parameter name: e
  at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
  at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
  at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
  at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

Aaronontheweb · 2018-05-18T12:43:14Z

Thanks! We'll look into it.

Aaronontheweb · 2018-05-22T20:33:29Z

I think the bug here is just in Cluster.Sharding itself, so I'll see if we can recreate this locally.

danielab · 2018-09-03T13:47:26Z

Version: Akka.Cluster.Sharding 1.3.8-beta66
(Akka.NET version: 1.3.8)

We are using persistence with BigTable (https://github.com/hafslundnett/hn-akka-persistence-bigtable)
Split-brain-resolver : active-strategy : keep-majority
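
Written out as HOCON, that setting would presumably look something like the fragment below. This is a sketch based on the Akka.NET split brain resolver documentation, not the configuration actually deployed; the `downing-provider-class` line is assumed to be present, since the resolver is not active without it.

```hocon
akka.cluster {
  # Assumed: the resolver must be registered as the downing provider
  downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"

  split-brain-resolver {
    # Keep the partition that holds the majority of nodes, down the rest
    active-strategy = keep-majority
  }
}
```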

I think we got a similar error:


[WARNING][09/03/2018 07:54:57][Thread 0085][[akka://meteringpoint-system/system/sharding/MeteringPointActor#658978779]] Trying to register to coordinator at [/system/sharding/MeteringPointActorCoordinator/singleton/coordinator], but no acknowledgement. Total [11700] buffered messages.
[ERROR][09/03/2018 07:54:57][Thread 0102][[akka://meteringpoint-system/system/sharding/MeteringPointActorCoordinator/singleton/coordinator#131635967]] Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [128] for persistenceId [/system/sharding/MeteringPointActorCoordinator/singleton/coordinator]
Cause: System.ArgumentException: Shard 76 is already allocated
Parameter name: e
  at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
  at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
  at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
  at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

We are running with 8 nodes (akka-seed-0 to akka-seed-7), where akka-seed-0 and akka-seed-1 are seed nodes.

It seems the error occurred after akka-seed-0 and akka-seed-4 were restarted at about the same time (04:00:07 and 04:00:10).

It seems that the nodes that went down are not able to register with the coordinator again.
Logs are attached.

akka-seed-0.txt
akka-seed-1.txt
akka-seed-2.txt
akka-seed-3.txt
akka-seed-4.txt
akka-seed-5.txt
akka-seed-6.txt
akka-seed-7.txt

zbynek001 · 2018-09-03T14:04:53Z

Looks like the same issue as #3204.

Aaronontheweb · 2019-12-10T21:04:15Z

Closed as part of Akka.NET v1.3.12.

Aaronontheweb added the akka-cluster-sharding and potential bug labels May 18, 2018

Aaronontheweb added this to the 1.3.8 milestone May 22, 2018

Aaronontheweb mentioned this issue May 23, 2018: Exception in PersistentShardCoordinator ReceiveRecover #3414 (closed)

marcpiechura modified the milestones: 1.3.8, 1.3.9 Jun 5, 2018

Aaronontheweb modified the milestones: 1.3.9, 1.3.10 Sep 5, 2018

Aaronontheweb modified the milestones: 1.3.10, 1.3.11, 1.3.12 Dec 14, 2018

Aaronontheweb closed this as completed Dec 10, 2019
