Split brain resolver intermittently fails to recreate shard #3455

Closed
raypurchasett opened this issue May 17, 2018 · 5 comments

Comments

@raypurchasett
Contributor

Version: Akka.Cluster.Sharding 1.3.6-beta62

I've got a three node cluster with a very simple sharded entity. I'm testing the static-quorum split brain resolver strategy and have hit a bug.

The quorum size is set to 2 so when I bring down one node I expect the shard to migrate to one of the UP nodes. I'm also using SQL Server persistence. 'Auto down' is not enabled.
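For readers trying to reproduce this, a setup like the one described might look roughly like the HOCON sketch below. This is not the reporter's actual configuration: the key names follow the Akka.NET 1.3.x split brain resolver settings as best understood, and the `stable-after` value is purely an illustrative assumption.

```hocon
akka.cluster {
  # 'Auto down' is not enabled, as stated in the report
  auto-down-unreachable-after = off

  # Assumed provider class name for the 1.3.x split brain resolver
  downing-provider-class = "Akka.Cluster.SplitBrainResolver, Akka.Cluster"

  split-brain-resolver {
    active-strategy = static-quorum

    # Illustrative value only; the report does not state it
    stable-after = 20s

    static-quorum {
      # Three-node cluster, so a partition needs 2 members to survive
      quorum-size = 2
    }
  }
}
```

With a quorum size of 2 in a three-node cluster, taking one node down leaves the remaining two above the quorum, so they should stay up and the unreachable node should be downed.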

Sometimes this works, but around 50% of the time I get the following exception:

[ERROR][17/05/2018 13:37:53][Thread 0020][[akka://akka-cluster-server/system/sharding/my-actorCoordinator/singleton/coordinator#1726064772]] Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [9] for persistenceId [/system/sharding/my-actorCoordinator/singleton/coordinator]
Cause: System.ArgumentException: Shard 1 is already allocated
Parameter name: e
  at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
  at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
  at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
  at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

@Aaronontheweb
Member

Thanks! We'll look into it.

@Aaronontheweb Aaronontheweb added this to the 1.3.8 milestone May 22, 2018
@Aaronontheweb
Member

I think the bug here is just in Cluster.Sharding itself, so I'll see if we can recreate this locally.

@danielab

danielab commented Sep 3, 2018

Version: Akka.Cluster.Sharding 1.3.8-beta66
(Akka.Net Version: 1.3.8)

We are using persistence with BigTable (https://github.com/hafslundnett/hn-akka-persistence-bigtable)
Split-brain-resolver : active-strategy : keep-majority

I think we got a similar error:


[WARNING][09/03/2018 07:54:57][Thread 0085][[akka://meteringpoint-system/system/sharding/MeteringPointActor#658978779]] Trying to register to coordinator at [/system/sharding/MeteringPointActorCoordinator/singleton/coordinator], but no acknowledgement. Total [11700] buffered messages.
[ERROR][09/03/2018 07:54:57][Thread 0102][[akka://meteringpoint-system/system/sharding/MeteringPointActorCoordinator/singleton/coordinator#131635967]] Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [128] for persistenceId [/system/sharding/MeteringPointActorCoordinator/singleton/coordinator]
Cause: System.ArgumentException: Shard 76 is already allocated
Parameter name: e
  at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
  at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
  at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
  at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)
 

We are running with 8 nodes (akka-seed-0 to akka-seed-7), where akka-seed-0 and akka-seed-1 are seed nodes.

It seems the error occurred after akka-seed-0 and akka-seed-4 were restarted at about the same time (04:00:07 and 04:00:10).

It seems the nodes that went down are not able to register with the coordinator again.
Logs are attached.

akka-seed-0.txt
akka-seed-1.txt
akka-seed-2.txt
akka-seed-3.txt
akka-seed-4.txt
akka-seed-5.txt
akka-seed-6.txt
akka-seed-7.txt

@zbynek001
Contributor

Looks like the same issue as #3204.

@Aaronontheweb Aaronontheweb modified the milestones: 1.3.9, 1.3.10 Sep 5, 2018
@Aaronontheweb Aaronontheweb modified the milestones: 1.3.10, 1.3.11, 1.3.12 Dec 14, 2018
@Aaronontheweb
Member

Closed as part of Akka.NET v1.3.12.
