Exception in PersistentShardCoordinator ReceiveRecover #3414

Closed
joshgarnett opened this issue Apr 22, 2018 · 15 comments · Fixed by #3744

Comments

@joshgarnett
Contributor

Akka.NET 1.3.5

This morning, while making some provisioning changes, we ended up in a state where two single-node clusters were running against the same database. After fixing the error and starting only a single node, the underlying Akka code failed to recover:

2018-04-22 15:28:45.148 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [40650] for persistenceId [/system/sharding/zCoordinator/singleton/coordinator]
System.ArgumentException: Region [akka://AkkaCluster/system/sharding/z#673731278] not registered
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

2018-04-22 15:28:45.438 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [50633] for persistenceId [/system/sharding/oCoordinator/singleton/coordinator]
System.ArgumentException: Shard 78 is already allocated
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

2018-04-22 15:29:07.714 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [11774] for persistenceId [/system/sharding/pCoordinator/singleton/coordinator]
System.ArgumentException: Region [akka://AkkaCluster/system/sharding/p#1560991559] not registered
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

2018-04-22 15:29:10.192 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [3087] for persistenceId [/system/sharding/wCoordinator/singleton/coordinator]
System.ArgumentException: Region [akka://AkkaCluster/system/sharding/w#43619595] not registered
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

2018-04-22 15:29:15.210 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [11109] for persistenceId [/system/sharding/eCoordinator/singleton/coordinator]
System.ArgumentException: Region [akka://AkkaCluster/system/sharding/e#1963167556] not registered
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

My expectation is that when two nodes attempt to own the same data, one of them would eventually see a journal write error, because the journal sequence numbers would no longer be unique, and the ActorSystem would then shut itself down. On recovery, it should always be able to get back into a consistent state.

In our case, it was caused by a user error, but this could easily occur in the case of a network partition where two nodes claim to own the same underlying dataset.
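
To illustrate the expectation above, here is a minimal sketch of a journal that rejects duplicate sequence numbers. It is a toy in-memory model, not Akka.Persistence code: `UniqueKeyJournal`, `Writer`, and `TryPersist` are hypothetical names, and whether a real journal plugin enforces such a constraint depends on the plugin and its schema.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory journal (requires C# 9 or later). It enforces a unique
// (persistenceId, sequenceNr) key, similar in spirit to a primary-key constraint.
public sealed class UniqueKeyJournal
{
    private readonly Dictionary<(string PersistenceId, long SequenceNr), string> _rows = new();

    public void Write(string persistenceId, long sequenceNr, string payload)
    {
        // TryAdd returns false when the key is already present.
        if (!_rows.TryAdd((persistenceId, sequenceNr), payload))
            throw new InvalidOperationException(
                $"Duplicate sequence number {sequenceNr} for {persistenceId}");
    }
}

// Hypothetical writer: each node tracks its own view of the last sequence number.
public sealed class Writer
{
    private readonly UniqueKeyJournal _journal;
    private readonly string _persistenceId;
    private long _lastSequenceNr;

    public Writer(UniqueKeyJournal journal, string persistenceId, long recoveredSequenceNr)
    {
        _journal = journal;
        _persistenceId = persistenceId;
        _lastSequenceNr = recoveredSequenceNr;
    }

    // Returns false when the journal rejects the write; in this model that is
    // the signal for the losing node to stop.
    public bool TryPersist(string payload)
    {
        try
        {
            _journal.Write(_persistenceId, _lastSequenceNr + 1, payload);
            _lastSequenceNr++;
            return true;
        }
        catch (InvalidOperationException)
        {
            return false;
        }
    }
}
```

With two `Writer` instances that recovered at the same sequence number, the first `TryPersist` succeeds and the second returns `false`, which is the point at which I would expect the losing node to shut down.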

@Aaronontheweb
Member

Going to look into this while I'm at it with #3455

@Aaronontheweb
Member

The issue is that the PersistentShardCoordinator doesn’t save or recover its state in the correct order - the ShardHomeAllocated messages that are tripping the exception during recovery should only be persisted after a ShardRegionRegistered message (shards belong to shard regions.)

Three possible causes of this:

  1. There’s a fair bit of async code inside the PersistentShardCoordinator I’m still untangling - it’s possible a race condition could cause this if the actor were trying to Persist its events out of order. I’m still looking into that possibility. Technically, the actor should never even be asked to host a shard until its region gets created first. I doubt this is the issue, but I can’t rule it out 100%.
  2. I’m wondering if the data we write to Akka.Persistence when we save the sharding snapshot is accurate. I took a look through the serialization code and I’m not fully convinced that it’s persisting the region state correctly. That would also cause this issue: the data was saved, but not in the correct format.
  3. Last issue is the Akka.Persistence implementation itself: if the journal isn’t saving or replaying events for the PersistentShardCoordinator in the correct order, that would certainly cause this.

Going to eliminate number 2 first since that's the simplest - will look into the others next.
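
The ordering constraint described above can be modeled in a few lines. This is a simplified sketch, not the real `PersistentShardCoordinator.State`: the event and class names below are stand-ins that only mirror the two checks seen in the stack traces ("not registered" and "already allocated").

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-ins for the coordinator's domain events (requires C# 9 or later).
public abstract record DomainEvent;
public sealed record ShardRegionRegistered(string Region) : DomainEvent;
public sealed record ShardHomeAllocated(string Shard, string Region) : DomainEvent;

// Toy model of the coordinator state: a region must be registered
// before any shard can be allocated to it.
public sealed class CoordinatorState
{
    private readonly Dictionary<string, List<string>> _regions = new();
    private readonly Dictionary<string, string> _shards = new();

    public void Apply(DomainEvent e)
    {
        switch (e)
        {
            case ShardRegionRegistered registered:
                if (_regions.ContainsKey(registered.Region))
                    throw new ArgumentException($"Region {registered.Region} already registered", nameof(e));
                _regions[registered.Region] = new List<string>();
                break;
            case ShardHomeAllocated allocated:
                if (!_regions.TryGetValue(allocated.Region, out var regionShards))
                    throw new ArgumentException($"Region {allocated.Region} not registered", nameof(e));
                if (_shards.ContainsKey(allocated.Shard))
                    throw new ArgumentException($"Shard {allocated.Shard} is already allocated", nameof(e));
                _shards[allocated.Shard] = allocated.Region;
                regionShards.Add(allocated.Shard);
                break;
            default:
                throw new ArgumentException($"Unknown event {e}", nameof(e));
        }
    }

    // Replays a journal in order. Returns the first failure message,
    // or an empty string when the whole journal replays cleanly.
    public static string Replay(IEnumerable<DomainEvent> journal)
    {
        var state = new CoordinatorState();
        try
        {
            foreach (var e in journal) state.Apply(e);
            return string.Empty;
        }
        catch (ArgumentException ex)
        {
            return ex.Message;
        }
    }
}
```

Replaying a `ShardHomeAllocated` before its `ShardRegionRegistered`, or replaying the same allocation twice, fails in this model the same way the logs above do. Any of the three causes listed would produce a journal that looks like that on replay.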

@Aaronontheweb
Member

Manually verified the output of this spec:

    [Fact]
    public void ClusterShardingMessageSerializer_must_be_able_to_serializable_ShardCoordinator_snapshot_State()
    {
        var shards = ImmutableDictionary
            .CreateBuilder<string, IActorRef>()
            .AddAndReturn("a", region1)
            .AddAndReturn("b", region2)
            .AddAndReturn("c", region2)
            .ToImmutableDictionary();
        var regions = ImmutableDictionary
            .CreateBuilder<IActorRef, IImmutableList<string>>()
            .AddAndReturn(region1, ImmutableArray.Create("a"))
            .AddAndReturn(region2, ImmutableArray.Create("b", "c"))
            .AddAndReturn(region3, ImmutableArray<string>.Empty)
            .ToImmutableDictionary();
        var state = new PersistentShardCoordinator.State(
            shards: shards,
            regions: regions,
            regionProxies: ImmutableHashSet.Create(regionProxy1, regionProxy2),
            unallocatedShards: ImmutableHashSet.Create("d"));
        CheckSerialization(state);
    }

Can vouch for its accuracy - the sharding serializer appears to be working correctly.

@Aaronontheweb
Member

This is probably the issue #3204

Going to create some reproduction specs and then see where things go.

@Aaronontheweb
Member

Working on a fun reproduction of this using an actual integration test against SQL Server spun up via docker-compose https://github.com/Aaronontheweb/AkkaClusterSharding3414Repro

@Aaronontheweb Aaronontheweb modified the milestones: 1.3.12, 1.4.0 Mar 18, 2019
@izavala
Contributor

izavala commented Mar 18, 2019

I was able to reproduce this issue on my end with my copy of the above project: https://github.com/Aaronontheweb/AkkaClusterSharding3414Repro.

I received the same recovery failure message:

[ERROR][03/18/2019 23:34:59][Thread 0003][[akka://ShardFight/system/sharding/fubersCoordinator/singleton/coordinator#329445421]] Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [113] for persistenceId [/system/sharding/fubersCoordinator/singleton/coordinator]
sharding.shard_1 | Cause: System.ArgumentException: Shard 23 is already allocated
sharding.shard_1 | Parameter name: e
sharding.shard_1 |    at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
sharding.shard_1 |    at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
sharding.shard_1 |    at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)

I've attached the data from the EventJournal database in the hope of finding more information about what is causing this behavior.
Journal.zip

@Aaronontheweb
Member

@izavala I'll deserialize the data gathered from the repo here and see what's up - that should paint a clearer picture as to what's going on.

@Aaronontheweb
Member

Wrote a custom tool using Akka.Persistence.Query to read the dataset that produced this error: https://github.com/Aaronontheweb/Cluster.Sharding.Viewer

Attached is the output. Haven't analyzed it yet, but this is the same data from @izavala's reproduction.

Shard-replay-crash-data.log

@Aaronontheweb
Member

Worth noting in these logs: no Snapshots were ever saved for the PersistentShardCoordinator under this run - it logged 30-34 "ShardHomeAllocated" messages per run using our reproduction app. The default Akka.Cluster.Sharding settings have us only take a snapshot once every 1000 journaled entries.
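
As a rough illustration of why no snapshot appeared: assuming the coordinator snapshots whenever its sequence number reaches a non-zero multiple of the configured interval (that rule is an assumption here, and `SnapshotPolicy` is a hypothetical helper, not an Akka.NET type), a journal holding only a few dozen to a few hundred events never crosses the default threshold of 1000.

```csharp
public static class SnapshotPolicy
{
    // Assumed rule: snapshot when the sequence number is a non-zero
    // multiple of the configured interval.
    public static bool ShouldSnapshot(long lastSequenceNr, int snapshotAfter) =>
        lastSequenceNr != 0 && lastSequenceNr % snapshotAfter == 0;

    // Counts how many snapshots would have been taken after writing
    // eventsWritten events with sequence numbers 1..eventsWritten.
    public static int CountSnapshots(long eventsWritten, int snapshotAfter)
    {
        var count = 0;
        for (long seq = 1; seq <= eventsWritten; seq++)
        {
            if (ShouldSnapshot(seq, snapshotAfter)) count++;
        }
        return count;
    }
}
```

Under that assumption, `SnapshotPolicy.CountSnapshots(113, 1000)` is 0, so recovery in the reproduction has to replay every journaled event from the start rather than from a snapshot.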

@Aaronontheweb
Member

So the logs we've produced confirm that #3204 is the issue - the exception in recovery only occurs when it's the same node with the same address trying to deserialize its own RemoteActorRefs each time. The issue doesn't occur when the node reboots using a new hostname when we tear down our Docker cluster and recreate it.

@Aaronontheweb Aaronontheweb removed this from the 1.4.0 milestone Mar 22, 2019
@Aaronontheweb Aaronontheweb added this to the 1.3.13 milestone Mar 22, 2019
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue Mar 22, 2019
@Aaronontheweb Aaronontheweb modified the milestones: 1.3.13, 1.4.0 Mar 26, 2019
@Aaronontheweb
Member

Moving this to 1.4.0 - changes are too big to put into a point release. We're going to need to make a lot of changes to the serialization system for IActorRefs to complete this.

@heatonmatthew
Contributor

Hey cool, I've run into this one too. We're still in a prototype phase, but it was on my list of issues to address as we move toward production.

@Aaronontheweb Since you're making serialization system changes, just a heads up that with your netstandard2.0 update in #3668 the difference between Framework and Core disappears. See my commit referencing the issue for the code that removes the difference.

@Aaronontheweb
Member

I've been able to verify via Aaronontheweb/AkkaClusterSharding3414Repro#10 that #3744 resolves this issue. I'm not done with #3744 yet - still need to make sure this works with serialize-messages and so on, but we're getting there.

Aaronontheweb added a commit that referenced this issue Jul 18, 2019
* fixed typo in RemoteActorRefProvider comment

* Working on #3414 - bringing SerializeWithTransport API up to par with JVM

* added spec to help validate CurrentTransportInformation issues

Based on the equivalent JVM spec

* working on bringing serialization up to snuff

* brought serialization class up to snuff

* wrapping up RemoteActorRefProvider implementation

* WIP

* cleaning up Serialization class

* looks like there's a Lazy<SerializationInfo> translation from Scala to C# that we haven't quite done

* fixed Serialization class

* fixed bug with Akka.Remote.Serialization.SerializationTransportInformationSpec

* forced a couple of specs using default akka.remote configs to run sequentially

This was done in order to avoid the two specs trying to bind on the same port at the same time.

* added serialization verification to the Akka.Persistence.TCK

* fixed issues with default Akka.Persistence.TCK specs

* fixed IActorRef serialization support in Akka.Persistence journals and snapshot stores

* fixed compilation issues

* fixed Akka.Sql.Common serialization in a backwards-compatible fashion

* had to disable serialization specs for Sql Journals

* Added API approvals

* updated creator and serialize-all-messages serialization

* added ITestOutputHelper to Akka.Cluster.Sharding.Tests.SupervisionSpec

* made changes to LocalSnapshotSerializer

* fixed bug in WithTransport method

* updated Akka.Remote MessageSerializer
@Aaronontheweb
Member

This is now resolved as of #3744

Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue Jul 21, 2019
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue Jul 21, 2019
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue Jul 26, 2019
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue Jul 30, 2019
@Caldas

Caldas commented Jun 12, 2021

Hey guys, since this issue has been fixed, I recommend updating the README at https://github.com/petabridge/akkadotnet-cluster-workshop; the end of it still points to this issue as an active one.
