
Cluster sharding deserialization issue #3664

Closed
ctrlaltdan opened this issue Nov 28, 2018 · 5 comments

Comments

@ctrlaltdan

ctrlaltdan commented Nov 28, 2018

After using cluster sharding for a while in our CI environment, we occasionally see the following errors, which prevent the cluster from functioning:

Exception in ReceiveRecover when replaying event type ["Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated"] with sequence number [105] for persistenceId ["/system/sharding/customerCoordinator/singleton/coordinator"]

{
  "Depth": 0,
  "ClassName": "",
  "Message": "Region [akka.tcp://imburse@10.240.0.108:8081/system/sharding/customer#1665482693] not registered\nParameter name: e",
  "Source": "Akka.Cluster.Sharding",
  "StackTraceString": "   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)\n   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)\n   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)\n   at Akka.Persistence.Eventsourced.<>c__DisplayClass91_0.<Recovering>b__1(Receive receive, Object message)",
  "RemoteStackTraceString": "",
  "RemoteStackIndex": -1,
  "HResult": -2147024809,
  "HelpURL": null
}

We're unable to reproduce this error consistently; however, it appears to happen during a release. Our assumption is that one or more of our pods becomes unavailable during the release before the sharding event journal has been written in a good state. This leaves a corrupted event journal, which is what causes the problems above.

Currently the only remedy for this situation is to drop all sharding records from the event journal and let the system start from scratch.
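A rough sketch of what that cleanup looks like, assuming the default Akka.Persistence.SqlServer table names (EventJournal, SnapshotStore, Metadata) and the coordinator persistence id from the error above; verify the persistence ids in your own database before deleting anything:

```sql
-- Cleanup sketch: remove the sharding coordinator's persisted state so the
-- coordinator rebuilds from an empty journal on the next cluster start.
-- Table and column names assume the Akka.Persistence.SqlServer defaults.
DELETE FROM EventJournal
WHERE PersistenceId = '/system/sharding/customerCoordinator/singleton/coordinator';

DELETE FROM SnapshotStore
WHERE PersistenceId = '/system/sharding/customerCoordinator/singleton/coordinator';

-- Depending on the plugin version there may also be a Metadata table tracking the
-- highest sequence number per persistence id; clear the matching row there as well.
DELETE FROM Metadata
WHERE PersistenceId = '/system/sharding/customerCoordinator/singleton/coordinator';
```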

I have the data from the EventJournal table below.

| Ordering | PersistenceId | SequenceNr | Timestamp | IsDeleted | Manifest | Payload | Tags | SerializerId |
|---|---|---|---|---|---|---|---|---|
| 136 | /system/sharding/customerCoordinator/singleton/coordinator | 105 | 636783291061582824 | 0 | AF | 0x0A0233361248616B6B612E7463703A2F2F696D62757273654031302E3234302E302E3130383A383038312F73797374656D2F7368617264696E672F637573746F6D65722331363635343832363933 | NULL | 13 |
| 140 | /system/sharding/customerCoordinator/singleton/coordinator | 106 | 636784965012362487 | 0 | AB | 0x0A47616B6B612E7463703A2F2F696D62757273654031302E3234302E302E3132323A383038312F73797374656D2F7368617264696E672F637573746F6D657223323432333036343237 | NULL | 13 |
| 141 | /system/sharding/customerCoordinator/singleton/coordinator | 107 | 636784965012552987 | 0 | AC | 0x0A4C616B6B612E7463703A2F2F696D62757273654031302E3234302E302E35333A383038312F73797374656D2F7368617264696E672F637573746F6D657250726F78792332313137333131313037 | NULL | 13 |
| 142 | /system/sharding/customerCoordinator/singleton/coordinator | 108 | 636784965012782458 | 0 | AC | 0x0A4D616B6B612E7463703A2F2F696D62757273654031302E3234302E302E3132363A383038312F73797374656D2F7368617264696E672F637573746F6D657250726F78792331323135343337333535 | NULL | 13 |
| 143 | /system/sharding/customerCoordinator/singleton/coordinator | 109 | 636784965126035293 | 0 | AD | 0x0A46616B6B612E7463703A2F2F696D62757273654031302E3234302E302E33303A383038312F73797374656D2F7368617264696E672F637573746F6D657223393630383739383036 | NULL | 13 |
| 144 | /system/sharding/customerCoordinator/singleton/coordinator | 110 | 636784965229904859 | 0 | AB | 0x0A46616B6B612E7463703A2F2F696D62757273654031302E3234302E302E31343A383038312F73797374656D2F7368617264696E672F637573746F6D657223373231343333303930 | NULL | 13 |

System specs

  • Deployed to Kubernetes in docker containers (as pods)
  • Using netcoreapp2.0, specifically the microsoft/dotnet:2.0.9-runtime image to side-step dotnetty issues
  • Using the following package dependencies:
    • Akka 1.3.10
    • Akka.Bootstrap.Docker 0.1.3
    • Akka.Cluster.Sharding 1.3.10-beta
    • Akka.Cluster.Tools 1.3.10
    • Akka.Logger.Serilog 1.3.9
    • Akka.Persistence.SqlServer 1.3.7
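For context, a minimal sketch of the kind of Akka.Persistence.SqlServer configuration implied by those packages; the plugin ids and class names are the library defaults, while the connection string and exact values are illustrative assumptions rather than the actual config used here:

```hocon
# Illustrative sketch only; the connection string is a placeholder.
akka.persistence {
  journal {
    plugin = "akka.persistence.journal.sql-server"
    sql-server {
      class = "Akka.Persistence.SqlServer.Journal.SqlServerJournal, Akka.Persistence.SqlServer"
      connection-string = "<sql-server-connection-string>"
      auto-initialize = on
      table-name = EventJournal   # the table dumped above
    }
  }
  snapshot-store {
    plugin = "akka.persistence.snapshot-store.sql-server"
    sql-server {
      class = "Akka.Persistence.SqlServer.Snapshot.SqlServerSnapshotStore, Akka.Persistence.SqlServer"
      connection-string = "<sql-server-connection-string>"
      auto-initialize = on
      table-name = SnapshotStore
    }
  }
}
```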

Potentially related issues

#3414

#3204

@Aaronontheweb
Member

Thanks @ctrlaltdan - we'll take a look at this. It's extremely annoying that you have to re-dump all of that data. We'll fix that.

@Horusiath
Contributor

@ctrlaltdan do you even need to use Akka.Persistence here? There's an alternative mode which utilizes Akka.DistributedData for sharding. The only downside is that it doesn't let you use the remember-entities option (yet).

You can set it up with akka.cluster.sharding.state-store-mode = ddata.
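In HOCON that looks roughly like the following; only state-store-mode comes from the suggestion above, and the remember-entities line just restates the caveat:

```hocon
akka.cluster.sharding {
  # Keep coordinator state in Akka.DistributedData instead of the persistence journal,
  # so no coordinator events are written to the EventJournal table.
  state-store-mode = ddata

  # Caveat from above: remember-entities is not supported with ddata (yet).
  remember-entities = off
}
```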

@ctrlaltdan
Author

@Horusiath Yeah I'll give that a go when we schedule some time to upgrade our projects to the 1.3.11 release. We are using Akka.Persistence for saving our own state but we have no requirement to use remember-entities. Thanks for the tip.

Do you have any documentation weighing up the pros/cons of these two options? I'm pretty sold on avoiding SQL Server/external storage where possible. It would be good to understand any implications for the system if we use the ddata option.

@Aaronontheweb
Member

This issue and #3414 are definitely the same bug.

@Aaronontheweb
Member

This is now resolved as of #3744.
