Akka.Cluster.Sharding: make PersistentShardCoordinator a tolerant reader
#5604
Comments
We should merge this in first: #4629
I don't think there is any issue with the sharding itself when using Persistent mode.
@zbynek001 I think I know what folks are seeing here, as I've observed it in the past (and successfully alleviated it). In every case where I've encountered issues, I've been able to alleviate them by increasing the timeouts for the cluster shutdown stages as well as for the system itself, and by increasing the sharding recovery timeouts. It basically all becomes a fun math problem:
There may be others; it's probably worth a peek at where other sharding actions are wired into the shutdown process.
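For example, here is a rough sketch of the kind of timeout tuning I mean. The keys below are standard Akka.NET HOCON settings, but the specific values (and the system name) are purely illustrative - the right numbers depend entirely on your journal latency and deployment:

```csharp
using Akka.Actor;
using Akka.Configuration;

// Illustrative only: give coordinated shutdown and sharding recovery more
// headroom so the coordinator singleton can hand off and persist cleanly
// before a node actually exits. Tune every value to your own environment.
var config = ConfigurationFactory.ParseString(@"
    akka.coordinated-shutdown {
        default-phase-timeout = 15s
        phases.cluster-sharding-shutdown-region.timeout = 30s
    }
    akka.cluster.sharding {
        handoff-timeout = 120s            # time allowed for shard handoff on exit
        waiting-for-state-timeout = 10s   # waiting for initial coordinator state
        updating-state-timeout = 10s      # coordinator state writes to the journal
    }
");

var system = ActorSystem.Create("MySystem", config);
```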
I am encountering this issue as well, and unfortunately there aren't really any extension points available to change this behavior. The errors make sense for the update path, but not for the recovery path. The goal should be to get your shard coordinator back up and running if at all possible, and to do the best job possible of getting back to the previous state. Don't default to making the whole cluster crash and burn...
Hi @benbenwilde - I absolutely agree. The changes in #4629, which I just finished reviewing this morning, should definitely help, as it decouples the [...]. We should still try to fix the Akka.Persistence side of this as well. We're still debating whether to include that in Akka.NET v1.5 or in 1.4.* - that will be discussed at our Community Standup on Wednesday, March 9th: #5691 - watch links and agenda are there.
Merged in #4629 yesterday and it is part of Akka.NET v1.5 going forward.
Seeing some users running into this issue again, i.e. #5495 - and I'd like to find a better solution for it. I think there's an easy way to accomplish what @benbenwilde mentioned:
Going to take a stab at doing that with a relatively simple change against the v1.4 branch and see how that goes...
…located` messages close akkadotnet#5604
I think this doesn't solve the issue itself; it's just a workaround.
That's what I thought until the customer sent me these logs:
Pardon the formatting, but that's the exact same error.
Conversation with @ismaelhamed on how safe my proposed changes are: #5970 (comment)
@zbynek001 any ideas on how #5604 (comment) could occur?
Actually, the issue was slightly different - it was not visible in [...]. It would be nice to see both types of events.
@zbynek001 I've asked the customer for some more data from their event logs before we make a decision on whether or not to publish this change, but AFAIK with modern versions of Akka.Persistence most of those plugins have matters like serialization handled correctly.
Hmm, if it's not the serialization, then it's probably incorrect split brain handling - I can't think of anything else.
Data dump from the user with the issue:
No issues with split brains - the issue can happen with as few as 3 nodes during cluster startup with an empty event journal. That seems to indicate a write-side bug with the coordinator.
Really strange - was this with version 1.5 or 1.4?
This was with 1.4. I'm in the process of developing a model-based test with FsCheck to see if there's a mutability problem inside the coordinator's state. In case anyone is interested, here's a tutorial I wrote on how to do this in C#: https://aaronstannard.com/fscheck-property-testing-csharp-part2/
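Not the full FsCheck.Experimental model-based setup from the tutorial, but a minimal property sketch in the same spirit - all of the names here (`CoordinatorProperties`, `Allocate`) are hypothetical, and the property simply asserts that applying the same allocation event twice is a no-op on an immutable state:

```csharp
using System.Collections.Immutable;
using System.Linq;
using FsCheck;

public static class CoordinatorProperties
{
    // Hypothetical pure update: allocating a shard to a region is "last write wins".
    private static ImmutableDictionary<int, string> Allocate(
        ImmutableDictionary<int, string> shards, int shardId, string region) =>
        shards.SetItem(shardId, region);

    public static void Run()
    {
        Prop.ForAll<int[]>(shardIds =>
        {
            var once = ImmutableDictionary<int, string>.Empty;
            var twice = ImmutableDictionary<int, string>.Empty;

            foreach (var id in shardIds)
            {
                once = Allocate(once, id, "region-a");
                // Apply the same event twice - must not change the outcome.
                twice = Allocate(Allocate(twice, id, "region-a"), id, "region-a");
            }

            return once.Count == twice.Count &&
                   once.All(kv => twice[kv.Key] == kv.Value);
        }).QuickCheckThrowOnFailure();
    }
}
```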
I'm checking to see if there are two different [...]
Looks like this happened with Postgres. |
Is your feature request related to a problem? Please describe.
https://stackoverflow.com/questions/70651266/exception-of-shard-0-already-allocated/70668206#70668206 - this happens often with sharded `ActorSystem`s not shutting down properly, and it's a massive footgun. I feel like the current persistence design is inherently brittle and we've just accepted that as the status quo for years, but I think it could be better.
The problem occurs here: `akka.net/src/contrib/cluster/Akka.Cluster.Sharding/PersistentShardCoordinator.cs`, line 117 at commit 6187769.
If the previous state hadn't been properly disposed of when the previous `PersistentShardCoordinator` singleton was terminated, the new incarnation of this actor fails upon recovery - this is done in order to guarantee the consistency of shard allocation data, since that's prioritized by the sharding system. If duplicate records are encountered during recovery for any given `ShardRegion` or `ShardHome`, the recovery process fails and the underlying journal basically has to be repaired.
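To make the failure mode concrete, here's a minimal sketch of the kind of strict, throw-on-duplicate state update described above - not the actual Akka.NET source; `CoordinatorState` and `ShardHomeAllocated` are simplified stand-ins for the real nested types on `PersistentShardCoordinator`:

```csharp
using System;
using System.Collections.Immutable;

// Hypothetical, simplified stand-ins for the coordinator's recovered state and events.
public sealed record ShardHomeAllocated(string Shard, string Region);

public sealed record CoordinatorState(ImmutableDictionary<string, string> Shards)
{
    // Strict update: a duplicate encountered while replaying the journal throws,
    // which aborts recovery of the coordinator singleton entirely.
    public CoordinatorState Updated(ShardHomeAllocated evt)
    {
        if (Shards.ContainsKey(evt.Shard))
            throw new ArgumentException($"Shard {evt.Shard} is already allocated.");

        return this with { Shards = Shards.Add(evt.Shard, evt.Region) };
    }
}
```

One stale or duplicated event in the journal is enough to make every subsequent incarnation of the coordinator fail recovery the same way.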
Describe the solution you'd like
I think we can modify the recovery process to make this more robust - if we're recovering old records that are going to be overwritten by newer ones in the journal, we can make the evaluation at the very end of the recovery process as to whether the recovered state is healthy or not, rather than blowing up the entire recovery process and making the cluster essentially non-operable.
Essentially, what I'd like to do is model how to make the `PersistentShardCoordinator` recovery process idempotent and less brittle. Some examples:
- Duplicate `ShardRegionTerminated` messages don't matter - it's already dead.
- Duplicate `ShardRegionRegistered` messages don't matter - overwrite the old one with the new.
- Duplicate `ShardRegionProxyRegistered` messages don't matter - it's a no-op.

Where things get tricky are the `ShardHomeAllocated` messages - those need to be either evaluated at the end of recovery and checked for data consistency, or kept as-is. I'd prefer the former.
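Here's a rough sketch of what that tolerant, idempotent application could look like - purely illustrative, using hypothetical simplified event records rather than the real nested `PersistentShardCoordinator` event types, with the `ShardHomeAllocated` consistency check deferred until recovery completes:

```csharp
using System.Collections.Immutable;
using System.Linq;

// Hypothetical simplified events - stand-ins for the real coordinator events.
public abstract record DomainEvent;
public sealed record ShardRegionRegistered(string Region) : DomainEvent;
public sealed record ShardRegionProxyRegistered(string RegionProxy) : DomainEvent;
public sealed record ShardRegionTerminated(string Region) : DomainEvent;
public sealed record ShardHomeAllocated(string Shard, string Region) : DomainEvent;

public sealed record TolerantState(
    ImmutableHashSet<string> Regions,
    ImmutableHashSet<string> Proxies,
    ImmutableDictionary<string, string> Shards)
{
    public static readonly TolerantState Empty = new(
        ImmutableHashSet<string>.Empty,
        ImmutableHashSet<string>.Empty,
        ImmutableDictionary<string, string>.Empty);

    // Idempotent event application: duplicates never throw during recovery.
    public TolerantState Apply(DomainEvent evt) => evt switch
    {
        // Duplicate registration: just (re-)add the region.
        ShardRegionRegistered r => this with { Regions = Regions.Add(r.Region) },

        // Duplicate proxy registration: a no-op if it's already known.
        ShardRegionProxyRegistered p => this with { Proxies = Proxies.Add(p.RegionProxy) },

        // Terminating something that's already gone doesn't matter - it's already dead.
        ShardRegionTerminated t => this with
        {
            Regions = Regions.Remove(t.Region),
            Shards = Shards.RemoveRange(
                Shards.Where(kv => kv.Value == t.Region).Select(kv => kv.Key).ToList())
        },

        // A re-allocated shard overwrites the previous home; consistency is
        // validated once after RecoveryCompleted instead of throwing mid-recovery.
        ShardHomeAllocated a => this with { Shards = Shards.SetItem(a.Shard, a.Region) },

        _ => this
    };

    // Run once after RecoveryCompleted: every allocated shard must point at a live region.
    public bool IsConsistent() => Shards.Values.All(Regions.Contains);
}
```

If `IsConsistent()` fails at the end of recovery the coordinator can still fail fast, but it fails on the actual recovered picture rather than on the first stale event it happens to replay.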
Describe alternatives you've considered
Using `state-store-mode=ddata` is a viable workaround for this (see the config sketch below), but the problem is that DurableData does not play nicely at all with K8s due to the complexity of volume mounts, and it doesn't scale super well for `remember-entities=on` where entity counts are quite large.

https://github.com/petabridge/Akka.Cluster.Sharding.RepairTool is another option - but that's a heavy-duty solution to this problem that I'd rather make obsolete through updates to Akka.NET itself.
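For reference, a minimal sketch of the ddata workaround mentioned above - the setting names are the standard Akka.NET HOCON keys for cluster sharding, but treat the exact values and the system name as illustrative:

```csharp
using Akka.Actor;
using Akka.Configuration;

// Minimal sketch: keep the sharding coordinator's state in DistributedData
// instead of Akka.Persistence, sidestepping the journal-recovery problem.
var config = ConfigurationFactory.ParseString(@"
    akka.cluster.sharding {
        state-store-mode = ddata   # coordinator state via DistributedData
        remember-entities = off    # ddata scales poorly with very large entity counts
    }
");

var system = ActorSystem.Create("MySystem", config);
```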