
Akka.Cluster.Sharding: make PersistentShardCoordinator a tolerant reader #5604

Closed
Aaronontheweb opened this issue Feb 8, 2022 · 19 comments

@Aaronontheweb
Member

Is your feature request related to a problem? Please describe.
https://stackoverflow.com/questions/70651266/exception-of-shard-0-already-allocated/70668206#70668206 - this happens often with sharded ActorSystems not shutting down properly and it's a massive footgun.

I feel like the current persistence design is inherently brittle and we've just accepted that as the status quo for years, but I think it could be better.

The problem occurs here

If the previous state wasn't properly disposed of when the previous PersistentShardCoordinator singleton terminated, the new incarnation of the actor fails upon recovery. This is done to guarantee the consistency of the shard allocation data, which the sharding system prioritizes: if duplicate records are encountered during recovery for any given ShardRegion or ShardHome, the recovery process fails and the underlying journal essentially has to be repaired.
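
To make the failure mode concrete, here's a minimal, self-contained model of that strict update rule (hypothetical simplified types, not the actual Akka.NET classes) showing how a single duplicate journal record turns into a failed recovery:

```csharp
using System;
using System.Collections.Immutable;

// Hypothetical simplified model of the strict behaviour described above - not the Akka.NET source.
public sealed record CoordinatorState(ImmutableDictionary<string, string> ShardHomes)
{
    public CoordinatorState Updated(string shardId, string regionRef)
    {
        if (ShardHomes.ContainsKey(shardId))
            // Recovery replays events through the same strict update path used on the
            // write side, so one duplicate record aborts the whole recovery.
            throw new ArgumentException($"Shard {shardId} already allocated");
        return this with { ShardHomes = ShardHomes.Add(shardId, regionRef) };
    }
}

public static class StrictRecoveryDemo
{
    public static void Main()
    {
        var state = new CoordinatorState(ImmutableDictionary<string, string>.Empty);
        state = state.Updated("16", "akka.tcp://sys@node-0:5055/system/sharding/plans");
        // Replaying the same ShardHomeAllocated record a second time kills the coordinator:
        state = state.Updated("16", "akka.tcp://sys@node-0:5055/system/sharding/plans"); // throws ArgumentException
    }
}
```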

Describe the solution you'd like

I think we can modify the recovery process to make this more robust: if we're recovering old records that will be overwritten by newer ones later in the journal, we can defer the judgment of whether the recovered state is healthy until the very end of the recovery process, rather than blowing up the entire recovery and leaving the cluster essentially non-operable.

Essentially, what I'd like to do is work out how to make the PersistentShardCoordinator recovery process idempotent and less brittle.

Some examples:

  • Duplicate ShardRegionTerminated messages don't matter - it's already dead.
  • Duplicate ShardRegionRegistered messages don't matter - overwrite the old one with the new.
  • Duplicate ShardRegionProxyRegistered messages don't matter - it's a no-op.

Where things get tricky is the ShardHomeAllocated messages - those need to either be evaluated at the end of recovery and checked for data consistency, or kept as-is. I'd prefer the former; a sketch of what that could look like follows below.
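
Here's a minimal sketch of the tolerant version, again with hypothetical simplified types rather than the real coordinator state class - duplicates become no-ops or overwrites during replay, and the consistency check runs once at the end of recovery:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical simplified events and state - an illustration of the proposal,
// not the actual Akka.NET implementation.
public abstract record CoordinatorEvent;
public sealed record ShardRegionRegistered(string Region) : CoordinatorEvent;
public sealed record ShardRegionProxyRegistered(string Proxy) : CoordinatorEvent;
public sealed record ShardRegionTerminated(string Region) : CoordinatorEvent;
public sealed record ShardHomeAllocated(string Shard, string Region) : CoordinatorEvent;

public sealed class TolerantCoordinatorState
{
    public HashSet<string> Regions { get; } = new();
    public HashSet<string> RegionProxies { get; } = new();
    public Dictionary<string, string> ShardHomes { get; } = new();

    // Tolerant replay: every duplicate is either a no-op or an overwrite.
    public void Apply(CoordinatorEvent evt)
    {
        switch (evt)
        {
            case ShardRegionRegistered r: Regions.Add(r.Region); break;           // duplicate => no-op
            case ShardRegionProxyRegistered p: RegionProxies.Add(p.Proxy); break; // duplicate => no-op
            case ShardRegionTerminated t:                                         // duplicate => already dead
                Regions.Remove(t.Region);
                foreach (var shard in ShardHomes.Where(kv => kv.Value == t.Region).Select(kv => kv.Key).ToList())
                    ShardHomes.Remove(shard);
                break;
            case ShardHomeAllocated a: ShardHomes[a.Shard] = a.Region; break;     // duplicate => last write wins
        }
    }

    // Deferred consistency check, run once at the end of recovery instead of
    // blowing up mid-replay: every allocated shard must point at a known region.
    public IReadOnlyList<string> Validate() =>
        ShardHomes.Where(kv => !Regions.Contains(kv.Value))
                  .Select(kv => $"Shard {kv.Key} is allocated to unknown region {kv.Value}")
                  .ToList();
}
```

On RecoveryCompleted the coordinator could log whatever Validate() returns and drop or re-allocate the offending shards instead of failing the whole recovery.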

Describe alternatives you've considered

Using state-store-mode=ddata is a viable workaround for this, but the problem is that DurableData does not play nicely at all with K8s due to the complexity of volume mounts and doesn't scale super well for remember-entities=on where entity counts are quite large.

https://github.com/petabridge/Akka.Cluster.Sharding.RepairTool - but that's a heavy-duty solution to this problem that I'd rather make obsolete through updates to Akka.NET itself.


@Aaronontheweb
Member Author

We should merge #4629 in first.

@zbynek001
Contributor

zbynek001 commented Feb 10, 2022

I don't think there is any issue with sharding itself when using persistent mode.
Usually issues like this happen when there is a problem with incorrect serialization/deserialization of IActorRef objects. That's the only thing that can mess up the sharding state.

@to11mtm
Member

to11mtm commented Feb 13, 2022

@zbynek001 I think I know what folks are seeing here, as I've observed it in the past (and successfully alleviated it).

In every case I've encountered issues, I've been able to alleviate them by increasing the timeouts for cluster shutdown stages as well as the system itself, and increasing the timeouts of sharding recovery.

It basically all becomes a fun math problem:

  • sharding handoff-timeout - needs to be long enough to handle a slow/busy shutdown scenario.
  • split brain resolver stable-after - should be at least long enough to do a happy-path shutdown/handover; otherwise you might end up with two coordinators running at the same time.
  • cluster auto-down-unreachable-after - IIRC this should be pretty long; I don't remember whether it needs to be as long as stable-after (maybe?).
  • coordinated shutdown cluster-sharding-shutdown-region
    • This is the big footgun: the default is 10s, which is not enough if you're handing a lot over.
    • -1 to your roll if you are running multiple shard types and they share the same storage (e.g. all using the same journal table in SQL persistence).
  • coordinated shutdown cluster-exiting

There may be others - it's probably worth a peek at where other sharding actions are wired into the shutdown process. A rough sketch of where the settings above live in configuration follows below.
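
The durations here are placeholders to tune against your own shutdown/handover timings, not recommendations, and the setting paths should be double-checked against the reference config for your Akka.NET version:

```csharp
using Akka.Actor;
using Akka.Configuration;

// Placeholder values for illustration only - tune these against your own
// shutdown/handover timings and verify against your version's reference.conf.
var config = ConfigurationFactory.ParseString(@"
    akka.cluster.sharding.handoff-timeout = 60s             # long enough for a slow/busy handoff
    akka.cluster.split-brain-resolver.stable-after = 45s    # longer than a happy-path handover
    akka.cluster.auto-down-unreachable-after = off          # or a generous value if you rely on it

    # coordinated-shutdown phases - the sharding one defaults to 10s, which is the footgun above
    akka.coordinated-shutdown.phases.cluster-sharding-shutdown-region.timeout = 30s
    akka.coordinated-shutdown.phases.cluster-exiting.timeout = 30s
");

var system = ActorSystem.Create("sharded-system", config);
```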

@benbenwilde

I am encountering this issue as well. And unfortunately there are not really any available extension points to change this behavior. The errors make sense for the update path but not for the recovery path. The goal should be to get your shard coordinator back up and running if at all possible, and just do the best job possible of getting back to previous state. Don't default to making the whole cluster crash and burn...

@Aaronontheweb
Member Author

Aaronontheweb commented Mar 4, 2022

Hi @benbenwilde

I absolutely agree.

The changes in #4629, which I just finished reviewing this morning, should definitely help, as they decouple the coordinator state from the remember-entities state - best of both worlds (ddata for state-store-mode, persistence for remember-entities). The blow-ups occur when IActorRef deserialization fails under the persistent state-store-mode.

We should still try to improve the Akka.Persistence coordinator's recovery tolerance, but it's a much easier sell in a post-#4629 world since the persistence concerns are more cleanly separated there.

We're still debating whether to include that in Akka.NET v1.5 or in 1.4.* - and that will be discussed at our Community Standup on Wednesday, March 9th: #5691 - watch links and agenda are there.
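
For anyone following along, the post-#4629 split looks roughly like this (assuming the Akka.NET v1.5 setting names - verify against the reference config that ships with your Akka.Cluster.Sharding version):

```csharp
using Akka.Actor;
using Akka.Configuration;

// Assumes the Akka.NET v1.5 setting names for the decoupled stores - verify against
// the reference.conf shipped with your Akka.Cluster.Sharding package.
var config = ConfigurationFactory.ParseString(@"
    akka.cluster.sharding {
        state-store-mode = ddata                 # coordinator state via Distributed Data
        remember-entities = on
        remember-entities-store = eventsourced   # remember-entities state via Akka.Persistence
    }
");

var system = ActorSystem.Create("sharded-system", config);
```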

@Aaronontheweb
Member Author

Merged in #4629 yesterday and it is part of Akka.NET v1.5 going forward.

@Aaronontheweb Aaronontheweb reopened this May 26, 2022
@Aaronontheweb Aaronontheweb added this to the 1.4.39 milestone May 26, 2022
@Aaronontheweb
Member Author

Seeing some users running into this issue again, e.g. #5495 - and I'd like to find a better way to resolve it. I think there's an easy way to accomplish what @benbenwilde mentioned:

The goal should be to get your shard coordinator back up and running if at all possible, and just do the best job possible of getting back to previous state. Don't default to making the whole cluster crash and burn...

Going to take a stab at doing that with a relatively simple change against the v1.4 branch and see how that goes...

@zbynek001
Contributor

I think this doesn't solve the issue itself - it's just a workaround.
There is basically no other way for this to happen unless there is an issue with IActorRef serialization/deserialization,
e.g. something like this: #3204 (comment)
The easiest thing would be to look at the sharding events; it should be visible there.

@Aaronontheweb
Member Author

That's what I thought until I saw the customer send me these logs:

Looking at the journal, I see 2 ShardHomeAllocatedManifest events for shard 16, which appears to cause the issue.

| 830 | AF | 16Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |

and

| 834 | AF | 16Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |

Pardon the formatting, but that's the exact same ShardHomeAllocated event, with a fully serialized remote ActorPath, being persisted twice for the same shard. That shouldn't be possible if the ShardCoordinator's state is fully synced, so in terms of what could cause this on the write side - a split brain, perhaps? The customer didn't mention one, but this scenario could be handled by making the recovery idempotent.

@Aaronontheweb
Member Author

Conversation with @ismaelhamed on how safe my proposed changes are: #5970 (comment)

@Aaronontheweb
Member Author

@zbynek001 any ideas on how #5604 (comment) could occur?

@zbynek001
Contributor

Actually, the issue was slightly different. It was not visible in the ShardHomeAllocated events but in ShardRegionTerminated, which can be created during ShardCoordinator recovery - and the region address in that message was serialized incorrectly.

It would be nice to see both event types, ShardHomeAllocated and ShardRegionTerminated, when this issue is happening.
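
One way to pull those two event types out for inspection, sketched with Akka.Persistence.Query - the SQL read journal is an assumption here, and the persistence id placeholder needs to be replaced with whatever appears in the persistence_id column of your journal for the coordinator:

```csharp
using System;
using Akka.Actor;
using Akka.Persistence.Query;
using Akka.Persistence.Query.Sql;
using Akka.Streams;

// Sketch only: assumes a SQL-backed journal with Akka.Persistence.Query.Sql installed.
// Replace the placeholder with the coordinator's actual persistence_id from your journal table.
var system = ActorSystem.Create("journal-inspector");
var materializer = system.Materializer();

var readJournal = PersistenceQuery.Get(system)
    .ReadJournalFor<SqlReadJournal>(SqlReadJournal.Identifier);

await readJournal
    .CurrentEventsByPersistenceId("<coordinator-persistence-id>", 0L, long.MaxValue)
    .Where(env => env.Event.GetType().Name is "ShardHomeAllocated" or "ShardRegionTerminated")
    .RunForeach(env => Console.WriteLine($"{env.SequenceNr}\t{env.Event}"), materializer);
```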

@Aaronontheweb
Member Author

@zbynek001 I've asked the customer for some more data from their event logs before we decide whether or not to publish this change, but AFAIK most modern Akka.Persistence plugins have things like IActorRef serialization implemented correctly.

@zbynek001
Contributor

Hmm, if it's not the serialization, then it's probably incorrect split brain handling - I can't think of anything else.

@Aaronontheweb
Member Author

Data dump from the user with the issue:

| sequence_nr | manifest | convert_from |
| :--- | :--- | :--- |
| 898 | AF | �37�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 897 | AG | �37 |
| 896 | AF | �24�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 895 | AG | �24 |
| 894 | AF | �16�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 893 | AG | �16 |
| 892 | AF | �23�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 891 | AG | �23 |
| 890 | AF | �15�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 889 | AG | �15 |
| 888 | AF | �22�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 887 | AG | �22 |
| 886 | AF | �14�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 885 | AG | �14 |
| 884 | AF | �21�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 883 | AG | �21 |
| 882 | AF | �13�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 881 | AG | �13 |
| 880 | AF | �20�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 879 | AG | �20 |
| 878 | AF | �12�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 877 | AG | �12 |
| 876 | AF | �2�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 875 | AG | �2 |
| 874 | AF | �11�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 873 | AG | �11 |
| 872 | AF | �19�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 871 | AG | �19 |
| 870 | AF | �10�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 869 | AG | �10 |
| 868 | AF | �18�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 867 | AG | �18 |
| 866 | AF | �1�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 865 | AG | �1 |
| 864 | AF | �17�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 863 | AG | �17 |
| 862 | AF | �0�Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 861 | AG | �0 |
| 860 | AB | Uakka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#1970213401 |
| 859 | AF | �74�Uakka.tcp://hopper@hopperservice-1.hopperservice:5055/system/sharding/plans#1017041738 |
| 858 | AF | �47�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 857 | AF | �81�Uakka.tcp://hopper@hopperservice-1.hopperservice:5055/system/sharding/plans#1017041738 |
| 856 | AF | �63�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 855 | AF | �88�Uakka.tcp://hopper@hopperservice-1.hopperservice:5055/system/sharding/plans#1017041738 |
| 854 | AF | �53�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 853 | AF | �93�Uakka.tcp://hopper@hopperservice-1.hopperservice:5055/system/sharding/plans#1017041738 |
| 852 | AF | �66�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 851 | AF | �87�Uakka.tcp://hopper@hopperservice-1.hopperservice:5055/system/sharding/plans#1017041738 |
| 850 | AF | �69�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 849 | AF | �50�Uakka.tcp://hopper@hopperservice-1.hopperservice:5055/system/sharding/plans#1017041738 |
| 848 | AF | �7�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 847 | AF | �83�Uakka.tcp://hopper@hopperservice-1.hopperservice:5055/system/sharding/plans#1017041738 |
| 846 | AF | �85�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 845 | AF | �94�Uakka.tcp://hopper@hopperservice-1.hopperservice:5055/system/sharding/plans#1017041738 |
| 844 | AF | �72�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 843 | AF | �90�Uakka.tcp://hopper@hopperservice-1.hopperservice:5055/system/sharding/plans#1017041738 |
| 842 | AF | �79�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 841 | AF | �54�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 840 | AF | �5�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 839 | AF | �70�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 838 | AF | �96�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 837 | AF | �91�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 836 | AF | �71�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 835 | AF | �77�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 834 | AF | �16�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 833 | AF | �64�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 832 | AF | �61�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 831 | AF | �57�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 830 | AF | �16�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 829 | AF | �76�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 828 | AF | �6�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 827 | AF | �84�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 826 | AF | �98�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 825 | AF | �65�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 824 | AF | �58�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 823 | AF | �80�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 822 | AF | �68�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 821 | AF | �9�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 820 | AF | �73�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 819 | AF | �48�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 818 | AF | �52�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 817 | AF | �86�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 816 | AF | �78�Uakka.tcp://hopper@hopperservice-0.hopperservice:5055/system/sharding/plans#1757270690 |
| 815 | AD | Takka.tcp://hopper@hopperservice-2.hopperservice:5055/system/sharding/plans#165602300 |

No issues with split brains - the issue can happen with as few as 3 nodes during cluster startup with an empty event journal. That seems to indicate a write-side bug in the PersistentShardCoordinator.

@zbynek001
Contributor

Really strange - was this with version 1.5 or 1.4?
The logic that should prevent creating ShardHomeAllocated for an already-known shard is the same in both, so that's probably not related.

@Aaronontheweb
Member Author

This was with 1.4. I'm in the process of developing a model-based test with FsCheck to see if there's a mutability problem inside the State object itself. These are expensive to write, but they're the best way to exhaustively root out these "impossible" bugs. If my model-based test doesn't turn anything up in the State class itself, then the next-best culprit is likely how the actor is calling it.

In case anyone is interested, here's a tutorial I wrote on how to do this in C#: https://aaronstannard.com/fscheck-property-testing-csharp-part2/
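
For the curious, here's a compressed sketch of the idea - a plain FsCheck property over hypothetical simplified types, rather than the full FsCheck.Experimental state-machine API that the tutorial walks through:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using FsCheck;

// Hypothetical simplified types - a sketch of the property being tested, not the real State class.
public static class CoordinatorStateProperties
{
    // Simplified "tolerant" recovery: the last allocation wins instead of throwing on duplicates.
    static Dictionary<string, string> Recover(IEnumerable<(string Shard, string Region)> allocations)
    {
        var homes = new Dictionary<string, string>();
        foreach (var (shard, region) in allocations)
            homes[shard] = region;
        return homes;
    }

    public static void Main()
    {
        var allocationGen =
            from shard in Gen.Choose(0, 9)
            from region in Gen.Elements("region-a", "region-b", "region-c")
            select (Shard: shard.ToString(), Region: region);

        var eventsArb = Arb.From(Gen.ArrayOf(allocationGen));

        // Property: replaying a duplicate of the last journal record never changes the
        // recovered state, i.e. recovery is idempotent in the face of duplicate events.
        Prop.ForAll(eventsArb, events =>
        {
            if (events.Length == 0) return true;
            var withDuplicate = events.Concat(new[] { events[events.Length - 1] }).ToArray();
            var expected = Recover(events);
            var actual = Recover(withDuplicate);
            return expected.Count == actual.Count &&
                   expected.All(kv => actual.TryGetValue(kv.Key, out var v) && v == kv.Value);
        }).QuickCheckThrowOnFailure();
    }
}
```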

@Aaronontheweb
Member Author

I'm checking to see if there are two different writerUUID values for this, because otherwise - based on my model-based tests and a thorough review of the code involved - I'm not sure how this error could occur outside of a split brain situation. Even a split brain scenario seems far-fetched to me, since that would require the recovery to miss the ShardHomeAllocated event AND not have sequence number insertion problems (however, if the datastore being used is MongoDb or something else that doesn't support check constraints, that could do it).

Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this issue May 31, 2022
@Aaronontheweb
Member Author

Looks like this happened with Postgres.
