Akka.Cluster.Sharding: make PersistentShardCoordinator a tolerant reader
#5604
Comments
We should merge this in first: #4629
I don't think there is any issue with the sharding itself when using Persistent mode.
@zbynek001 I think I know what folks are seeing here, as I've observed it in the past (and successfully alleviated it). In every case where I've encountered issues, I've been able to alleviate them by increasing the timeouts for the cluster shutdown stages as well as for the system itself, and by increasing the sharding recovery timeouts. It basically all becomes a fun math problem:
There may be others; it's probably worth a peek at where other sharding actions are wired into the shutdown process.
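For example, here is a rough sketch of the kind of timeout tuning I mean. The keys below are standard Akka.NET HOCON settings, but the specific values (and the system name) are purely illustrative - the right numbers depend entirely on your journal latency and deployment:

```csharp
using Akka.Actor;
using Akka.Configuration;

// Illustrative only: give coordinated shutdown and sharding recovery more
// headroom so the coordinator singleton can hand off and persist cleanly
// before a node actually exits. Tune every value to your own environment.
var config = ConfigurationFactory.ParseString(@"
    akka.coordinated-shutdown {
        default-phase-timeout = 15s
        phases.cluster-sharding-shutdown-region.timeout = 30s
    }
    akka.cluster.sharding {
        handoff-timeout = 120s            # time allowed for shard handoff on exit
        waiting-for-state-timeout = 10s   # waiting for initial coordinator state
        updating-state-timeout = 10s      # coordinator state writes to the journal
    }
");

var system = ActorSystem.Create("MySystem", config);
```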
I am encountering this issue as well, and unfortunately there aren't really any extension points available to change this behavior. The errors make sense for the update path, but not for the recovery path. The goal should be to get your shard coordinator back up and running if at all possible, and to do the best job possible of getting back to the previous state. Don't default to making the whole cluster crash and burn...
Hi @benbenwilde - I absolutely agree. The changes in #4629, which I just finished reviewing this morning, should definitely help, as it decouples the [...]. We should still try to fix the Akka.Persistence side of this as well. We're still debating whether to include that in Akka.NET v1.5 or in 1.4.* - that will be discussed at our Community Standup on Wednesday, March 9th: #5691 - watch links and agenda are there.
Merged in #4629 yesterday and it is part of Akka.NET v1.5 going forward.
Seeing some users running into this issue again, i.e. #5495 - and I'd like to find a better solution for it. I think there's an easy way to accomplish what @benbenwilde mentioned:
Going to take a stab at doing that with a relatively simple change against the v1.4 branch and see how that goes...
…located` messages close akkadotnet#5604
I think this doesn't solve the issue itself; it's just a workaround.
That's what I thought until the customer sent me these logs:
Pardon the formatting, but that's the exact same error.
Conversation with @ismaelhamed on how safe my proposed changes are: #5970 (comment)
@zbynek001 any ideas on how #5604 (comment) could occur?
Actually, the issue was slightly different - it was not visible in [...]. It would be nice to see both types of events.
@zbynek001 I've asked the customer for some more data from their event logs before we make a decision on whether or not to publish this change, but AFAIK with modern versions of Akka.Persistence most of those plugins have matters like serialization handled correctly.
Hmm, if it's not the serialization, then it's probably incorrect split brain handling - I can't think of anything else.
Data dump from the user with the issue:
No issues with split brains - the issue can happen with as few as 3 nodes during cluster startup with an empty event journal. That seems to indicate a write-side bug with the coordinator.
Really strange - was this with version 1.5 or 1.4?
This was with 1.4. I'm in the process of developing a model-based test with FsCheck to see if there's a mutability problem inside the coordinator's state. In case anyone is interested, here's a tutorial I wrote on how to do this in C#: https://aaronstannard.com/fscheck-property-testing-csharp-part2/
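Not the full FsCheck.Experimental model-based setup from the tutorial, but a minimal property sketch in the same spirit - all of the names here (`CoordinatorProperties`, `Allocate`) are hypothetical, and the property simply asserts that applying the same allocation event twice is a no-op on an immutable state:

```csharp
using System.Collections.Immutable;
using System.Linq;
using FsCheck;

public static class CoordinatorProperties
{
    // Hypothetical pure update: allocating a shard to a region is "last write wins".
    private static ImmutableDictionary<int, string> Allocate(
        ImmutableDictionary<int, string> shards, int shardId, string region) =>
        shards.SetItem(shardId, region);

    public static void Run()
    {
        Prop.ForAll<int[]>(shardIds =>
        {
            var once = ImmutableDictionary<int, string>.Empty;
            var twice = ImmutableDictionary<int, string>.Empty;

            foreach (var id in shardIds)
            {
                once = Allocate(once, id, "region-a");
                // Apply the same event twice - must not change the outcome.
                twice = Allocate(Allocate(twice, id, "region-a"), id, "region-a");
            }

            return once.Count == twice.Count &&
                   once.All(kv => twice[kv.Key] == kv.Value);
        }).QuickCheckThrowOnFailure();
    }
}
```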
I'm checking to see if there are two different [...]
Looks like this happened with Postgres. |
Is your feature request related to a problem? Please describe.
https://stackoverflow.com/questions/70651266/exception-of-shard-0-already-allocated/70668206#70668206 - this happens often with sharded `ActorSystem`s not shutting down properly, and it's a massive footgun. I feel like the current persistence design is inherently brittle and we've just accepted that as the status quo for years, but I think it could be better.
The problem occurs here: `akka.net/src/contrib/cluster/Akka.Cluster.Sharding/PersistentShardCoordinator.cs`, line 117 at commit 6187769.
If the previous state hadn't been properly disposed of when the previous `PersistentShardCoordinator` singleton was terminated, the new incarnation of this actor fails upon recovery - this is done in order to guarantee the consistency of shard allocation data, since that's prioritized by the sharding system. If duplicate records are encountered during recovery for any given `ShardRegion` or `ShardHome`, the recovery process fails and the underlying journal basically has to be repaired.
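To make the failure mode concrete, here's a minimal sketch of the kind of strict, throw-on-duplicate state update described above - not the actual Akka.NET source; `CoordinatorState` and `ShardHomeAllocated` are simplified stand-ins for the real nested types on `PersistentShardCoordinator`:

```csharp
using System;
using System.Collections.Immutable;

// Hypothetical, simplified stand-ins for the coordinator's recovered state and events.
public sealed record ShardHomeAllocated(string Shard, string Region);

public sealed record CoordinatorState(ImmutableDictionary<string, string> Shards)
{
    // Strict update: a duplicate encountered while replaying the journal throws,
    // which aborts recovery of the coordinator singleton entirely.
    public CoordinatorState Updated(ShardHomeAllocated evt)
    {
        if (Shards.ContainsKey(evt.Shard))
            throw new ArgumentException($"Shard {evt.Shard} is already allocated.");

        return this with { Shards = Shards.Add(evt.Shard, evt.Region) };
    }
}
```

One stale or duplicated event in the journal is enough to make every subsequent incarnation of the coordinator fail recovery the same way.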
Describe the solution you'd like
I think we can modify the recovery process to make this more robust - if we're recovering old records that are going to be overwritten by newer ones in the journal, we can make the evaluation at the very end of the recovery process as to whether the recovered state is healthy or not, rather than blowing up the entire recovery process and making the cluster essentially non-operable.
Essentially, what I'd like to do is model how to make the `PersistentShardCoordinator` recovery process idempotent and less brittle. Some examples:
- Duplicate `ShardRegionTerminated` messages don't matter - it's already dead.
- Duplicate `ShardRegionRegistered` messages don't matter - overwrite the old one with the new.
- Duplicate `ShardRegionProxyRegistered` messages don't matter - it's a no-op.

Where things get tricky are the `ShardHomeAllocated` messages - those need to be either evaluated at the end of recovery and checked for data consistency, or kept as-is. I'd prefer the former.
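Here's a rough sketch of what that tolerant, idempotent application could look like - purely illustrative, using hypothetical simplified event records rather than the real nested `PersistentShardCoordinator` event types, with the `ShardHomeAllocated` consistency check deferred until recovery completes:

```csharp
using System.Collections.Immutable;
using System.Linq;

// Hypothetical simplified events - stand-ins for the real coordinator events.
public abstract record DomainEvent;
public sealed record ShardRegionRegistered(string Region) : DomainEvent;
public sealed record ShardRegionProxyRegistered(string RegionProxy) : DomainEvent;
public sealed record ShardRegionTerminated(string Region) : DomainEvent;
public sealed record ShardHomeAllocated(string Shard, string Region) : DomainEvent;

public sealed record TolerantState(
    ImmutableHashSet<string> Regions,
    ImmutableHashSet<string> Proxies,
    ImmutableDictionary<string, string> Shards)
{
    public static readonly TolerantState Empty = new(
        ImmutableHashSet<string>.Empty,
        ImmutableHashSet<string>.Empty,
        ImmutableDictionary<string, string>.Empty);

    // Idempotent event application: duplicates never throw during recovery.
    public TolerantState Apply(DomainEvent evt) => evt switch
    {
        // Duplicate registration: just (re-)add the region.
        ShardRegionRegistered r => this with { Regions = Regions.Add(r.Region) },

        // Duplicate proxy registration: a no-op if it's already known.
        ShardRegionProxyRegistered p => this with { Proxies = Proxies.Add(p.RegionProxy) },

        // Terminating something that's already gone doesn't matter - it's already dead.
        ShardRegionTerminated t => this with
        {
            Regions = Regions.Remove(t.Region),
            Shards = Shards.RemoveRange(
                Shards.Where(kv => kv.Value == t.Region).Select(kv => kv.Key).ToList())
        },

        // A re-allocated shard overwrites the previous home; consistency is
        // validated once after RecoveryCompleted instead of throwing mid-recovery.
        ShardHomeAllocated a => this with { Shards = Shards.SetItem(a.Shard, a.Region) },

        _ => this
    };

    // Run once after RecoveryCompleted: every allocated shard must point at a live region.
    public bool IsConsistent() => Shards.Values.All(Regions.Contains);
}
```

If `IsConsistent()` fails at the end of recovery the coordinator can still fail fast, but it fails on the actual recovered picture rather than on the first stale event it happens to replay.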
Describe alternatives you've considered
Using `state-store-mode=ddata` is a viable workaround for this (see the config sketch below), but the problem is that DurableData does not play nicely at all with K8s due to the complexity of volume mounts, and it doesn't scale super well for `remember-entities=on` where entity counts are quite large.

https://github.com/petabridge/Akka.Cluster.Sharding.RepairTool is another option - but that's a heavy-duty solution to this problem that I'd rather make obsolete through updates to Akka.NET itself.
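For reference, a minimal sketch of the ddata workaround mentioned above - the setting names are the standard Akka.NET HOCON keys for cluster sharding, but treat the exact values and the system name as illustrative:

```csharp
using Akka.Actor;
using Akka.Configuration;

// Minimal sketch: keep the sharding coordinator's state in DistributedData
// instead of Akka.Persistence, sidestepping the journal-recovery problem.
var config = ConfigurationFactory.ParseString(@"
    akka.cluster.sharding {
        state-store-mode = ddata   # coordinator state via DistributedData
        remember-entities = off    # ddata scales poorly with very large entity counts
    }
");

var system = ActorSystem.Create("MySystem", config);
```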