Akka.Cluster: unable to mark member as Leaving if another instance of member with same Address is being marked as DOWN #7370

Closed
Aaronontheweb opened this issue Oct 30, 2024 · 0 comments · Fixed by #7371

Version Information
Version of Akka.NET? v1.5.30
Which Akka.NET Modules? Akka.Cluster, Akka.Cluster.Sharding

Describe the bug

It's possible for multiple Members in the cluster to have the same Address - usually after a node is rebooted and the old incarnation of the node hasn't been evicted from the cluster yet. This is why Akka.Cluster uses a separate UniqueAddress construct - to help us identify when these types of situations occur and to distinguish between two instances of nodes with the same Address.

We recently fixed one bug caused by this non-uniqueness of Address in Akka.Cluster.Sharding with #7367, and Akka.Cluster's ClusterDaemon code itself also looks susceptible to the same kind of problem. For instance:

public void Leaving(Address address)
{
    // only try to update if the node is available (in the member ring)
    if (LatestGossip.Members.Any(m => m.Address.Equals(address) && m.Status is MemberStatus.Joining or MemberStatus.WeaklyUp or MemberStatus.Up))
    {
        // mark node as LEAVING
        var newMembers = LatestGossip.Members.Select(m =>
        {
            if (m.Address == address) return m.Copy(status: MemberStatus.Leaving);
            return m;
        }).ToImmutableSortedSet(); // mark node as LEAVING
        var newGossip = LatestGossip.Copy(members: newMembers);
        UpdateLatestGossip(newGossip);
        _cluster.LogInfo("Marked address [{0}] as [{1}]", address, MemberStatus.Leaving);
        PublishMembershipState();
        // immediate gossip to speed up the leaving process
        SendGossip();
    }
}

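This is a really subtle issue, but basically: we should be iterating through EACH of these members, THEN applying the condition, and THEN updating only the ones that match. Right now the guard only checks that some member with this Address is in a leavable state, while the Select then marks every member with that Address as Leaving, whatever its current status. The way the loop is designed right now is guaranteed to produce a "System.InvalidOperationException: Invalid member status transition Down -> Leaving" error if there are multiple instances of the node in the gossip at this time and one of them is already Down.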
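To make the per-member idea concrete, here is a minimal sketch. It is not the actual change from #7371, and it does not use the Akka.Cluster types: `Node`, `NodeStatus`, `WithStatus` and `MarkLeaving` are hypothetical stand-ins invented for illustration. The only behavior borrowed from the issue is that a Down -> Leaving transition throws.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for illustration only; these are NOT the Akka.Cluster types.
public enum NodeStatus { Joining, WeaklyUp, Up, Leaving, Down }

public sealed record Node(string Address, int Uid, NodeStatus Status)
{
    // Mimics the transition check described in the issue: Down -> Leaving is rejected.
    public Node WithStatus(NodeStatus next)
    {
        if (Status == NodeStatus.Down && next == NodeStatus.Leaving)
            throw new InvalidOperationException(
                $"Invalid member status transition {Status} -> {next}");
        return this with { Status = next };
    }
}

public static class LeavingSketch
{
    // A member may only be moved to Leaving from one of these states.
    private static bool CanLeave(Node n) =>
        n.Status is NodeStatus.Joining or NodeStatus.WeaklyUp or NodeStatus.Up;

    // Checks the address AND the status for each member individually, so a Down
    // incarnation that shares the same address is left untouched.
    public static IReadOnlyList<Node> MarkLeaving(IEnumerable<Node> members, string address) =>
        members
            .Select(n => n.Address == address && CanLeave(n)
                ? n.WithStatus(NodeStatus.Leaving)
                : n)
            .ToList();
}
```

With two incarnations of the same address in the list, one Down and one Up, `MarkLeaving` only touches the Up one and never attempts the invalid Down -> Leaving transition. Whether the real fix filters inside the projection like this or takes another route is something to check against #7371.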

Expected behavior

Should be able to change the status of Members without error even if there are multiple instances of the same Address in use inside Akka.Cluster.

Actual behavior

The daemons crash and Akka.Cluster destabilizes.

Environment

Environments with stable addresses (e.g. StatefulSets in Kubernetes or bare metal) are susceptible to this problem; dynamically addressed environments are not.
