Avoid shard id update of replica if not matching with primary shard id #573

hpatro · 2024-05-29T21:06:56Z

Shard_id shouldn't be updated for a replica if the shard_id for the primary is different.

During cluster setup, the shard id gets established through extensions data propagation. If the engine crashes/restarts while the reconciliation of the shard id is still in progress, the config file can be left corrupted, which causes the engine to fail on restart.

Scenario:

Let's say there are two nodes in a cluster, Node A and Node B, and all admin operations are performed on Node B. Node A and Node B finish the handshake but haven't shared the extensions information yet. Node B is then made a replica of Node A. When Node B shares the slaveof information, it also shares its temporary shard id. During regular packet processing in Node A, while handling the replication information, the shard id of Node A gets applied to Node B. Then, during extensions processing in Node A, the shard id passed by Node B is applied, which diverges from the shard id of Node A. A crash/restart at that point leaves the cluster configuration file in an unrecoverable, corrupted state.
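
A minimal sketch of the guard described above, assuming a trimmed-down node struct. `Node`, `apply_shard_id`, and `SHARD_ID_LEN` are hypothetical names used for illustration only; this is not the real `clusterNode` and not the actual patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SHARD_ID_LEN 40 /* assumed length, for illustration only */

/* Hypothetical, trimmed-down node: not the real clusterNode. */
typedef struct Node {
    char shard_id[SHARD_ID_LEN + 1];
    struct Node *replicaof; /* NULL when the node is a primary */
} Node;

/* Apply a shard id announced for `node`.
 * Returns true if the id was stored, false if it was ignored. */
bool apply_shard_id(Node *node, const char *announced) {
    if (node->replicaof != NULL &&
        strncmp(node->replicaof->shard_id, announced, SHARD_ID_LEN) != 0) {
        /* A replica only takes a shard id that its primary already has,
         * so the primary is always updated first. */
        return false;
    }
    size_t n = strlen(announced);
    if (n > SHARD_ID_LEN) n = SHARD_ID_LEN;
    memcpy(node->shard_id, announced, n);
    node->shard_id[n] = '\0';
    return true;
}
```

With this shape, a temporary shard id announced by Node B is dropped while B is a replica of A, and B only ends up with an id once it matches A's.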

PingXie · 2024-05-31T04:13:51Z

I am not sure I understand the event sequence that leads to a corrupt state. can you elaborate?

The change makes sense to me. Essentially with this change there is now an order in which the shard-id is updated in a shard: primary first and replicas next.

btw, this change also requires us to sequence the assignment of the primary before the invocation of updateShardId. This seems to be the case already at https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L3092 and https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L5194.

There are some timeout failures in the test pass though. that is a bit surprising.

hpatro · 2024-06-03T19:17:25Z

The scenario is slightly difficult to explain, I've tried my best to depict it (updated the main comment). @PingXie / @madolson have a look.

enjoy-binbin

With the picture in the top comment, I think I now understand the case. The changes LGTM; btw, the tests seem to keep failing.

codecov · 2024-06-04T18:09:15Z

Codecov Report

Attention: Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 70.31%. Comparing base (752b6ee) to head (770cfa9).
Report is 6 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #573      +/-   ##
============================================
+ Coverage     70.20%   70.31%   +0.10%     
============================================
  Files           111      111              
  Lines         60242    60243       +1     
============================================
+ Hits          42295    42360      +65     
+ Misses        17947    17883      -64

Files	Coverage Δ
src/cluster_legacy.c	`85.92% <80.00%> (+<0.01%)`	⬆️

... and 17 files with indirect coverage changes

hpatro · 2024-06-04T19:27:07Z

unit/cluster/manual-takeover seems to get stuck on the CI. Unable to reproduce locally so far. Trying to understand why it sometimes gets stuck with this change.

hpatro · 2024-06-10T18:51:53Z

> There are some timeout failures in the test pass though. that is a bit surprising.

From further investigation, the timeout failure happens from an infinite while loop within this block.

clusterNode *clusterNodeGetPrimary(clusterNode *node) {
    while (node->replicaof != NULL) node = node->replicaof;
    return node;
}

https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L5855C1-L5858C2

Looks like there could be a temporary invalid state in the cluster where node(s) can point to each other as primary/replica. We could take two approaches to this infinite loop:

1. Deep dive into why the invalid state is reached (cyclic replication state).
2. Avoid this loop, since chained replication isn't a valid configuration in cluster mode.
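
As a rough sketch of the second approach, assuming a stripped-down node with only a `replicaof` pointer: both helpers below are hypothetical (they are not functions in `cluster_legacy.c`), and `max_hops` is an arbitrary illustrative bound.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down node: only the field that matters here. */
typedef struct Node {
    struct Node *replicaof; /* NULL when the node is a primary */
} Node;

/* Single-hop lookup: relies on the assumption that chained replication is
 * not valid in cluster mode, so no walk is needed and no loop can spin. */
Node *get_primary_single_hop(Node *node) {
    return node->replicaof != NULL ? node->replicaof : node;
}

/* Bounded walk: follows replicaof at most max_hops times and returns NULL
 * if no primary was reached, which signals a cyclic or overly long chain. */
Node *get_primary_bounded(Node *node, int max_hops) {
    for (int hops = 0; node->replicaof != NULL; hops++) {
        if (hops >= max_hops) return NULL;
        node = node->replicaof;
    }
    return node;
}
```

Either variant terminates when two nodes point at each other; the bounded walk additionally lets the caller notice that the state is invalid.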

madolson · 2024-06-11T16:54:54Z

> Deep dive into why the invalid state is reached (cyclic replication state).

We have had multiple of these issues in the past, and I think we always tried to figure it out. Maybe we should use this chance to add a helper method for setting the replicaof so that we check for loops.

src/cluster_legacy.c

hpatro · 2024-06-12T20:16:02Z

> Deep dive into why the invalid state is reached (cyclic replication state).
>
> We have had multiple of these issues in the past, and I think we always tried to figure it out. Maybe we should use this chance to add a helper method for setting the replicaof so that we check for loops.

And if we detect a loop, do we crash?

madolson · 2024-06-12T20:59:06Z

> And if we detect a loop, do we crash?

Maybe we debug assert crash (as in, only crash during a test). For normal production, we unwind, or maybe ignore it and wait for the other node to update us.
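
One possible shape for such a helper, as a sketch only. `set_replicaof_checked`, `chain_reaches`, `MAX_CHAIN`, and the `DEBUG_ASSERT_LOOPS` build flag are all hypothetical names; the real code base has its own debug-assert mechanism, which is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down node. */
typedef struct Node {
    struct Node *replicaof; /* NULL when the node is a primary */
} Node;

#define MAX_CHAIN 16 /* arbitrary bound so the check itself cannot spin */

/* Returns true if following replicaof from `start` reaches `target`. */
bool chain_reaches(const Node *start, const Node *target) {
    int hops = 0;
    for (const Node *n = start; n != NULL && hops <= MAX_CHAIN;
         n = n->replicaof, hops++) {
        if (n == target) return true;
    }
    return false;
}

/* Set replica->replicaof = primary unless that would create a loop.
 * Returns false (and leaves the node untouched) when a loop is detected. */
bool set_replicaof_checked(Node *replica, Node *primary) {
    if (primary != NULL && chain_reaches(primary, replica)) {
#ifdef DEBUG_ASSERT_LOOPS
        /* Test builds only: fail loudly so the bad sequence gets noticed. */
        assert(!"replicaof loop detected");
#endif
        /* Production: ignore the update and wait for a later message. */
        return false;
    }
    replica->replicaof = primary;
    return true;
}
```

Routing every `replicaof` assignment through one function like this keeps the loop check in a single place instead of at each call site.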

PingXie

LGTM overall but would be great if you could provide some more context in the code comment (left a review feedback too)

src/cluster_legacy.c

PingXie · 2024-06-13T06:45:09Z

> The scenario is slightly difficult to explain, I've tried my best to depict it (updated the main comment). @PingXie / @madolson have a look.

Great diagram! Thanks @hpatro. This helps a lot.

PingXie · 2024-06-13T18:20:37Z

> And if we detect a loop, do we crash?
>
> Maybe we debug assert crash (as in, only crash during a test). For normal production, we unwind, or maybe ignore it and wait for the other node to update us.

debugAssert is reasonable but I don't think we should crash the server just because there is a loop. In fact, we have logic to break the loop already. I will suggest a fix in #609

tests/unit/cluster/shardid-propagation.tcl

madolson · 2024-07-01T05:22:52Z

@hpatro Sorry for taking so long to circle back on this, the DCO was failing last time and I forgot to ping you to update. I think this is good to merge otherwise.

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>

hpatro · 2024-07-01T18:52:04Z

@madolson Had to force push. PTAL.

madolson · 2024-07-02T22:20:50Z

https://github.com/valkey-io/valkey/actions/runs/9768739963

madolson

LGTM, just want to wait for some more comprehensive tests.

PingXie · 2024-07-03T06:05:56Z

We actually hit the replication cycle assert rather consistently in the test run @madolson shared above. This is something that I haven't seen before.

*** Crash report found in valkey_2/log.txt ***
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
44713:M 02 Jul 2024 22:52:22.118 # === ASSERTION FAILED ===
44713:M 02 Jul 2024 22:52:22.118 # ==> cluster_legacy.c:5879 'primary->replicaof == ((void *)0)' is not true

hpatro · 2024-07-03T15:43:00Z

> We actually hit the replication cycle assert rather consistently in the test run @madolson shared above. This is something that I haven't seen before.
>
> *** Crash report found in valkey_2/log.txt ***
> === VALKEY BUG REPORT START: Cut & paste starting from here ===
> 44713:M 02 Jul 2024 22:52:22.118 # === ASSERTION FAILED ===
> 44713:M 02 Jul 2024 22:52:22.118 # ==> cluster_legacy.c:5879 'primary->replicaof == ((void *)0)' is not true

Yeah, this change invokes the API more frequently. Someone needs to deep dive further to understand how we reach this state.

madolson · 2024-07-06T18:03:30Z

> Yeah, this change invokes the API more frequently. Someone needs to deep dive further to understand how we reach this state.

I deep dived it with an AWS engineer last week, I have a partial fix and will post it early next week.

PingXie · 2024-07-07T01:46:40Z

I took a look too and realized it’s a regression introduced by my slot migration PR #445. This change started allowing a replica to report its primary’s slot states and trigger clusterUpdateSlotsConfigWith.

PR #445 - Slot Migration Changes.

Here's what I think happens in these test failures involving a 3-node shard:

[T1] - Node A, B, and C are in the same shard with A as the primary.
[T2] - Node A loses its primaryship to B via a graceful/manual failover.
[T3] - After winning the election, B broadcasts the news to every node in the cluster, including C.
[T4] - C receives B's latest PING message and correctly registers B as its new primary.
[T5] - C then sends a new PING message to A, claiming B is its primary with all the slots.
[T6] - A still hasn't received B's broadcast message from [T3], and C's PING message from [T5] arrives at A.
And this is where things go wrong—a replicaof cycle is created.

At this point, A still thinks it’s the primary of the shard, and B->replicaof == A. Since C is still a replica (as before), the role change handling logic doesn’t apply. So, A enters clusterUpdateSlotsConfigWith using C’s slot information (which is up to date with B’s). More importantly, B is passed in as the sender while at the same time A assumes B->replicaof == A. The slot ownership update logic correctly gives the ownership of the slots to B. Now, because A loses all its slots to B, which is in the same shard with a higher config epoch, A is demoted to a replica of the winner, B. And now with this PR, we set A->replicaof = B, completing the replicaof cycle.
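
The sequence at [T6] can be modelled with a toy version of A's local view. Everything below is a hypothetical illustration (`View`, `handle_ping`, and the `only_primary_drives` switch are made up for this sketch); the guard is only meant to show the general idea of letting primaries, not replicas, drive slot ownership updates, and a real fix may look quite different.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down node as seen from A. */
typedef struct Node {
    struct Node *replicaof; /* NULL when the node is a primary */
    int slots;              /* number of slots owned in this view */
} Node;

/* A's local view of the shard just before C's PING arrives. */
typedef struct View {
    Node a, b, c;
} View;

void init_view(View *v) {
    v->a = (Node){NULL, 16384}; /* A still believes it is the primary */
    v->b = (Node){&v->a, 0};    /* B->replicaof == A in A's view */
    v->c = (Node){&v->a, 0};    /* C is still seen as a replica */
}

/* A handles a PING from `sender` reporting all slots as owned by `owner`.
 * When only_primary_drives is set, messages from a node that A sees as a
 * replica do not trigger the slot ownership update. */
void handle_ping(View *v, Node *sender, Node *owner, bool only_primary_drives) {
    if (only_primary_drives && sender->replicaof != NULL) return;
    owner->slots += v->a.slots; /* slot ownership moves to the winner */
    v->a.slots = 0;
    v->a.replicaof = owner;     /* A follows the winner in its own shard */
}

/* True when `n` and its primary point at each other. */
bool has_cycle(const Node *n) {
    return n->replicaof != NULL && n->replicaof->replicaof == n;
}
```

Calling `handle_ping(&v, &v.c, &v.b, false)` on a fresh view leaves A and B pointing at each other, while the same call with the guard enabled leaves A's view unchanged until B's own message arrives.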

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>

hpatro · 2024-07-17T16:58:46Z

This still fails after merging #754 due to primary-replica cycle. Still needs deep dive.

PingXie · 2024-07-17T20:58:19Z

Interesting. @madolson can you share your findings when you get a chance? I assume it is different from #754?

bentotten · 2024-08-29T22:25:31Z

I think we should consider if this PR is still needed if/when we reduce the delay (see: #778) - This was a great PR and moved mountains in terms of figuring out what was wrong, but it would be great to reduce the delay entirely

Instead of mitigating the effects of shard ID not being stabilized, we can instead connect the needed flags to the node immediately during the handshake, thus avoiding this situation entirely. This approach will also have the benefit of increasing the speed of stabilization, as there will be less "hops" needed to reach a shard ID consensus.

PingXie · 2024-08-30T01:19:21Z

> Interesting. @madolson can you share your findings when you get a chance? I assume it is different from #754?

I have a theory about how this could happen.

We had a stale PONG message issue, which was fixed in commit 28976a9:

valkey/src/cluster_legacy.c, line 3271 in 2b76c8f:
    if (sender->configEpoch > sender_claimed_config_epoch) {

However, we didn't bail after detecting this stale message. We proceed to

valkey/src/cluster_legacy.c, line 3311 in 2b76c8f:
    if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) {

and then update the sender's replicaof based on the stale message at

valkey/src/cluster_legacy.c, line 3317 in 2b76c8f:
    sender->replicaof = sender_claimed_primary;

Now, imagine the following scenario

[T0] Three nodes: primary A with replica B, and an observer node N
[T1] B's PONG message to N claiming A is its primary gets stuck somewhere on the way to N
[T2] B becomes primary after a manual failover and then notifies A (and N but that message will get stuck behind the PONG message at T1)
[T3] A becomes a replica of B
[T4] A, now a replica of B, sends a PING to N, which goes through the following steps that end up indirectly "promoting" B to a primary

valkey/src/cluster_legacy.c, line 3257 in 2b76c8f:
    if (sender) {

valkey/src/cluster_legacy.c, line 3267 in 2b76c8f:
    if (sender_last_reported_as_primary) {

valkey/src/cluster_legacy.c, line 3269 in 2b76c8f:
    if (sender_claimed_primary && areInSameShard(sender_claimed_primary, sender)) {

valkey/src/cluster_legacy.c, line 3281 in 2b76c8f:
    clusterSetNodeAsPrimary(sender_claimed_primary);

and sets A's replicaof to B

valkey/src/cluster_legacy.c, line 3311 in 2b76c8f:
    if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) {

valkey/src/cluster_legacy.c, line 3317 in 2b76c8f:
    sender->replicaof = sender_claimed_primary;

[T5] Finally, B's PONG message to N from [T1] arrives at N and goes through the following steps

valkey/src/cluster_legacy.c, line 3257 in 2b76c8f:
    if (sender) {

valkey/src/cluster_legacy.c, line 3264 in 2b76c8f:
    /* Node is a replica. */

Due to [T4], B was implicitly promoted to primary

valkey/src/cluster_legacy.c, line 3267 in 2b76c8f:
    if (sender_last_reported_as_primary) {

However, the epoch is stale, which is correctly detected

valkey/src/cluster_legacy.c, line 3271 in 2b76c8f:
    if (sender->configEpoch > sender_claimed_config_epoch) {

valkey/src/cluster_legacy.c, line 3273 in 2b76c8f:
    "Ignore stale message from %.40s (%s) in shard %.40s;"

We don't bail but instead continue to

valkey/src/cluster_legacy.c, line 3311 in 2b76c8f:
    if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) {

and finally update B->replicaof to A, completing the loop

valkey/src/cluster_legacy.c, line 3317 in 2b76c8f:
    sender->replicaof = sender_claimed_primary;

I have seen stale messages in the past, and I also notice that the latest failure is in the codecov run, which could alter the timing quite a bit, so I think this theory is very plausible.

The fix would be to bail immediately after detecting the stale message:

valkey/src/cluster_legacy.c, line 3273 in 2b76c8f:
    "Ignore stale message from %.40s (%s) in shard %.40s;"

BTW, we have another undetected stale message issue (#798)
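
To make the proposed bail-out concrete, here is a toy model of the replica-message path from the theory above. It is a heavily simplified sketch: `Node`, `handle_replica_claim`, and the `bail_on_stale` switch are hypothetical, and the real handler in `cluster_legacy.c` does much more than this.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down node as seen from the observer N. */
typedef struct Node {
    struct Node *replicaof; /* NULL when the node is a primary */
    uint64_t config_epoch;  /* latest epoch N knows for this node */
} Node;

/* `sender` claims to be a replica of `claimed_primary` at `claimed_epoch`.
 * Returns true if the message was processed, false if it was dropped. */
bool handle_replica_claim(Node *sender, Node *claimed_primary,
                          uint64_t claimed_epoch, bool bail_on_stale) {
    if (sender->config_epoch > claimed_epoch) {
        /* Stale message: N already knows a newer epoch for the sender. */
        if (bail_on_stale) return false;
        /* Without the bail, processing falls through to the update below. */
    }
    if (claimed_primary != NULL && sender->replicaof != claimed_primary) {
        sender->replicaof = claimed_primary;
    }
    return true;
}
```

In N's view after [T4], A is a replica of B and B is a primary with a newer epoch. Feeding B's delayed PONG through this function without the bail sets B->replicaof to A and closes the loop; with the bail, B's state is left alone.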

PingXie · 2024-08-30T01:22:21Z

> I think we should consider if this PR is still needed if/when we reduce the delay (see: #778) - This was a great PR and moved mountains in terms of figuring out what was wrong, but it would be great to reduce the delay entirely
>
> Instead of mitigating the effects of shard ID not being stabilized, we can instead connect the needed flags to the node immediately during the handshake, thus avoiding this situation entirely. This approach will also have the benefit of increasing the speed of stabilization, as there will be less "hops" needed to reach a shard ID consensus.

Yeah I think we will need both. Let me pick up my slack next ... :(

tests/unit/cluster/shardid-propagation.tcl

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>

PingXie · 2024-09-11T04:04:21Z

The tests still fail for replicaof loops. I think we need a fix for #1015 first.

hpatro requested review from PingXie and enjoy-binbin May 29, 2024 21:07

hpatro force-pushed the shard_id_divergence branch from 7cabc57 to 1714613 May 29, 2024 21:10

hpatro requested a review from madolson June 3, 2024 19:16

enjoy-binbin reviewed Jun 4, 2024

madolson reviewed Jun 12, 2024 (src/cluster_legacy.c, resolved)

hpatro mentioned this pull request Jun 12, 2024: [BUG] Flaky cluster tests 11-manual-takeover.tcl in 7.2 #609 (Closed)

PingXie reviewed Jun 13, 2024 (src/cluster_legacy.c, outdated, resolved)

madolson reviewed Jun 14, 2024 (tests/unit/cluster/shardid-propagation.tcl, outdated, resolved)

madolson reviewed Jun 14, 2024 (tests/unit/cluster/shardid-propagation.tcl, outdated, resolved)

Commit 770cfa9: Avoid shard id update of replica if not matching with primary shard id. Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>

hpatro force-pushed the shard_id_divergence branch from 69a7d96 to 770cfa9 July 1, 2024 18:39

madolson added the release-notes label (This issue should get a line item in the release notes) Jul 2, 2024

madolson approved these changes Jul 2, 2024

This was referenced Jul 7, 2024: Regression from PR #445 Incorrectly Allows Slot Ownership Updates via Replica #753 (Closed); Ensure only primary sender drives slot ownership updates #754 (Merged)

bentotten mentioned this pull request Jul 12, 2024: [BUG] nodes.conf can be corrupted when node is restarted before cluster shard ID stabilizes (for 7.2) #774 (Open)

Commit 3914a23: Merge branch 'unstable' into shard_id_divergence. Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>

PingXie mentioned this pull request Sep 1, 2024: Fix data loss when the old primary takes over the slots after online #974 (Open)

madolson reviewed Sep 10, 2024 (tests/unit/cluster/shardid-propagation.tcl, outdated, resolved)

Commit da13ec2: Update tests/unit/cluster/shardid-propagation.tcl. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>

PingXie mentioned this pull request Sep 11, 2024: Stale PONG message causes incorrect replicaof updates leading to replicaof loops #1015 (Open)
