Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stale PONG message causes incorrect replicaof updates leading to replicaof loops #1015

Open
PingXie opened this issue Sep 11, 2024 · 0 comments

Comments

@PingXie
Copy link
Member

PingXie commented Sep 11, 2024

I have a theory about how this could happen.

  1. We had a stale PONG message issue, which was fixed in commit 28976a9
    if (sender->configEpoch > sender_claimed_config_epoch) {
  2. However we didn't bail after detecting this stale message. We proceed to
    if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) {
  3. And then update sender's replicaof based on the stale message at:
    sender->replicaof = sender_claimed_primary;

Now, imagine the following scenario

[T0] Three nodes: primary A with replica B, and an observer node N
[T1] B's PONG message to N claiming A is its primary gets stuck somewhere on the way to N
[T2] B becomes primary after a manual failover and then notifies A (and N but that message will get stuck behind the PONG message at T1)
[T3] A becomes a replica of B
[T4] A, now a replica of B, sends PING to N, which goes through the following steps that end up "promote" B to a primary, indirectly

  1. if (sender) {
  2. if (sender_last_reported_as_primary) {
  3. if (sender_claimed_primary && areInSameShard(sender_claimed_primary, sender)) {
  4. clusterSetNodeAsPrimary(sender_claimed_primary);

    and sets A's replicaof to B
  5. if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) {
  6. sender->replicaof = sender_claimed_primary;

    [T5] Finally, B's PONG message to N from [T1] arrives on N and it goes through the following steps
  7. if (sender) {
  8. /* Node is a replica. */

    Due to step 4, B got promoted to primary, implicitly
  9. if (sender_last_reported_as_primary) {

    However the epoch is stale, which is correctly handled
  10. if (sender->configEpoch > sender_claimed_config_epoch) {
  11. "Ignore stale message from %.40s (%s) in shard %.40s;"

    We don't bail but instead continue to
  12. if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) {

    and finally updates B->replicaof to A, completing the loop
  13. sender->replicaof = sender_claimed_primary;

I have seen stale messages in the past and I also notice that the latest failure in the codecov run, which could alter the timing quite a bit so I think this theory is very plausible.

The fix would be to bail immediately after detecting the stale message

"Ignore stale message from %.40s (%s) in shard %.40s;"

BTW, we have another undetected stale message issue (#798)

Originally posted by @PingXie in #573 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant