
Fix reconfiguring sub-replica causing data loss when myself change shard_id #944

Merged
merged 11 commits into unstable from fix_data_loss
Aug 29, 2024

Conversation

enjoy-binbin
Member

@enjoy-binbin enjoy-binbin commented Aug 26, 2024

When reconfiguring a sub-replica, there is a case where the sub-replica may use the
old offset, win the election, and cause data loss if the old primary goes down.

In this case, the sender is myself's primary. When executing updateShardId,
not only is the sender's shard_id updated, but the shard_id of myself is
updated as well, causing the subsequent areInSameShard check, that is,
the full_sync_required check, to fail.

As part of the recent fix in #885, the sub-replica needs to decide whether
a full sync is required when switching shards. This shard membership
check is supposed to be done against the sub-replica's current shard_id, which
however was lost in this code path. This then leads to the sub-replica joining
the other shard with a completely different and incorrect replication history.

This is the only place where the replicaof state can be updated on this path,
so the most natural fix is to pull the chain replication reduction logic into
this code block, before the updateShardId call.

This follows #885 and closes #942.
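
As an illustration only, a rough sketch of that reordering (not the exact patch; the identifiers are the ones quoted in the code-flow discussion below, and the surrounding bookkeeping in clusterProcessPacket is elided):

    if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) {
        /* The sender (our primary) now claims to follow another node. Do the
         * chain replication reduction here, while myself->shard_id still holds
         * the old shard, so full_sync_required is decided against the
         * sub-replica's current shard. */
        if (myself->replicaof == sender && sender_claimed_primary != myself) {
            clusterSetPrimary(sender_claimed_primary, 1,
                              !areInSameShard(sender_claimed_primary, myself));
        }

        /* Only afterwards record the sender's new primary and update its
         * shard_id, which also rewrites myself->shard_id when the sender is
         * our primary. */
        sender->replicaof = sender_claimed_primary;
        updateShardId(sender, sender_claimed_primary->shard_id);
    }

The exact guards in the merged change may differ; the point is only the ordering of the areInSameShard() decision relative to the updateShardId() call.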

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Aug 26, 2024
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin
Member Author

enjoy-binbin commented Aug 26, 2024

Is there a good way to write a test case to cover this? I don't know of an easier way to trigger this sub-replica case.

Full CI: https://github.com/valkey-io/valkey/actions/runs/10555652357


codecov bot commented Aug 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.46%. Comparing base (4fe8320) to head (f5cd658).
Report is 9 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #944      +/-   ##
============================================
+ Coverage     70.44%   70.46%   +0.02%     
============================================
  Files           113      114       +1     
  Lines         61744    61730      -14     
============================================
+ Hits          43493    43499       +6     
+ Misses        18251    18231      -20     
Files with missing lines Coverage Δ
src/cluster_legacy.c 85.89% <100.00%> (+0.32%) ⬆️

... and 21 files with indirect coverage changes

@PingXie
Member

PingXie commented Aug 27, 2024

In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated,

I take it that we ended up with a replica (myself) disagreeing with its primary on the shard-id? Do you have a log snippet handy that demonstrates this? Let me take a closer look at the shard id update logic, holistically. There have been a few tricky bugs in this area. I wonder if they are affected by common issues.

@enjoy-binbin
Member Author

enjoy-binbin commented Aug 27, 2024

The link in #942 has the full logs, but I happen to have some here:

9e03820e6636099e9a10d26a1677ae00b3b452d3:

# The primary
33772:M 25 Aug 2024 01:14:36.928 * Migrating slot 0 to node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 ()
33772:M 25 Aug 2024 01:14:37.027 * Slot 0 is migrated from node 9e03820e6636099e9a10d26a1677ae00b3b452d3 () in shard 1338c68632657b706e6233d3ad9ae051427d1b6b to node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () in shard 7b9ca1414bb71e553cce31e358efb7fef29e7e32.
33772:M 25 Aug 2024 01:14:37.029 * Slot 0 is no longer being migrated to node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () in shard 7b9ca1414bb71e553cce31e358efb7fef29e7e32.
33772:M 25 Aug 2024 01:14:37.043 * Configuration change detected. Reconfiguring myself as a replica of node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () in shard 7b9ca1414bb71e553cce31e358efb7fef29e7e32

# The offset is 0 as we expected
33772:S 25 Aug 2024 01:14:37.055 * Connecting to PRIMARY 127.0.0.1:21455
33772:S 25 Aug 2024 01:14:37.057 * PRIMARY <-> REPLICA sync started
33772:S 25 Aug 2024 01:14:37.083 * Non blocking connect for SYNC fired the event.
33772:S 25 Aug 2024 01:14:37.090 * Primary replied to PING, replication can continue...
33772:S 25 Aug 2024 01:14:37.098 * Partial resynchronization not possible (no cached primary)
33772:S 25 Aug 2024 01:14:37.103 * Full resync from primary: 42896f8f4927136a9514b738af65fbb571806be4:32118
33772:S 25 Aug 2024 01:14:37.109 - Client closed connection id=6 addr=127.0.0.1:34708 laddr=127.0.0.1:21452 fd=27 name= age=19 idle=0 flags=S db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=16384 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=1 omem=16448 tot-mem=34758 events=r cmd=replconf user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=974 tot-net-out=635598 tot-cmds=26
33772:S 25 Aug 2024 01:14:37.110 * Connection with replica 127.0.0.1:21448 lost.
33772:S 25 Aug 2024 01:14:37.152 * Node 183d130727181bd5f28b3cf0675cbf0153f6b283 () is now a replica of node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () in shard 1338c68632657b706e6233d3ad9ae051427d1b6b
33772:S 25 Aug 2024 01:14:48.593 * Start of election delayed for 2885 milliseconds (rank #2, offset 0).

183d130727181bd5f28b3cf0675cbf0153f6b283:

# The replica
33547:S 25 Aug 2024 01:14:36.914 * Migrating slot 0 to node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 ()
33547:S 25 Aug 2024 01:14:37.029 * Assigning slot 0 to node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () in shard 7b9ca1414bb71e553cce31e358efb7fef29e7e32
33547:S 25 Aug 2024 01:14:37.097 * Node 9e03820e6636099e9a10d26a1677ae00b3b452d3 () is no longer primary of shard 1338c68632657b706e6233d3ad9ae051427d1b6b; removed all 0 slot(s) it used to own
33547:S 25 Aug 2024 01:14:37.097 * Node 9e03820e6636099e9a10d26a1677ae00b3b452d3 () is now part of shard 7b9ca1414bb71e553cce31e358efb7fef29e7e32

# In here, the replica's primary sends its PING, and the replica needs to update the sender's shard_id. Since the sender is myself's primary,
# we also update myself's shard_id here. See `updateShardId(sender, sender_claimed_primary->shard_id)`
33547:S 25 Aug 2024 01:14:37.097 * Node 9e03820e6636099e9a10d26a1677ae00b3b452d3 () is now a replica of node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () in shard 7b9ca1414bb71e553cce31e358efb7fef29e7e32

# And then, the reconfiguration happens. Since myself's shard_id was updated above, the `!areInSameShard(myself->replicaof->replicaof, myself)`
# check fails, and we think we can do a psync and keep the old offset.
33547:S 25 Aug 2024 01:14:37.099 * I'm a sub-replica! Reconfiguring myself as a replica of 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 from 9e03820e6636099e9a10d26a1677ae00b3b452d3

# Doing the psync with the old offset
33547:M 25 Aug 2024 01:14:37.103 * Connection with primary lost.
33547:M 25 Aug 2024 01:14:37.104 * Caching the disconnected primary state.
33547:S 25 Aug 2024 01:14:37.110 * Connecting to PRIMARY 127.0.0.1:21455
33547:S 25 Aug 2024 01:14:37.110 * PRIMARY <-> REPLICA sync started
33547:S 25 Aug 2024 01:14:37.121 * Non blocking connect for SYNC fired the event.
33547:S 25 Aug 2024 01:14:37.123 * Primary replied to PING, replication can continue...
33547:S 25 Aug 2024 01:14:37.151 * Trying a partial resynchronization (request a23ef12063d15051cfb636553d48c95933a8429f:317789).
33547:S 25 Aug 2024 01:14:37.244 - Client closed connection id=8 addr=127.0.0.1:52912 laddr=127.0.0.1:21448 fd=28 name= age=1 idle=1 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=16384 argv-mem=0 multi-mem=0 rbs=1024 rbp=987 obl=0 oll=0 omem=0 tot-mem=18310 events=r cmd=cluster|nodes user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=56 tot-net-out=1974 tot-cmds=2
33547:S 25 Aug 2024 01:14:47.522 # Failed to read response from the server: Success
33547:S 25 Aug 2024 01:14:47.522 # Primary did not reply to PSYNC, will try later
33547:S 25 Aug 2024 01:14:47.604 - Connection with Node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 at 127.0.0.1:31455 failed: Connection refused
33547:S 25 Aug 2024 01:14:48.008 * Connecting to PRIMARY 127.0.0.1:21455
33547:S 25 Aug 2024 01:14:48.008 * PRIMARY <-> REPLICA sync started
33547:S 25 Aug 2024 01:14:48.009 # Error condition on socket for SYNC: Connection refused
33547:S 25 Aug 2024 01:14:48.010 - Connection with Node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 at 127.0.0.1:31455 failed: Connection refused
33547:S 25 Aug 2024 01:14:48.415 * NODE 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 () possibly failing.
33547:S 25 Aug 2024 01:14:48.418 - Connection with Node 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 at 127.0.0.1:31455 failed: Connection refused
33547:S 25 Aug 2024 01:14:48.512 * FAIL message received from 9e03820e6636099e9a10d26a1677ae00b3b452d3 () about 3081a0e079dd41c05fc5f8f7078f04c1816a8fe7 ()
33547:S 25 Aug 2024 01:14:48.513 # Cluster state changed: fail

# Using the old offset. It is wrong
33547:S 25 Aug 2024 01:14:48.521 * Start of election delayed for 729 milliseconds (rank #0, offset 317788).
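
To make the interaction concrete, here is a tiny standalone model of what these logs show. This is not the real cluster code: the structs are stripped down, and the shard_id propagation to replicas is a modeled assumption based on the behavior described in the comments above.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct clusterNode {
        char shard_id[41];
        struct clusterNode *replicaof;
    } clusterNode;

    static bool areInSameShard(clusterNode *a, clusterNode *b) {
        return strcmp(a->shard_id, b->shard_id) == 0;
    }

    /* Modeled assumption: updating a node's shard_id also updates myself's
     * shard_id when that node is myself's primary, as the log comments above
     * describe. */
    static void updateShardId(clusterNode *node, const char *shard_id, clusterNode *myself) {
        strcpy(node->shard_id, shard_id);
        if (myself->replicaof == node) strcpy(myself->shard_id, shard_id);
    }

    int main(void) {
        clusterNode primary2 = {.shard_id = "shard-2", .replicaof = NULL};
        clusterNode primary1 = {.shard_id = "shard-1", .replicaof = &primary2};
        clusterNode myself = {.shard_id = "shard-1", .replicaof = &primary1};

        /* Buggy order: propagate the shard_id first, then decide on full sync.
         * The check sees "same shard", so the old offset is kept for psync. */
        updateShardId(&primary1, primary2.shard_id, &myself);
        printf("buggy order: full_sync_required=%d\n", !areInSameShard(&primary2, &myself));

        /* Fixed order: decide against the current (old) shard_id first. */
        clusterNode myself_fixed = {.shard_id = "shard-1", .replicaof = &primary1};
        printf("fixed order: full_sync_required=%d\n", !areInSameShard(&primary2, &myself_fixed));
        return 0;
    }

The buggy order prints full_sync_required=0, so the sub-replica keeps its stale offset and can win the election with it; the fixed order prints 1 and forces the full sync.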

@PingXie
Member

PingXie commented Aug 27, 2024

Capturing the minimum repro setup and event sequence based on offline discussion (thanks for the details, @enjoy-binbin!).

Minimum repro setup

  • two shards: shard 1 with primary 1 and replica 1, and shard 2 with primary 2
  • shard 1 has one slot, slot 0, while shard 2 has the remaining slots
  • cluster-allow-replica-migration is enabled on primary 1 (which defaults to true)

Event sequence

  1. now migrate slot 0 from shard 1 to shard 2
  2. primary 1 loses slot 0, which is its last slot, to primary 2
  3. because cluster-allow-replica-migration (this name seems a misnomer btw) is enabled on primary 1, primary 1 joins shard 2, following primary 2.
  4. primary 1 sends PING to replica 1 claiming its replicaof is primary 2
  5. replica 1 processes the PING message in clusterProcessPacket

Code flow

  1. replica 1 first notices that primary 1 claims a different replicaof, changing from NULL to primary 2 at
    if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) {
  2. replica 1 then updates its view of primary 1's replicaof at
    sender->replicaof = sender_claimed_primary;
  3. and updates primary 1's shard_id at
    updateShardId(sender, sender_claimed_primary->shard_id);
  4. it then proceeds to the chain replication reduction logic at
    if (myself->replicaof && myself->replicaof->replicaof && myself->replicaof->replicaof != myself) {
  5. there it decides to replicate directly from primary 1's replicaof, which is primary 2, at
    clusterSetPrimary(myself->replicaof->replicaof, 1, !areInSameShard(myself->replicaof->replicaof, myself));
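
Putting those five steps together, the problematic ordering reads roughly like this (simplified, unrelated handling elided):

    /* Steps 1-3: on the PING, record the sender's new primary and propagate the
     * shard_id -- which also rewrites myself->shard_id, because the sender is
     * myself's primary. */
    if (sender_claimed_primary && sender->replicaof != sender_claimed_primary) {
        sender->replicaof = sender_claimed_primary;
        updateShardId(sender, sender_claimed_primary->shard_id);
    }

    /* Steps 4-5: chain replication reduction. By now myself already carries the
     * new shard_id, so areInSameShard() reports "same shard", full_sync_required
     * ends up false, and the stale offset survives. */
    if (myself->replicaof && myself->replicaof->replicaof && myself->replicaof->replicaof != myself) {
        clusterSetPrimary(myself->replicaof->replicaof, 1,
                          !areInSameShard(myself->replicaof->replicaof, myself));
    }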

Bug

As part of the recent fix of #885, replica 1 needs to decide whether a full sync is required or not when switching shards (from shard 1 to shard 2). This shard membership check is supposed to be done against replica 1's current shard_id, which however was lost in step 3 of "code flow". This then leads to replica 1 joining shard 2 with a completely different and incorrect replication history.

Does this match your understanding of the issue, @enjoy-binbin?

@enjoy-binbin
Member Author

Yes, it matches. Thank you for the full details, it's very clear.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Member

@PingXie PingXie left a comment


LGTM overall. I assume the test failed as expected without the server fix?

src/cluster_legacy.c (outdated review thread, resolved)
src/cluster_legacy.c (outdated review thread, resolved)
src/cluster_legacy.c (outdated review thread, resolved)
@enjoy-binbin
Member Author

LGTM overall. I assume the test failed as expected without the server fix?

Yes, the test I added fails without the server fix.

enjoy-binbin and others added 3 commits August 29, 2024 14:42
Co-authored-by: Ping Xie <pingxie@outlook.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin merged commit ecbfb6a into valkey-io:unstable Aug 29, 2024
56 checks passed
@enjoy-binbin enjoy-binbin deleted the fix_data_loss branch August 29, 2024 14:39
madolson pushed a commit that referenced this pull request Sep 2, 2024
madolson pushed a commit that referenced this pull request Sep 3, 2024
PingXie added a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
PingXie added a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
PingXie added a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
PingXie added a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
Labels
run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Test failure] Migrated replica reports zero repl offset issue
3 participants