Support for Diskless Replication #997

vazois · 2025-02-04T21:28:15Z

This PR adds support for diskless replication (it is more of diskless full synchronization, but I am using Redis terminology to be consistent).

When replicas attach to a primary and require full synchronization due to AOF truncation, they may trigger an on-demand checkpoint, and the latest checkpoint data must then be streamed to them for recovery.
This method of full synchronization is extremely inefficient for the following reasons:

Write amplification at the primary when flushing the checkpoint.
Read amplification at the primary since replicas read and stream the checkpoint files in parallel.
Write and read amplification at the replica, which has to first write and then read the checkpoint data in order to recover.

With diskless replication we aim to eliminate these inefficiencies. Diskless replication relies on the streaming snapshot feature of tsavorite (#824) to stream a consistent snapshot of key-value pairs to the replica when full synchronization is necessary.
When a replica attempts to synchronize with an active primary, it performs the following steps:

It issues a CLUSTER ATTACH_SYNC request to the corresponding primary, which processes the request by creating a sync task.
The attaching sync task tries to create a ReplicaSyncSession object, sets its sync status to INITIALIZING, and adds that object to the ReplicaSyncSessionTaskStore under lock.
Once the sync session is added, the sync task waits for a few seconds (--repl-diskless-sync-delay) to allow other replicas to attach in the same way.
After the wait time is over, the sync task that attached first initiates the StreamingSnapshotDriver (SSD) as a background task. All sync tasks then wait for the SSD to complete.
The SSD acquires an exclusive lock that prevents any other sync tasks from being added, and orchestrates the full synchronization by streaming a consistent snapshot of the key-value pairs to all replicas that need it.
The SSD completes by notifying the waiting sync tasks that synchronization has completed, and releases the exclusive lock so that more tasks can be added for the next diskless replication session.
The waiting sync tasks then, in parallel, notify their replicas to start recovery of the AOF and subsequently spawn a background AofSyncTask to start streaming the AOF records generated at the primary.
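
To make the coordination in the steps above easier to follow, here is a minimal Python sketch of the attach / wait / leader-drives-snapshot pattern. It is not the code in this PR: the class names only mirror the names used in the description, `SyncSessionStore`, `needs_full_sync`, `received` and `driver_runs` are hypothetical, the "snapshot" is just a copy of a dict, and AOF streaming is left out entirely.

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum, auto


class SyncStatus(Enum):
    INITIALIZING = auto()
    SUCCESS = auto()


@dataclass
class ReplicaSyncSession:
    replica_id: str
    needs_full_sync: bool = True
    status: SyncStatus = SyncStatus.INITIALIZING
    received: list = field(default_factory=list)  # stands in for the replica's store
    done: asyncio.Event = field(default_factory=asyncio.Event)


class SyncSessionStore:
    """Hypothetical stand-in for the session store plus the snapshot driver."""

    def __init__(self, store, sync_delay):
        self.store = store            # stands in for the primary's key-value store
        self.sync_delay = sync_delay  # plays the role of --repl-diskless-sync-delay
        self.sessions = []
        self.lock = asyncio.Lock()    # held by the driver for the whole streaming phase
        self.driver_task = None
        self.driver_runs = 0

    async def attach(self, session):
        # Register the session under the lock; the first one in becomes the leader.
        async with self.lock:
            self.sessions.append(session)
            is_leader = len(self.sessions) == 1
        # Give other replicas a chance to attach to the same session.
        await asyncio.sleep(self.sync_delay)
        # Only the leader starts the driver, as a background task.
        if is_leader:
            self.driver_task = asyncio.create_task(self._driver())
        # Every sync task waits for the driver to release it.
        await session.done.wait()
        # The real system would now start AOF recovery and AOF streaming.
        return session.status

    async def _driver(self):
        # Holding the lock blocks new attaches until this session is finished.
        async with self.lock:
            self.driver_runs += 1
            batch, self.sessions = self.sessions, []
            targets = []
            for s in batch:
                if s.needs_full_sync:
                    targets.append(s)
                else:
                    # Early release: this replica does not need the snapshot.
                    s.status = SyncStatus.SUCCESS
                    s.done.set()
            # One pass over the store serves every attached replica.
            for kv in sorted(self.store.items()):
                for s in targets:
                    s.received.append(kv)
            for s in targets:
                s.status = SyncStatus.SUCCESS
                s.done.set()


async def demo():
    store = SyncSessionStore({"a": 1, "b": 2}, sync_delay=0.05)
    replicas = (
        ReplicaSyncSession("r1"),
        ReplicaSyncSession("r2"),
        ReplicaSyncSession("r3", needs_full_sync=False),
    )
    results = await asyncio.gather(*(store.attach(r) for r in replicas))
    return store, replicas, results


if __name__ == "__main__":
    asyncio.run(demo())
```

In the sketch, a replica that attaches while the driver holds the lock simply blocks in `attach` and becomes the leader of the next session, which is the behaviour the exclusive lock is meant to give.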

By using the streaming checkpoint approach, we eliminate write and read amplification at the primary.
In addition, by allowing multiple replicas to synchronize in parallel, we reduce the overhead of scanning the TsavoriteStore multiple times.
Finally, we eliminate both read and write amplification at the replica because we don't require writing and reading the checkpoint to recover before starting to stream the AOF records.
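
As a rough back-of-the-envelope illustration of the amplification argument, the sketch below counts checkpoint bytes that touch disk in each mode. It is a simplified model, not a measurement: it assumes the primary reads the full checkpoint once per attaching replica (no page-cache sharing), that the checkpoint is flushed once, and it ignores AOF traffic. The function names are hypothetical.

```python
GIB = 1024 ** 3


def disk_based_io(checkpoint_bytes, replicas):
    """Checkpoint bytes hitting disk for one disk-based full sync (simplified model)."""
    return {
        "primary_write": checkpoint_bytes,            # flush of the on-demand checkpoint
        "primary_read": checkpoint_bytes * replicas,  # files read once per attaching replica
        "replica_write": checkpoint_bytes,            # per replica: persist received files
        "replica_read": checkpoint_bytes,             # per replica: read them back to recover
    }


def diskless_io(checkpoint_bytes, replicas):
    """Same accounting when key-value pairs are streamed straight from memory."""
    return {
        "primary_write": 0,
        "primary_read": 0,
        "replica_write": 0,
        "replica_read": 0,
    }


if __name__ == "__main__":
    for name, model in (("disk-based", disk_based_io), ("diskless", diskless_io)):
        io = model(10 * GIB, replicas=3)
        print(name, {k: v // GIB for k, v in io.items()})
```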

NOTES:

The SSD will release early any sync task that does not require full synchronization.
Currently, at the completion of a streaming checkpoint the AOF gets safely truncated. This is not necessary and might conflict with persistence guarantees, but it was done to keep the AOF from growing arbitrarily large. The assumption is that taking regular checkpoints at the primary is orthogonal to diskless replication.
Currently, the replica will not write any data to its local disk when receiving the streaming checkpoint. It is possible to eliminate this restriction, but I felt that this goes against the spirit of truly diskless replication.
For now, diskless replication will operate separately from the disk-based approach to allow for a preview period. It may be possible to merge the two features later, or to remove disk-based replication entirely if it is no longer necessary.

badrishc · 2025-02-11T18:10:54Z

Currently, at the completion of a streaming checkpoint the AOF gets safely truncated.

If --main-memory-replication (to be renamed to --fast-aof-truncate) is on, we will truncate the AOF aggressively based on what has been sent to replicas; otherwise we will truncate the AOF only up to the last on-disk checkpoint, correct? This is important so that the node can recover after a restart by loading the last checkpoint from disk and rolling the AOF forward as usual.

Our system uses the same AOF for local roll-forward and for replication, so this is necessary for completeness.

vazois · 2025-02-11T19:00:41Z

Yes, that makes sense. I will add a guard for this based on the MMR flag and update the description.
Renaming should happen maybe later, to give us time to communicate the change to people who use the flag.
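
For readers following the thread, the guard being discussed can be sketched roughly as below. This is only an illustration of the rule stated in the comments, not the code that was committed: `aof_truncation_bound` and its parameters are hypothetical names, and AOF addresses are modelled as plain integers.

```python
def aof_truncation_bound(min_replica_aof_address,
                         last_checkpoint_aof_address,
                         fast_aof_truncate):
    """Return the address up to which the AOF may be truncated.

    min_replica_aof_address: lowest AOF address already sent to every
        attached replica (hypothetical input).
    last_checkpoint_aof_address: AOF address covered by the last on-disk
        checkpoint (hypothetical input).
    fast_aof_truncate: stands in for --main-memory-replication.
    """
    if fast_aof_truncate:
        # Aggressive mode: drop everything the replicas already have.
        return min_replica_aof_address
    # Default mode: keep the tail needed to roll forward from the last
    # on-disk checkpoint after a restart.
    return last_checkpoint_aof_address
```

With the flag off, a streaming checkpoint on its own never moves the bound, because it does not produce an on-disk checkpoint to recover from.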

badrishc · 2025-02-11T20:15:02Z

Renaming should happen maybe later

The concern is that this PR introduces 'diskless replication', which semantically conflicts with 'main memory replication' and makes the two confusing. Could we perhaps add FastAofTruncate and still keep MMR, i.e., set FastAofTruncate = MMR if they specify MMR, so that it stays backwards compatible? Perhaps mark MMR as obsolete.
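
One way to picture the backwards-compatible alias suggested here is the sketch below, using Python's `argparse` purely as an illustration (it is not how the server parses its options). Both flags write to the same setting, and the old one additionally emits a deprecation warning. `DeprecatedAlias` and `build_parser` are hypothetical names.

```python
import argparse
import warnings


class DeprecatedAlias(argparse.Action):
    """Boolean flag that sets the new option and warns that the old name is obsolete."""

    def __init__(self, option_strings, dest, replacement="", **kwargs):
        self.replacement = replacement
        super().__init__(option_strings, dest, nargs=0, default=False, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        warnings.warn(
            f"{option_string} is obsolete; use {self.replacement}",
            DeprecationWarning,
            stacklevel=2,
        )
        setattr(namespace, self.dest, True)


def build_parser():
    parser = argparse.ArgumentParser()
    # New name.
    parser.add_argument("--fast-aof-truncate", dest="fast_aof_truncate",
                        action="store_true")
    # Old name kept as an alias: FastAofTruncate = MMR when MMR is specified.
    parser.add_argument("--main-memory-replication", dest="fast_aof_truncate",
                        action=DeprecatedAlias, replacement="--fast-aof-truncate")
    return parser
```

Existing deployments that pass the old flag keep working unchanged, while the warning gives users time to migrate before the old name is removed.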

vazois force-pushed the vazois/diskless-repl branch 6 times, most recently from b802961 to 9a4ad8f · February 7, 2025 16:53

vazois added 24 commits February 10, 2025 10:00

expose diskless replication parameters

e075e06

refactor/cleanup legacy ReplicaSyncSession

af44845

add interface to support diskless replication session and aof tasks

67c2992

core diskless replication implementation

8d91395

expose diskless replication API

04a2b61

adding test for diskless replication

a8edddd

update gcs extension to clearly mark logging progress

c99db59

fix gcs dispose on diskless attach, call dispose of replicationSyncManager, add more logging

09da433

complete first diskless replication test

d4f24bd

fix iterator check for null when empty store

43825b2

fix iterator for object store cluster sync

0c25f03

add simple diskless sync test

f6cb514

cleanup code

b29d460

replica fall behind test

219186f

wip

86da64a

register cts at wait for sync completion

789fee2

add db version alignment test

3d2fb36

avoid using close lock for leader based syncing

48eb79e

truncate AOF after streaming checkpoint is taken

ff3fca4

add tests for failover with diskless replication

ae3ed62

fix formatting and conversion to IPEndpoint

a02267a

fix RepCommandsTests

da783e8

dispose aofSyncTask if failed to add to AofSyncTaskStore

c6d2e16

overload dispose ReplicaSyncSession

4400acf

vazois added 9 commits February 10, 2025 10:01

explicitly dispose gcs used for full sync at replicaSyncSession sync

57a5f19

dispose gcs once on return

bd00199

code cleanup

dd805ff

update tests to provide more context logging

75ce63b

add more comprehensive logging of syncMetadata

e60c8d0

add timeout for streaming checkpoint

692781d

add clusterTimeout for diskless repl tests

1395bbf

some more logging

5dcd651

cleanup and refactor code

10df706

vazois force-pushed the vazois/diskless-repl branch from 5769f37 to 10df706 · February 10, 2025 18:01

vazois marked this pull request as ready for review February 10, 2025 19:15

TalZaccai requested a review from badrishc February 11, 2025 19:22

truncate AOF only when main-memory-replication is switched on

c2772b3
