Support for Diskless Replication #997

Open
wants to merge 34 commits into
base: main

vazois
Contributor

@vazois vazois commented Feb 4, 2025

This PR adds support for diskless replication (more precisely, diskless full synchronization, but I am using Redis terminology for consistency).

When replicas attach to a primary and require full synchronization due to AOF truncation, they might trigger an on-demand checkpoint and then need the latest checkpoint data to be streamed to them for recovery.
This method of full synchronization is extremely inefficient for the following reasons:

  1. Write amplification at the primary when flushing the checkpoint
  2. Read amplification at the primary since replicas read and stream the checkpoint files in parallel.
  3. Write and read amplification at the replica that has to first write and then read the checkpoint data in order to recover.

With diskless replication we aim to eliminate these inefficiencies. Diskless replication relies on the streaming snapshot feature of tsavorite (#824) to stream a consistent snapshot of key-value pairs to the replica when full synchronization is necessary.
When a replica attempts to synchronize with an active primary, it performs the following steps:

  1. It issues a CLUSTER ATTACH_SYNC request to the corresponding primary which processes the request to create a sync task.
  2. The attaching sync task tries to create a ReplicaSyncSession object, sets its sync status to INITIALIZING and under lock adds that object to the ReplicaSyncSessionTaskStore.
  3. Once the sync session is added, the sync task proceeds to wait for a few seconds (--repl-diskless-sync-delay) to allow for other replicas to attach in a similar way.
  4. After the wait time is over, the sync task that attached first will initiate the StreamingSnapshotDriver (SSD) as a background task. Afterwards, all sync tasks will proceed to wait for SSD to complete.
  5. SSD acquires an exclusive lock that prevents any other sync tasks from being added, and orchestrates the full synchronization by streaming a consistent snapshot of the key-value pairs to all replicas that need it.
  6. The SSD completes by notifying any waiting sync tasks that synchronization has completed and releases the exclusive lock to allow for more tasks to be added for the next diskless replication session.
  7. The waiting sync tasks will in parallel notify the replica to start recovery of the AOF and subsequently spawn a background AofSyncTask to start streaming the AOF records generated at the primary.
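
The steps above can be approximated with a small simulation. This is only a sketch in Python, not the Garnet implementation: `DisklessSyncRound`, `try_add`, `run_driver` and `attach_sync` are hypothetical names, the store is a plain dict, the exclusive lock is modeled with a `closed` flag, and only a single replication session is modeled. Step 7 (AOF recovery and the AofSyncTask) is left as a comment.

```python
import asyncio


class DisklessSyncRound:
    """Hypothetical model of one diskless replication session (not Garnet code)."""

    def __init__(self, sync_delay, store):
        self.sync_delay = sync_delay   # plays the role of --repl-diskless-sync-delay
        self.store = store             # dict standing in for the key-value store
        self.sessions = {}             # replica id -> session state
        self.closed = False            # True while the driver holds the exclusive lock
        self.scans = 0                 # number of full scans of the store
        self.done = asyncio.Event()    # set when the driver finishes
        self.lock = asyncio.Lock()
        self.driver = None             # keeps a reference to the background task

    async def try_add(self, replica_id):
        # Step 2: register the session under lock with status INITIALIZING.
        async with self.lock:
            if self.closed:
                return None            # a snapshot is already streaming
            is_first = not self.sessions
            session = {"status": "INITIALIZING", "received": {}}
            self.sessions[replica_id] = session
            return session, is_first

    async def run_driver(self):
        # Step 5: block new sessions, then stream one snapshot to every replica.
        async with self.lock:
            self.closed = True
            targets = list(self.sessions.values())
        self.scans += 1
        for key, value in self.store.items():
            for session in targets:
                session["received"][key] = value
            await asyncio.sleep(0)     # yield, as a network send would
        # Step 6: mark completion, reopen the store, wake the waiting sync tasks.
        for session in targets:
            session["status"] = "SYNCED"
        async with self.lock:
            self.closed = False
        self.done.set()


async def attach_sync(round_, replica_id):
    # Step 1: the primary handles an attach request by creating a sync task.
    added = await round_.try_add(replica_id)
    if added is None:
        return None
    session, is_first = added
    await asyncio.sleep(round_.sync_delay)      # Step 3: let other replicas attach
    if is_first:                                # Step 4: first task starts the driver
        round_.driver = asyncio.create_task(round_.run_driver())
    await round_.done.wait()
    # Step 7 (not modeled): notify the replica and start streaming the AOF.
    return session


async def demo():
    round_ = DisklessSyncRound(0.01, {"a": 1, "b": 2})
    sessions = await asyncio.gather(
        *(attach_sync(round_, r) for r in ("r1", "r2", "r3"))
    )
    return round_, sessions
```

Running `asyncio.run(demo())` attaches three replicas inside the delay window, so the store is scanned once and each session receives the same key-value pairs. The sketch does not model snapshot consistency under concurrent writes, which in the PR is provided by the streaming snapshot feature.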

By using the streaming checkpoint approach, we eliminate write and read amplification at the primary.
In addition, by allowing multiple replicas to synchronize in parallel, we reduce the overhead of scanning the TsavoriteStore multiple times.
Finally, we eliminate both read and write amplification at the replica because we don't require writing and reading the checkpoint to recover before starting to stream the AOF records.

NOTES:

  • The SSD will release early any sync task that does not require full synchronization.
  • Currently, at the completion of a streaming checkpoint the AOF gets safely truncated. This is not necessary and might conflict with persistence guarantees, but it was done to prevent the AOF from growing arbitrarily large. The assumption is that taking regular checkpoints at the primary is orthogonal to diskless replication.
  • Currently, the replica will not write any data to its local disk when receiving the streaming checkpoint. It is possible to eliminate this restriction, but I felt that this goes against the spirit of truly diskless replication.
  • For now, diskless replication will operate separately from the disk-based approach to allow for a preview period. It may be possible to merge the two features or to eliminate disk-based replication entirely if it is no longer necessary.

@vazois vazois force-pushed the vazois/diskless-repl branch 6 times, most recently from b802961 to 9a4ad8f Compare February 7, 2025 16:53
@vazois vazois force-pushed the vazois/diskless-repl branch from 5769f37 to 10df706 Compare February 10, 2025 18:01
@vazois vazois marked this pull request as ready for review February 10, 2025 19:15
@badrishc
Contributor

> Currently, at the completion of a streaming checkpoint the AOF get safely truncated.

If --main-memory-replication (to be renamed to --fast-aof-truncate) is on, we will truncate the AOF aggressively based on what has been sent to replicas; otherwise we will truncate the AOF only up to the last on-disk checkpoint, correct? This is important so that the node can recover after a restart by loading the last checkpoint from disk and rolling forward the AOF as usual.

Our system reuses the same AOF for local roll-forward as what is used for replication, so this would be necessary for completeness.
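
The rule described above can be written down as a small function. This is a sketch only: `safe_truncate_address` and its parameters are hypothetical names rather than Garnet APIs, addresses are modeled as plain integers, and the fallback used when no replicas are attached is an assumption.

```python
from typing import Sequence


def safe_truncate_address(
    fast_aof_truncate: bool,
    checkpoint_covered_until: int,
    sent_to_replicas: Sequence[int],
) -> int:
    """Return the AOF address up to which truncation is safe.

    fast_aof_truncate        -- models --main-memory-replication / --fast-aof-truncate
    checkpoint_covered_until -- AOF address covered by the last on-disk checkpoint
    sent_to_replicas         -- AOF address already sent to each attached replica
    """
    if fast_aof_truncate and sent_to_replicas:
        # Aggressive mode: drop everything every replica has already been sent.
        return min(sent_to_replicas)
    # Default mode (and, by assumption, aggressive mode with no replicas):
    # keep everything after the last on-disk checkpoint so that a restart
    # can load the checkpoint and roll the AOF forward.
    return checkpoint_covered_until
```

With the flag off, the result never moves past the last on-disk checkpoint, which is the property local recovery depends on.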

@vazois
Contributor Author

vazois commented Feb 11, 2025

> Currently, at the completion of a streaming checkpoint the AOF get safely truncated.
>
> if --main-memory-replication (to be renamed to --fast-aof-truncate) is on, we will truncate the AOF aggressively based on sending AOF to replicas, else we will truncate AOF until the last on-disk checkpoint, correct? This is important for the node to recover after restart by loading the last checkpoint from disk and rolling forward the AOF as usual.
>
> Our system reuses the same AOF for local roll-forward as what is used for replication, so this would be necessary for completeness.

Yes, that makes sense. I will add a guard for this based on the MMR flag and update the description.
Renaming should happen maybe later to have time to communicate the change to people that use the flag.

@TalZaccai TalZaccai requested a review from badrishc February 11, 2025 19:22
@badrishc
Contributor

> Renaming should happen maybe later

The concern is that this PR introduces 'diskless replication', which semantically conflicts with the existing 'main memory replication' and makes the two confusing. Could we add FastAofTruncate and still keep MMR, i.e., set FastAofTruncate = MMR if they specify MMR, so that it stays backwards compatible? Perhaps mark MMR as obsolete.
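
One way to picture the backwards-compatible alias is the sketch below. It is written in Python purely for illustration: `ReplicationOptions` and its field names are hypothetical and do not reflect how Garnet defines its options.

```python
import warnings
from dataclasses import dataclass


@dataclass
class ReplicationOptions:
    """Hypothetical options holder, not the real Garnet options class."""

    fast_aof_truncate: bool = False
    main_memory_replication: bool = False  # legacy name, kept for compatibility

    def __post_init__(self):
        if self.main_memory_replication:
            # The old flag still works but is reported as obsolete.
            warnings.warn(
                "main_memory_replication is obsolete; use fast_aof_truncate",
                DeprecationWarning,
                stacklevel=3,
            )
            # Specifying the legacy flag turns on the new one.
            self.fast_aof_truncate = True
```

Callers that still pass the legacy flag get the same behavior plus a deprecation warning, while new callers only need `fast_aof_truncate`.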
