-
Notifications
You must be signed in to change notification settings - Fork 7
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #149 from ijsong/rfc-commit-context
docs: draft on improving commit context
- Loading branch information
Showing
1 changed file
with
270 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,270 @@ | ||
- Feature: Commit context | ||
- Status: draft | ||
- Authors: @ijsong | ||
- Varlog Issue: #125 | ||
- RFC PR: | ||
|
||
# Summary | ||
|
||
A large amount of empty commit contexts slows down log stream recovery and | ||
consumes a significant amount of disk space. To solve these problems, this RFC | ||
proposes to store only the latest commit context rather than a sequence of | ||
commit contexts. From the log entries already stored in the local, the log | ||
stream can restore the local low and high watermark and the position where a | ||
log entry will save. It can also recover the highest version confirmed by the | ||
metadata repository from the latest commit context. | ||
|
||
# Background | ||
|
||
The log stream executor stores a series of commit contexts to help the cluster | ||
recover. It creates a commit context for each commit message sent from the | ||
metadata repository even if it does not contain logs, which we call "empty | ||
commit context." | ||
|
||
Previously, saving a series of commit contexts was useful while recovering a | ||
cluster. It helped a log stream report its status during bootstrapping or | ||
synchronizing and was also used for recovering a metadata repository that did | ||
not use WAL. | ||
|
||
However, a considerable number of empty commit contexts causes some problems: | ||
|
||
- They take up lots of disk space, although they have no information about log | ||
entries. | ||
- Restarting a log stream takes a long time since it full-scans commit contexts | ||
to look up a non-empty commit context. | ||
|
||
Keeping the least of commit contexts is necessary to save disk space and | ||
decrease bootstrap duration. | ||
|
||
## Report | ||
|
||
A log stream sends a report to the metadata repository; the report contains the | ||
followings: | ||
|
||
- The highest version the log stream replica knows | ||
- The local offset of log entries written but not committed | ||
- The length of log entries written but not committed | ||
|
||
The report tells the metadata repository how many log entries the log stream | ||
wants to commit. The metadata repository knows which offset and version the | ||
report must be for the log stream to store log entries sequentially. The log | ||
stream must not report the same offsets across different reports after | ||
committing any log entries. For example, if the log stream sent a report which | ||
has a log entry starting at offset 1 in version 1 and committed the log entry, | ||
the log stream must not report which has log entries starting at offset 1 in | ||
version 2. | ||
|
||
## Commit and its context | ||
|
||
The metadata repository sends a commit message to the log stream. The commit | ||
message contains the followings: | ||
|
||
- New version issued by the metadata repository | ||
- The local and global offset of log entries committed | ||
- The length of log entries committed | ||
|
||
The metadata repository announces how many logs the log stream can commit | ||
through the commit message. It tracks a unique tuple of `(version, | ||
local_offset, global_offset, length)` for each log stream. Once all | ||
replicas in a log stream accept the commit message, the metadata repository can | ||
trim the tuple already applied to the replicas in the log stream. How does the | ||
metadata repository know if all replicas in a log stream apply the commit | ||
message? As we see in the previous chapter, it decides by receiving all reports | ||
with the same version from the replicas. | ||
|
||
Commit context is the environment in which the metadata repository generates a | ||
commit. A log stream stores a commit context with committed log entries and | ||
restores the commit context during recovery. It makes the log stream send the | ||
metadata repository an accurate report after recovery. | ||
|
||
``` | ||
// https://github.com/kakao/varlog/blob/1340bba15a9ad52f9eb581e282617825da8a32d1/internal/storage/commit_context.go#L5-L11 | ||
type CommitContext struct { | ||
Version types.Version | ||
HighWatermark types.GLSN | ||
CommittedGLSNBegin types.GLSN | ||
CommittedGLSNEnd types.GLSN | ||
CommittedLLSNBegin types.LLSN | ||
} | ||
``` | ||
|
||
For tidiness, this proposal uses `version`, `local_offset`, `global_offset`, | ||
and `length` rather than `Version`, `CommittedLLSNBegin`, | ||
`CommittedGLSNBegin`, and `CommittedGLSNEnd-CommittedGLSNBegin`. | ||
|
||
## Empty commit context | ||
|
||
Empty commit context means a commit message has a higher version, but its | ||
length field is zero; it has no log entries committed. | ||
|
||
```mermaid | ||
sequenceDiagram | ||
autonumber | ||
participant LS1 as LogStream1 | ||
participant LS2 as LogStream2 | ||
participant MR as MetadataRepository | ||
LS1->>MR: Report(version=0, local_offset=1, length=1) | ||
LS2->>MR: Report(version=0, local_offset=1, length=1) | ||
MR->>LS1: Commit(version=1, global_offset=1, length=1) | ||
MR->>LS2: Commit(version=1, global_offset=2, length=1) | ||
LS1->>MR: Report(version=1, local_offset=2, length=0) | ||
LS2->>MR: Report(version=1, local_offset=2, length=1) | ||
MR->>LS1: Commit(version=2, global_offset=3, length=0) | ||
MR->>LS2: Commit(version=2, global_offset=3, length=1) | ||
``` | ||
|
||
Assume that there are two log streams in a cluster. Each has a written log | ||
entry; thus, they send the metadata repository reports (i.e., 1 and 2) | ||
indicating them. As a result of reports, the metadata repository replies to | ||
them with commit messages (i.e., 3 and 4) containing their log sequence number. | ||
After receiving the commit message, each log stream commits the written log | ||
with a commit context. For example, LogStream1 will store the following commit | ||
context: | ||
|
||
```go | ||
// LogStream1 | ||
commit_context{ | ||
version: 1, | ||
global_offset: 1, | ||
local_offset: 1, | ||
length: 1, | ||
} | ||
``` | ||
|
||
Next, LogStream1 has no written log entry. It, thus, sends reports whose length is | ||
zero (i.e., 5). The metadata repository generates the next version of a commit | ||
that issues a log sequence number only for LogStream2. LogStream1 receives a | ||
commit message (i.e., 7) with no committed log. After receiving this commit, | ||
LogStream1 will store an empty commit: | ||
|
||
```go | ||
// LogStream1 | ||
commit_context{ | ||
version: 2, | ||
global_offset: 3, | ||
local_offset: 2, | ||
length: 0, | ||
} | ||
``` | ||
|
||
As mentioned above, the commit context helped the log stream and metadata | ||
repository to recover. It will not be explained here because of out of scope. | ||
|
||
# Design | ||
|
||
## Invariant of the commit context | ||
|
||
A log stream stores commit context and committed log entries atomically. The | ||
commit context specifies the range of log entries; hence, `local_offset + | ||
length` in the commit context must be equal to the position of the following | ||
log entry. That is, `local_offset + length` in the commit context should equal | ||
the current local high watermark. We can guarantee that position of all | ||
committed log entries is less than `local_offset + length` in the commit | ||
context. | ||
|
||
## Recovery of log stream replica | ||
|
||
We need only the latest commit context to recover a log stream replica. | ||
|
||
1. Seek the first log entry, which is the `local_low_watermark.` | ||
2. Seek the last log entry, which is the `local_high_watermark.` | ||
3. Read the commit context. | ||
4. If the `local_offset + length` in the commit context is equal to the `local | ||
high watermark + 1`, recovery of the log stream replica completes; | ||
otherwise, it either needs synchronization or fails. | ||
|
||
Suppose the `local_offset + length` in the commit context is less than the | ||
`local_high_watermark + 1`. In that case, the replica cannot usually work since | ||
it breaks the above invariant. If the recovery fails, the replica can remove | ||
the erroneous data, including the commit context, and the log stream will | ||
synchronize after that. | ||
|
||
On the other hand, if the `local_offset + length` in the commit context is | ||
greater than the `local_high_watermark + 1`, it also fails. It means that some | ||
suffix log entries do not exist. Because the log stream deletes its prefix | ||
entries, suffix entries should remain rather than prefix entries. | ||
|
||
## Synchronization | ||
|
||
Replicas in a log stream should have duplicate log entries. If a replica has | ||
less than the others, it should receive missing log entries from the other; it | ||
is called synchronization. | ||
|
||
```mermaid | ||
sequenceDiagram | ||
autonumber | ||
participant Source | ||
participant Destination | ||
Source->>Destination: SyncInitRequest(first=1, last=10) | ||
Destination->>Source: SyncInitResponse(first=6, last=10) | ||
loop Each log entry in seq 6..10 | ||
Source->>Destination: SyncReplicate | ||
end | ||
``` | ||
|
||
Before copying the log entries, two replicas have to negotiate a range of log | ||
entries to copy; the RPC SyncInit does that. In the above sequence diagram, the | ||
destination replica already has log entries from 1 to 5; thus, it replies | ||
SyncInit with a range of missing log entries from 6 to 10. Then, it receives | ||
log entries through the RPC SyncReplicate. | ||
|
||
When a destination replica receives a range of log entries from the source | ||
replica, the range might not be sequential with the ones already had. | ||
Concretely, the starting log sequence number of the range sent from the source | ||
can be greater than the following sequence number of the replica's last log | ||
entry. It means that the Trim operation removed the prefix logs. The | ||
destination should wipe out all logs and commit context in this case. | ||
|
||
```mermaid | ||
sequenceDiagram | ||
autonumber | ||
participant Source | ||
participant Destination | ||
Source->>Destination: SyncInitRequest(first=11, last=20) | ||
Destination->>Destination: Remove commit context and log entries | ||
Destination->>Source: SyncInitResponse(first=11, last=20) | ||
loop Each log entry in seq 11..20 | ||
Source->>Destination: SyncReplicate | ||
end | ||
``` | ||
|
||
The RPC SyncReplicate has only a single payload, either a commit context or a | ||
log entry. It has some problems too: | ||
|
||
- Copying log entries one by one is not efficient. | ||
- It transfers a commit context separated from log entries. | ||
|
||
The first relates to performance, and well-known approaches such as batch or | ||
stream can improve that. The second can cause a subtle issue described on | ||
[Recovery of log stream replica](#recovery-of-log-stream-replica). Network failure might prevent the | ||
synchronization from transferring either a commit context or log entries.; it | ||
messes up the invariant of the commit context. This problem happens unless | ||
synchronization simultaneously clones all log entries and a commit context. | ||
Thus, we need to remove data as few as possible when the invariant fails. | ||
|
||
## Report of replica | ||
|
||
Previously, a log stream replica sent a report to the metadata repository | ||
regardless of its status. However, this proposal does not while a replica | ||
synchronizes with another replica. Because the source replica will have only | ||
the latest commit context, it will not send old ones. The destination replica | ||
cannot send accurate reports without knowing the old commit contexts. So, the | ||
metadata repository should allow silent replicas for a while. | ||
|
||
Note that the replica can be silent during startup since it can also be invalid | ||
while recovering. | ||
|
||
## How to store a commit context | ||
|
||
Previously, a sequence of commit contexts is stored separately from log | ||
entries, which is okay. However, we can integrate a commit context with log | ||
entries. The only essential and tiniest information should be stored to avoid | ||
space overhead by duplicating the commit context in each log entry. For | ||
instance, the version and commit offset in the committed log entries. Even the | ||
byte encoding can decrease their size. |