forked from kakao/varlog
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path20220915_commit_context.md
270 lines (215 loc) · 10.8 KB
/
20220915_commit_context.md
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
- Feature: Commit context
- Status: draft
- Authors: @ijsong
- Varlog Issue: #125
- RFC PR:
# Summary
A large amount of empty commit contexts slows down log stream recovery and
consumes a significant amount of disk space. To solve these problems, this RFC
proposes to store only the latest commit context rather than a sequence of
commit contexts. From the log entries already stored in the local, the log
stream can restore the local low and high watermark and the position where a
log entry will save. It can also recover the highest version confirmed by the
metadata repository from the latest commit context.
# Background
The log stream executor stores a series of commit contexts to help the cluster
recover. It creates a commit context for each commit message sent from the
metadata repository even if it does not contain logs, which we call "empty
commit context."
Previously, saving a series of commit contexts was useful while recovering a
cluster. It helped a log stream report its status during bootstrapping or
synchronizing and was also used for recovering a metadata repository that did
not use WAL.
However, a considerable number of empty commit contexts causes some problems:
- They take up lots of disk space, although they have no information about log
entries.
- Restarting a log stream takes a long time since it full-scans commit contexts
to look up a non-empty commit context.
Keeping the least of commit contexts is necessary to save disk space and
decrease bootstrap duration.
## Report
A log stream sends a report to the metadata repository; the report contains the
followings:
- The highest version the log stream replica knows
- The local offset of log entries written but not committed
- The length of log entries written but not committed
The report tells the metadata repository how many log entries the log stream
wants to commit. The metadata repository knows which offset and version the
report must be for the log stream to store log entries sequentially. The log
stream must not report the same offsets across different reports after
committing any log entries. For example, if the log stream sent a report which
has a log entry starting at offset 1 in version 1 and committed the log entry,
the log stream must not report which has log entries starting at offset 1 in
version 2.
## Commit and its context
The metadata repository sends a commit message to the log stream. The commit
message contains the followings:
- New version issued by the metadata repository
- The local and global offset of log entries committed
- The length of log entries committed
The metadata repository announces how many logs the log stream can commit
through the commit message. It tracks a unique tuple of `(version,
local_offset, global_offset, length)` for each log stream. Once all
replicas in a log stream accept the commit message, the metadata repository can
trim the tuple already applied to the replicas in the log stream. How does the
metadata repository know if all replicas in a log stream apply the commit
message? As we see in the previous chapter, it decides by receiving all reports
with the same version from the replicas.
Commit context is the environment in which the metadata repository generates a
commit. A log stream stores a commit context with committed log entries and
restores the commit context during recovery. It makes the log stream send the
metadata repository an accurate report after recovery.
```
// https://github.com/kakao/varlog/blob/1340bba15a9ad52f9eb581e282617825da8a32d1/internal/storage/commit_context.go#L5-L11
type CommitContext struct {
Version types.Version
HighWatermark types.GLSN
CommittedGLSNBegin types.GLSN
CommittedGLSNEnd types.GLSN
CommittedLLSNBegin types.LLSN
}
```
For tidiness, this proposal uses `version`, `local_offset`, `global_offset`,
and `length` rather than `Version`, `CommittedLLSNBegin`,
`CommittedGLSNBegin`, and `CommittedGLSNEnd-CommittedGLSNBegin`.
## Empty commit context
Empty commit context means a commit message has a higher version, but its
length field is zero; it has no log entries committed.
```mermaid
sequenceDiagram
autonumber
participant LS1 as LogStream1
participant LS2 as LogStream2
participant MR as MetadataRepository
LS1->>MR: Report(version=0, local_offset=1, length=1)
LS2->>MR: Report(version=0, local_offset=1, length=1)
MR->>LS1: Commit(version=1, global_offset=1, length=1)
MR->>LS2: Commit(version=1, global_offset=2, length=1)
LS1->>MR: Report(version=1, local_offset=2, length=0)
LS2->>MR: Report(version=1, local_offset=2, length=1)
MR->>LS1: Commit(version=2, global_offset=3, length=0)
MR->>LS2: Commit(version=2, global_offset=3, length=1)
```
Assume that there are two log streams in a cluster. Each has a written log
entry; thus, they send the metadata repository reports (i.e., 1 and 2)
indicating them. As a result of reports, the metadata repository replies to
them with commit messages (i.e., 3 and 4) containing their log sequence number.
After receiving the commit message, each log stream commits the written log
with a commit context. For example, LogStream1 will store the following commit
context:
```go
// LogStream1
commit_context{
version: 1,
global_offset: 1,
local_offset: 1,
length: 1,
}
```
Next, LogStream1 has no written log entry. It, thus, sends reports whose length is
zero (i.e., 5). The metadata repository generates the next version of a commit
that issues a log sequence number only for LogStream2. LogStream1 receives a
commit message (i.e., 7) with no committed log. After receiving this commit,
LogStream1 will store an empty commit:
```go
// LogStream1
commit_context{
version: 2,
global_offset: 3,
local_offset: 2,
length: 0,
}
```
As mentioned above, the commit context helped the log stream and metadata
repository to recover. It will not be explained here because of out of scope.
# Design
## Invariant of the commit context
A log stream stores commit context and committed log entries atomically. The
commit context specifies the range of log entries; hence, `local_offset +
length` in the commit context must be equal to the position of the following
log entry. That is, `local_offset + length` in the commit context should equal
the current local high watermark. We can guarantee that position of all
committed log entries is less than `local_offset + length` in the commit
context.
## Recovery of log stream replica
We need only the latest commit context to recover a log stream replica.
1. Seek the first log entry, which is the `local_low_watermark.`
2. Seek the last log entry, which is the `local_high_watermark.`
3. Read the commit context.
4. If the `local_offset + length` in the commit context is equal to the `local
high watermark + 1`, recovery of the log stream replica completes;
otherwise, it either needs synchronization or fails.
Suppose the `local_offset + length` in the commit context is less than the
`local_high_watermark + 1`. In that case, the replica cannot usually work since
it breaks the above invariant. If the recovery fails, the replica can remove
the erroneous data, including the commit context, and the log stream will
synchronize after that.
On the other hand, if the `local_offset + length` in the commit context is
greater than the `local_high_watermark + 1`, it also fails. It means that some
suffix log entries do not exist. Because the log stream deletes its prefix
entries, suffix entries should remain rather than prefix entries.
## Synchronization
Replicas in a log stream should have duplicate log entries. If a replica has
less than the others, it should receive missing log entries from the other; it
is called synchronization.
```mermaid
sequenceDiagram
autonumber
participant Source
participant Destination
Source->>Destination: SyncInitRequest(first=1, last=10)
Destination->>Source: SyncInitResponse(first=6, last=10)
loop Each log entry in seq 6..10
Source->>Destination: SyncReplicate
end
```
Before copying the log entries, two replicas have to negotiate a range of log
entries to copy; the RPC SyncInit does that. In the above sequence diagram, the
destination replica already has log entries from 1 to 5; thus, it replies
SyncInit with a range of missing log entries from 6 to 10. Then, it receives
log entries through the RPC SyncReplicate.
When a destination replica receives a range of log entries from the source
replica, the range might not be sequential with the ones already had.
Concretely, the starting log sequence number of the range sent from the source
can be greater than the following sequence number of the replica's last log
entry. It means that the Trim operation removed the prefix logs. The
destination should wipe out all logs and commit context in this case.
```mermaid
sequenceDiagram
autonumber
participant Source
participant Destination
Source->>Destination: SyncInitRequest(first=11, last=20)
Destination->>Destination: Remove commit context and log entries
Destination->>Source: SyncInitResponse(first=11, last=20)
loop Each log entry in seq 11..20
Source->>Destination: SyncReplicate
end
```
The RPC SyncReplicate has only a single payload, either a commit context or a
log entry. It has some problems too:
- Copying log entries one by one is not efficient.
- It transfers a commit context separated from log entries.
The first relates to performance, and well-known approaches such as batch or
stream can improve that. The second can cause a subtle issue described on
[Recovery of log stream replica](#recovery-of-log-stream-replica). Network failure might prevent the
synchronization from transferring either a commit context or log entries.; it
messes up the invariant of the commit context. This problem happens unless
synchronization simultaneously clones all log entries and a commit context.
Thus, we need to remove data as few as possible when the invariant fails.
## Report of replica
Previously, a log stream replica sent a report to the metadata repository
regardless of its status. However, this proposal does not while a replica
synchronizes with another replica. Because the source replica will have only
the latest commit context, it will not send old ones. The destination replica
cannot send accurate reports without knowing the old commit contexts. So, the
metadata repository should allow silent replicas for a while.
Note that the replica can be silent during startup since it can also be invalid
while recovering.
## How to store a commit context
Previously, a sequence of commit contexts is stored separately from log
entries, which is okay. However, we can integrate a commit context with log
entries. The only essential and tiniest information should be stored to avoid
space overhead by duplicating the commit context in each log entry. For
instance, the version and commit offset in the committed log entries. Even the
byte encoding can decrease their size.