docs: add doc about the local disk based WAL (#140)

Co-authored-by: Jiacai Liu <dev@liujiacai.net>
apache · Oct 18, 2024 · 9692866 · 9692866
1 parent e393a13
commit 9692866
Show file tree

Hide file tree

Showing 2 changed files with 282 additions and 0 deletions.
diff --git a/content/cn/docs/design/wal_on_disk.md b/content/cn/docs/design/wal_on_disk.md
@@ -0,0 +1,140 @@
+---
+title: "基于本地磁盘的 WAL"
+---
+
+## 架构
+
+本节将介绍基于本地磁盘的单机版 WAL（Write-Ahead Log，以下简称日志）的实现。在此实现中，日志按 region 级别进行管理。
+
+```
+            ┌────────────────────────────┐
+            │          HoraeDB           │
+            │                            │
+            │ ┌────────────────────────┐ │
+            │ │          WAL           │ │         ┌────────────────────────┐
+            │ │                        │ │         │                        │
+            │ │         ......         │ │         │      File System       │
+            │ │                        │ │         │                        │
+            │ │ ┌────────────────────┐ │ │ manage  │ ┌────────────────────┐ │
+ Write ─────┼─┼─►       Region       ├─┼─┼─────────┼─►     Region Dir     │ │
+            │ │ │                    │ │ │         │ │                    │ │
+ Read  ─────┼─┼─►   ┌────────────┐   │ │ │  mmap   │ │ ┌────────────────┐ │ │
+            │ │ │   │  Segment 0 ├───┼─┼─┼─────────┼─┼─► Segment File 0 │ │ │
+            │ │ │   └────────────┘   │ │ │         │ │ └────────────────┘ │ │
+Delete ─────┼─┼─►   ┌────────────┐   │ │ │  mmap   │ │ ┌────────────────┐ │ │
+            │ │ │   │  Segment 1 ├───┼─┼─┼─────────┼─┼─► SegmenteFile 1 │ │ │
+            │ │ │   └────────────┘   │ │ │         │ │ └────────────────┘ │ │
+            │ │ │   ┌────────────┐   │ │ │  mmap   │ │ ┌────────────────┐ │ │
+            │ │ │   │  Segment 2 ├───┼─┼─┼─────────┼─┼─► SegmenteFile 2 │ │ │
+            │ │ │   └────────────┘   │ │ │         │ │ └────────────────┘ │ │
+            │ │ │       ......       │ │ │         │ │       ......       │ │
+            │ │ └────────────────────┘ │ │         │ └────────────────────┘ │
+            │ │         ......         │ │         │         ......         │
+            │ └────────────────────────┘ │         └────────────────────────┘
+            └────────────────────────────┘
+```
+
+## 数据模型
+
+### 文件路径
+
+每个 region 都拥有一个目录，用于管理该 region 的所有 segment。目录名为 region 的 ID。每个 segment 的命名方式为 `seg_<id>`，ID 从 0 开始递增。
+
+### Segment 的格式
+
+一个 region 中所有表的日志都存储在 segments 中，并按照 sequence number 从小到大排列。segment 文件的结构如下：
+
+```
+   Segment0            Segment1
+┌────────────┐      ┌────────────┐
+│ Magic Num  │      │ Magic Num  │
+├────────────┤      ├────────────┤
+│   Record   │      │   Record   │
+├────────────┤      ├────────────┤
+│   Record   │      │   Record   │
+├────────────┤      ├────────────┤   ....
+│   Record   │      │   Record   │
+├────────────┤      ├────────────┤
+│     ...    │      │     ...    │
+│            │      │            │
+└────────────┘      └────────────┘
+    seg_0               seg_1
+```
+
+在内存中，每个 segment 还会存储一些额外的信息以供读写和删除操作使用：
+
+```
+pub struct Segment {
+    /// A hashmap storing both min and max sequence numbers of records within
+    /// this segment for each `TableId`.
+    table_ranges: HashMap<TableId, (SequenceNumber, SequenceNumber)>,
+
+    /// An optional vector of positions within the segment.
+    record_position: Vec<Position>,
+
+    ...
+}
+```
+
+### 日志格式
+
+segment 中的日志格式如下：
+
+```
++---------+--------+------------+--------------+--------------+-------+
+| version |  crc   |  table id  | sequence num | value length | value |
+|  (u8)   | (u32)  |   (u64)    |    (u64)     |     (u32)    |(bytes)|
++---------+--------+------------+--------------+--------------+-------+
+```
+
+字段说明：
+
+1. `version`：日志版本号。
+
+2. `crc`：用于确保数据一致性。计算从 table id 到该记录结束的 CRC 校验值。
+
+3. `table id`：表的唯一标识符。
+
+4. `sequence num`：记录的序列号。
+
+5. `value length`：value 的字节长度。
+
+6. `value`：通用日志格式中的值。
+
+日志中不存储 region ID，因为可以通过文件路径获取该信息。
+
+## 主要流程
+
+### 打开 Wal
+
+1. 识别 Wal 目录下的所有 region 目录。
+
+2. 在每个 region 目录下，识别所有 segment 文件。
+
+3. 打开每个 segment 文件，遍历其中的所有日志，记录其中每个日志开始和结束的偏移量和每个 `TableId` 在该 segment 中的最小和最大序列号，然后关闭文件。
+
+4. 如果不存在 region 目录或目录下没有任何 segment 文件，则自动创建相应的目录和文件。
+
+### 读日志
+
+1. 根据 segment 的元数据，确定本次读取操作涉及的所有 segment。
+2. 按照 id 从小到大的顺序，依次打开这些 segment，将原始字节解码为日志。
+
+### 写日志
+
+1. 将待写入的日志序列化为字节数据，追加到 id 最大的 segment 文件中。
+2. 每个 segment 创建时预分配固定大小的 64MB，不会动态改变。当预分配的空间用完后，创建一个新的 segment，并切换到新的 segment 继续追加。
+
+3. 每次追加后不会立即调用 flush；默认情况下，每写入十次或在 segment 文件关闭时才执行 flush。
+
+4. 在内存中更新 segment 的元数据 `table_ranges`。
+
+### 删除日志
+
+假设需要将 id 为 `table_id` 的表中，序列号小于 seq_num 的日志标记为删除：
+
+1. 在内存中更新相关 segment 的 `table_ranges` 字段，将该表的最小序列号更新为 seq_num + 1。
+
+2. 如果修改后，该表在此 segment 中的最小序列号大于最大序列号，则从 `table_ranges` 中删除该表。
+
+3. 如果一个 segment 的 `table_ranges` 为空，且不是 id 最大的 segment，则删除该 segment 文件。
diff --git a/content/en/docs/design/wal_on_disk.md b/content/en/docs/design/wal_on_disk.md
@@ -0,0 +1,142 @@
+---
+title: "WAL on Disk"
+---
+
+## Architecture
+
+This section introduces the implementation of a standalone Write-Ahead Log (WAL, hereinafter referred to as "the log") based on a local disk. In this implementation, the log is managed at the region level.
+
+```
+            ┌────────────────────────────┐
+            │          HoraeDB           │
+            │                            │
+            │ ┌────────────────────────┐ │
+            │ │          WAL           │ │         ┌────────────────────────┐
+            │ │                        │ │         │                        │
+            │ │         ......         │ │         │      File System       │
+            │ │                        │ │         │                        │
+            │ │ ┌────────────────────┐ │ │ manage  │ ┌────────────────────┐ │
+ Write ─────┼─┼─►       Region       ├─┼─┼─────────┼─►     Region Dir     │ │
+            │ │ │                    │ │ │         │ │                    │ │
+ Read  ─────┼─┼─►   ┌────────────┐   │ │ │  mmap   │ │ ┌────────────────┐ │ │
+            │ │ │   │  Segment 0 ├───┼─┼─┼─────────┼─┼─► Segment File 0 │ │ │
+            │ │ │   └────────────┘   │ │ │         │ │ └────────────────┘ │ │
+Delete ─────┼─┼─►   ┌────────────┐   │ │ │  mmap   │ │ ┌────────────────┐ │ │
+            │ │ │   │  Segment 1 ├───┼─┼─┼─────────┼─┼─► Segment File 1 │ │ │
+            │ │ │   └────────────┘   │ │ │         │ │ └────────────────┘ │ │
+            │ │ │   ┌────────────┐   │ │ │  mmap   │ │ ┌────────────────┐ │ │
+            │ │ │   │  Segment 2 ├───┼─┼─┼─────────┼─┼─► Segment File 2 │ │ │
+            │ │ │   └────────────┘   │ │ │         │ │ └────────────────┘ │ │
+            │ │ │       ......       │ │ │         │ │       ......       │ │
+            │ │ └────────────────────┘ │ │         │ └────────────────────┘ │
+            │ │         ......         │ │         │         ......         │
+            │ └────────────────────────┘ │         └────────────────────────┘
+            └────────────────────────────┘
+```
+
+## Data Model
+
+### File Paths
+
+Each region has its own directory to manage all segments for that region. The directory is named after the region's ID. Each segment is named using the format `seg_<id>`, with IDs starting from 0 and incrementing.
+
+### Segment Format
+
+Logs for all tables within a region are stored in segments, arranged in ascending order of sequence numbers. The structure of the segment files is as follows:
+
+```
+   Segment0            Segment1
+┌────────────┐      ┌────────────┐
+│ Magic Num  │      │ Magic Num  │
+├────────────┤      ├────────────┤
+│   Record   │      │   Record   │
+├────────────┤      ├────────────┤
+│   Record   │      │   Record   │
+├────────────┤      ├────────────┤   ....
+│   Record   │      │   Record   │
+├────────────┤      ├────────────┤
+│     ...    │      │     ...    │
+│            │      │            │
+└────────────┘      └────────────┘
+    seg_0               seg_1
+```
+
+In memory, each segment stores additional information used for read, write, and delete operations:
+
+```rust
+pub struct Segment {
+    /// A hashmap storing both min and max sequence numbers of records within
+    /// this segment for each `TableId`.
+    table_ranges: HashMap<TableId, (SequenceNumber, SequenceNumber)>,
+
+    /// An optional vector of positions within the segment.
+    record_position: Vec<Position>,
+
+    ...
+}
+```
+
+### Log Format
+
+The log format within a segment is as follows:
+
+```
++---------+--------+------------+--------------+--------------+-------+
+| version |  crc   |  table id  | sequence num | value length | value |
+|  (u8)   | (u32)  |   (u64)    |    (u64)     |     (u32)    |(bytes)|
++---------+--------+------------+--------------+--------------+-------+
+```
+
+Field Descriptions:
+
+1. `version`: Log version number.
+
+2. `crc`: Used to ensure data consistency. Computes the CRC checksum from the table id to the end of the record.
+
+3. `table id`: The unique identifier of the table.
+
+4. `sequence num`: The sequence number of the record.
+
+5. `value length`: The byte length of the value.
+
+6. `value`: The value in the general log format.
+
+The region ID is not stored in the log because it can be obtained from the file path.
+
+## Main Processes
+
+### Opening the WAL
+
+1. Identify all region directories under the WAL directory.
+
+2. In each region directory, identify all segment files.
+
+3. Open each segment file, traverse all logs within it, record the start and end offsets of each log, and record the minimum and maximum sequence numbers of each `TableId` in the segment, then close the file.
+
+4. If there is no region directory or there are no segment files under the directory, automatically create the corresponding directory and files.
+
+### Reading Logs
+
+1. Based on the metadata of the segments, determine all segments involved in the current read operation.
+
+2. Open these segments in order of their IDs from smallest to largest, and decode the raw bytes into logs.
+
+### Writing Logs
+
+1. Serialize the logs to be written into byte data and append them to the segment file with the largest ID.
+
+2. When a segment is created, it pre-allocates a fixed size of 64MB and will not change dynamically. When the pre-allocated space is used up, a new segment is created, and appending continues in the new segment.
+
+3. After each append, `flush` is not called immediately; by default, `flush` is performed every ten writes or when the segment file is closed.
+
+4. Update the segment's metadata `table_ranges` in memory.
+
+### Deleting Logs
+
+Suppose logs in the table with ID `table_id` and sequence numbers less than `seq_num` need to be marked as deleted:
+
+1. Update the `table_ranges` field of the relevant segments in memory, updating the minimum sequence number of the table to `seq_num + 1`.
+
+2. If after modification, the minimum sequence number of the table in this segment is greater than the maximum sequence number, remove the table from `table_ranges`.
+
+3. If a segment's `table_ranges` is empty and it is not the segment with the largest ID, delete the segment file.