Commit: finish 2.7

Signed-off-by: Alex Chi <iskyzh@gmail.com>

skyzh committed Jan 28, 2024
1 parent b964793 commit b4485f4
Showing 15 changed files with 165 additions and 36 deletions.
2 changes: 1 addition & 1 deletion mini-lsm-book/src/SUMMARY.md
@@ -20,7 +20,7 @@
- [Leveled Compaction Strategy](./week2-04-leveled.md)
- [Manifest](./week2-05-manifest.md)
- [Write-Ahead Log (WAL)](./week2-06-wal.md)
- [Snack Time: Batch Write and Checksums (WIP)](./week2-07-snacks.md)
- [Snack Time: Batch Write and Checksums](./week2-07-snacks.md)

- [Week 3 Overview: MVCC (WIP)](./week3-overview.md)
- [Timestamp Encoding + Refactor](./week3-01-ts-key-refactor.md)
9 changes: 9 additions & 0 deletions mini-lsm-book/src/week1-07-sst-optimizations.md
@@ -89,6 +89,15 @@ src/lsm_storage.rs

For the bloom filter encoding, you can append the bloom filter to the end of your SST file. You will need to store the bloom filter offset at the end of the file, and compute meta offsets accordingly.

```plaintext
-----------------------------------------------------------------------------------------------------
|         Block Section         |                           Meta Section                            |
-----------------------------------------------------------------------------------------------------
| data block | ... | data block | metadata | meta block offset | bloom filter | bloom filter offset |
|                               |  varlen  |        u32        |    varlen    |         u32         |
-----------------------------------------------------------------------------------------------------
```
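
For the read path, the sketch below shows how `SsTable::open` can locate the two footer fields by reading backwards from the end of the file; it mirrors the reference solution in this commit (`mini-lsm/src/table.rs`) and uses the tutorial's `FileObject::read(offset, len)` API, with `bytes::Buf` in scope for `get_u32`. Variable names are otherwise illustrative.

```rust
// Locate the bloom filter and the block metadata from the footer shown above.
let len = file.size();
// the last 4 bytes of the file are the bloom filter offset
let raw_bloom_offset = file.read(len - 4, 4)?;
let bloom_offset = (&raw_bloom_offset[..]).get_u32() as u64;
let raw_bloom = file.read(bloom_offset, len - 4 - bloom_offset)?;
// the 4 bytes right before the bloom filter are the meta block offset
let raw_meta_offset = file.read(bloom_offset - 4, 4)?;
let block_meta_offset = (&raw_meta_offset[..]).get_u32() as u64;
let raw_meta = file.read(block_meta_offset, bloom_offset - 4 - block_meta_offset)?;
```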

We use the `farmhash` crate to compute the hashes of the keys. When building the SST, you will also need to build the bloom filter by computing the key hashes using `farmhash::fingerprint32`. You will need to encode/decode the bloom filter along with the block meta. You can choose a false positive rate of 0.01 for your bloom filter. You may need to add new fields to the structures beyond the ones provided in the starter code, as necessary.
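
A minimal sketch of the build-side changes is below. It assumes your builder gains a `key_hashes: Vec<u32>` field (a new field, not in the starter code) and that your `Bloom` exposes the helpers from the provided `bloom.rs` (`bloom_bits_per_key` and `build_from_key_hashes`; check their exact names in your copy).

```rust
// Fragment: the two pieces you need, shown together for a handful of keys.
fn build_bloom_example() -> Bloom {
    let keys: Vec<&[u8]> = vec![b"key1".as_slice(), b"key2".as_slice(), b"key3".as_slice()];

    // in `SsTableBuilder::add`: remember the hash of every key
    let key_hashes: Vec<u32> = keys.iter().map(|key| farmhash::fingerprint32(key)).collect();

    // in `SsTableBuilder::build`: size the filter for a 1% false positive rate
    let bits_per_key = Bloom::bloom_bits_per_key(key_hashes.len(), 0.01);
    Bloom::build_from_key_hashes(&key_hashes, bits_per_key)
    // then encode the filter after the meta section and record its offset,
    // following the layout above
}
```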

After that, you can modify the `get` read path to filter SSTs based on bloom filters.
10 changes: 10 additions & 0 deletions mini-lsm-book/src/week2-05-manifest.md
@@ -21,8 +21,18 @@ src/manifest.rs

We encode the manifest records using JSON. You may use `serde_json::to_vec` to encode a manifest record to JSON, write it to the manifest file, and do an fsync. When you read from the manifest file, you may use `serde_json::Deserializer::from_slice`, which returns a stream of records. You do not need to store the record length, as `serde_json` can automatically find the boundaries between records.

The manifest format is like:

```
| JSON record | JSON record | JSON record | JSON record |
```

Again, note that we do not record the information of how many bytes each record has.
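
A minimal sketch of both paths is shown below, assuming a `ManifestRecord` enum that derives `Serialize`/`Deserialize`; the enum variants and function names here are illustrative, not the starter code's.

```rust
use std::io::{Read, Write};

use anyhow::Result;
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
enum ManifestRecord {
    Flush(usize),
    NewMemtable(usize),
}

fn add_record(file: &mut std::fs::File, record: &ManifestRecord) -> Result<()> {
    let buf = serde_json::to_vec(record)?; // one JSON object per record
    file.write_all(&buf)?;
    file.sync_all()?; // fsync so the record survives a crash
    Ok(())
}

fn recover(path: &std::path::Path) -> Result<Vec<ManifestRecord>> {
    let mut buf = Vec::new();
    std::fs::File::open(path)?.read_to_end(&mut buf)?;
    // serde_json finds the record boundaries by itself -- no length prefix needed
    let stream = serde_json::Deserializer::from_slice(&buf).into_iter::<ManifestRecord>();
    let mut records = Vec::new();
    for record in stream {
        records.push(record?);
    }
    Ok(records)
}
```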

After the engine runs for several hours, the manifest file might get very large. At that point, you may periodically compact the manifest file to store only the current snapshot and truncate the old records. This is an optimization you may implement as one of the bonus tasks.

## Task 2: Write Manifests

You can now go ahead and modify your LSM engine to write manifests when necessary. In this task, you will need to modify:
2 changes: 2 additions & 0 deletions mini-lsm-book/src/week2-06-wal.md
@@ -25,6 +25,8 @@ The WAL encoding is simply a list of key-value pairs.

You will also need to implement the `recover` function to read the WAL and recover the state of a memtable.

Note that we are using a `BufWriter` for writing the WAL. Using a `BufWriter` reduces the number of syscalls into the OS, which reduces the latency of the write path. The data is not guaranteed to be written to the disk when the user modifies a key. Instead, the engine only guarantees that the data is persisted when `sync` is called. To correctly persist the data to the disk, you will need to first flush the data from the buffer writer to the file object by calling `flush()`, and then do an fsync on the file using `get_mut().sync_all()`.
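
A minimal sketch of the resulting `sync` path is below; it mirrors the reference solution in this commit (`mini-lsm-mvcc/src/wal.rs`), with the struct definition shown only to make the snippet self-contained.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};
use std::sync::Arc;

use anyhow::Result;
use parking_lot::Mutex;

struct Wal {
    file: Arc<Mutex<BufWriter<File>>>,
}

impl Wal {
    fn sync(&self) -> Result<()> {
        let mut file = self.file.lock();
        file.flush()?;              // push buffered bytes into the OS
        file.get_mut().sync_all()?; // fsync: ask the OS to persist them to disk
        Ok(())
    }
}
```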

## Task 2: Integrate WALs

In this task, you will need to modify:
90 changes: 88 additions & 2 deletions mini-lsm-book/src/week2-07-snacks.md
@@ -9,16 +9,102 @@ In this chapter, you will:
* Implement the batch write interface.
* Add checksums to the blocks, SST metadata, manifest, and WALs.

**Note: We do not have unit tests for this chapter. As long as you pass all previous tests and ensure checksums are written to your files, it will be fine.**

## Task 1: Write Batch Interface

In this task, we will prepare for week 3 of this tutorial by adding a write batch API. You will need to modify:

```
src/lsm_storage.rs
```

The user provides `write_batch` with a batch of records to be written to the database. The records are `WriteBatchRecord<T: AsRef<[u8]>>`, so the keys and values can be `Bytes`, `&[u8]`, or `Vec<u8>`. There are two types of records: delete and put. You may handle them in the same way as your `put` and `delete` functions.

After that, you may refactor your original `put` and `delete` functions to call `write_batch`.
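
A sketch of the dispatch loop is below. It assumes `WriteBatchRecord` has variants like `Put(key, value)` and `Del(key)` (check the definition in your starter code); `put_inner` and `delete_inner` are hypothetical helpers that hold your existing single-key logic.

```rust
// Fragment of the `impl` block for the storage engine.
pub fn write_batch<T: AsRef<[u8]>>(&self, batch: &[WriteBatchRecord<T>]) -> Result<()> {
    for record in batch {
        match record {
            // same logic as your current `put`
            WriteBatchRecord::Put(key, value) => self.put_inner(key.as_ref(), value.as_ref())?,
            // same logic as your current `delete` (write an empty value)
            WriteBatchRecord::Del(key) => self.delete_inner(key.as_ref())?,
        }
    }
    Ok(())
}
```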

You should pass all test cases in previous chapters after implementing this functionality.

## Task 2: Block Checksum

In this task, you will need to add a block checksum at the end of each block when encoding the SST. You will need to modify:

```
src/table/builder.rs
src/table.rs
```

The format of the SST will be changed to:

```plaintext
---------------------------------------------------------------------------------------------------------------------------
|                    Block Section                    |                           Meta Section                            |
---------------------------------------------------------------------------------------------------------------------------
| data block | checksum | ... | data block | checksum | metadata | meta block offset | bloom filter | bloom filter offset |
|   varlen   |   u32    |     |   varlen   |   u32    |  varlen  |        u32        |    varlen    |         u32         |
---------------------------------------------------------------------------------------------------------------------------
```

We use crc32 as our checksum algorithm. You can use `crc32fast::hash` to generate the checksum for the block after building a block.

Usually, when the user specifies a target block size in the storage options, that size should include both the block content and the checksum. For example, if the target block size is 4096 bytes and the checksum takes 4 bytes, the actual target size of the block content should be 4092 bytes. However, to avoid breaking previous test cases and for simplicity, this tutorial will **still** use the target block size as the target content size, and simply append the checksum at the end of the block.

When you read a block in `read_block`, you should verify the checksum and generate the slices for the block content accordingly. You should pass all test cases in previous chapters after implementing this functionality.
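
A hedged sketch of both halves is below; the function names are illustrative. You would wire the write half into `SsTableBuilder` and the read half into `read_block`.

```rust
use anyhow::{bail, Result};
use bytes::{Buf, BufMut};

// write path: append the encoded block followed by a crc32 of its contents
fn append_block_with_checksum(sst_data: &mut Vec<u8>, encoded_block: &[u8]) {
    sst_data.extend_from_slice(encoded_block);
    sst_data.put_u32(crc32fast::hash(encoded_block));
}

// read path: split off the trailing 4-byte checksum and verify it before
// decoding the block content
fn verify_block_checksum(block_with_checksum: &[u8]) -> Result<&[u8]> {
    let (block_data, mut raw_checksum) =
        block_with_checksum.split_at(block_with_checksum.len() - 4);
    if raw_checksum.get_u32() != crc32fast::hash(block_data) {
        bail!("block checksum mismatched");
    }
    Ok(block_data)
}
```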

## Task 3: SST Meta Checksum

In this task, you will need to add checksums for the bloom filter and the block metadata. You will need to modify:

```
src/table/builder.rs
src/table.rs
src/bloom.rs
```

```plaintext
----------------------------------------------------------------------------------------------------------
|                                              Meta Section                                               |
----------------------------------------------------------------------------------------------------------
| no. of block | metadata | checksum | meta block offset | bloom filter | checksum | bloom filter offset |
|     u32      |  varlen  |   u32    |        u32        |    varlen    |   u32    |         u32         |
----------------------------------------------------------------------------------------------------------
```

You will need to add a checksum at the end of the bloom filter in `Bloom::encode` and `Bloom::decode`. Note that most of our APIs take an existing buffer that the implementation will write into, for example, `Bloom::encode`. Therefore, you should record the offset of the beginning of the bloom filter before writing the encoded content, and only checksum the bloom filter itself instead of the whole buffer.

After that, you can add a checksum at the end of the block metadata. You might find it helpful to also store the number of block metadata entries at the beginning of the section, so that it is easier to know where to stop when decoding them.
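
One possible framing of the meta section is sketched below. The encoding of individual block metadata entries stays as it was; only the count header and the trailing checksum are new, and the function names are illustrative (here the checksum covers the metadata bytes only, not the count).

```rust
use anyhow::{bail, Result};
use bytes::{Buf, BufMut};

// write path: number of entries, the encoded metadata, then a crc32 over the
// metadata bytes that were just written
fn write_meta_section(buf: &mut Vec<u8>, num_entries: u32, encoded_meta: &[u8]) {
    buf.put_u32(num_entries);
    let meta_start = buf.len();
    buf.extend_from_slice(encoded_meta);
    buf.put_u32(crc32fast::hash(&buf[meta_start..]));
}

// read path: read the count first, then verify the checksum before decoding
// the individual entries
fn read_meta_section(mut section: &[u8]) -> Result<(u32, &[u8])> {
    let num_entries = section.get_u32();
    let (encoded_meta, mut raw_checksum) = section.split_at(section.len() - 4);
    if raw_checksum.get_u32() != crc32fast::hash(encoded_meta) {
        bail!("block meta checksum mismatched");
    }
    Ok((num_entries, encoded_meta))
}
```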

## Task 4: WAL Checksum

In this task, you will need to modify:

```
src/wal.rs
```

We will add a per-record checksum to the write-ahead log. To do this, you have two choices:

* Generate a buffer of the key-value record, and use `crc32fast::hash` to compute the checksum at once.
* Write one field at a time (e.g., key length, key slice), and use a `crc32fast::Hasher` to compute the checksum incrementally on each field.

The choice is up to you; you will need to *choose your own adventure*. The new WAL encoding should look like:

```
| key_len | key | value_len | value | checksum |
```
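
Below is a sketch of the second option, hashing each field as it is written. The `u16` length fields are an assumption; use whatever integer width your WAL encoding already uses.

```rust
use bytes::BufMut;

fn encode_wal_record(buf: &mut Vec<u8>, key: &[u8], value: &[u8]) {
    let mut hasher = crc32fast::Hasher::new();

    buf.put_u16(key.len() as u16);
    hasher.update(&(key.len() as u16).to_be_bytes());

    buf.put_slice(key);
    hasher.update(key);

    buf.put_u16(value.len() as u16);
    hasher.update(&(value.len() as u16).to_be_bytes());

    buf.put_slice(value);
    hasher.update(value);

    buf.put_u32(hasher.finalize()); // checksum covers every field of the record
}
```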

## Task 5: Manifest Checksum

Lastly, let us add a checksum to the manifest file. The manifest is similar to a WAL, except that previously we did not store the length of each record. To make the implementation easier, we now add a length header at the beginning of each record and a checksum at the end of it.

The new manifest format is like:

```
| len | JSON record | checksum | len | JSON record | checksum | len | JSON record | checksum |
```
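
A sketch of writing one record under this framing is below; whether the checksum also covers the length header is up to you, and here it covers only the JSON body. On recovery, you can read the length, slice out the JSON body, verify the checksum, and call `serde_json::from_slice` on the body instead of using the stream deserializer.

```rust
use std::io::Write;

use anyhow::Result;
use bytes::BufMut;
use serde::Serialize;

fn append_manifest_record<T: Serialize>(file: &mut std::fs::File, record: &T) -> Result<()> {
    let body = serde_json::to_vec(record)?;
    let mut buf = Vec::new();
    buf.put_u32(body.len() as u32);      // length header
    buf.extend_from_slice(&body);        // JSON record
    buf.put_u32(crc32fast::hash(&body)); // checksum of the JSON body
    file.write_all(&buf)?;
    file.sync_all()?;
    Ok(())
}
```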

After implementing everything, you should pass all previous test cases. We do not provide new test cases in this chapter.

## Test Your Understanding

* Consider the case where an LSM storage engine only provides `write_batch` as the write interface (instead of single put + delete). Is it possible to implement it as follows: there is a single write thread with an mpsc channel receiver to get the changes, and all other threads send write batches to the write thread. The write thread is the single point that writes to the database. What are the pros/cons of this implementation? (Congrats: if you do this, you get BadgerDB!)
@@ -28,6 +114,6 @@ We do not provide reference answers to the questions, and feel free to discuss a

## Bonus Tasks

* **Try Recovering**. If there is a checksum error, open the database in a safe mode so that no writes can be performed and non-corrupted data can still be retrieved.
* **Recovering on Corruption**. If there is a checksum error, open the database in a safe mode so that no writes can be performed and non-corrupted data can still be retrieved.

{{#include copyright.md}}
2 changes: 2 additions & 0 deletions mini-lsm-book/src/week3-02-snapshot-read-part-1.md
@@ -1 +1,3 @@
# Snapshot Read - Memtables and SSTs

During the refactor, you might need to change the signature of some functions from `&self` to `self: &Arc<Self>`.
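
A hedged, self-contained illustration of why that receiver is needed (the names below are not from the starter code): a method that must hand out an owned handle to the engine, for example when creating a transaction, needs `self: &Arc<Self>` so it can clone the `Arc` it is called through.

```rust
use std::sync::Arc;

struct Engine;

struct Txn {
    _engine: Arc<Engine>,
}

impl Engine {
    fn new_txn(self: &Arc<Self>) -> Txn {
        // an `&self` receiver could not produce this owned Arc
        Txn {
            _engine: Arc::clone(self),
        }
    }
}

fn main() {
    let engine = Arc::new(Engine);
    let _txn = engine.new_txn(); // method-call syntax still works through the Arc
}
```
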
2 changes: 1 addition & 1 deletion mini-lsm-mvcc/src/mvcc/txn.rs
@@ -101,7 +101,7 @@ impl Transaction {
let committed_txns = self.inner.mvcc().committed_txns.lock();
for (_, txn_data) in committed_txns.range(self.read_ts..) {
for key_hash in read_set {
if txn_data.key_hashes.contains(&key_hash) {
if txn_data.key_hashes.contains(key_hash) {
bail!("serializable check failed");
}
}
2 changes: 1 addition & 1 deletion mini-lsm-mvcc/src/table.rs
@@ -150,7 +150,7 @@ impl SsTable {
let raw_bloom_offset = file.read(len - 4, 4)?;
let bloom_offset = (&raw_bloom_offset[..]).get_u32() as u64;
let raw_bloom = file.read(bloom_offset, len - 4 - bloom_offset)?;
let bloom_filter = Bloom::decode(&raw_bloom);
let bloom_filter = Bloom::decode(&raw_bloom)?;
let raw_meta_offset = file.read(bloom_offset - 4, 4)?;
let block_meta_offset = (&raw_meta_offset[..]).get_u32() as u64;
let raw_meta = file.read(block_meta_offset, bloom_offset - 4 - block_meta_offset)?;
20 changes: 14 additions & 6 deletions mini-lsm-mvcc/src/table/bloom.rs
@@ -1,6 +1,7 @@
// Copyright 2021 TiKV Project Authors. Licensed under Apache-2.0.

use bytes::{BufMut, Bytes, BytesMut};
use anyhow::{bail, Result};
use bytes::{Buf, BufMut, Bytes, BytesMut};

/// Implements a bloom filter
pub struct Bloom {
@@ -45,19 +46,26 @@ impl<T: AsMut<[u8]>> BitSliceMut for T {

impl Bloom {
/// Decode a bloom filter
pub fn decode(buf: &[u8]) -> Self {
let filter = &buf[..buf.len() - 1];
let k = buf[buf.len() - 1];
Self {
pub fn decode(buf: &[u8]) -> Result<Self> {
let checksum = (&buf[buf.len() - 4..buf.len()]).get_u32();
if checksum != crc32fast::hash(&buf[..buf.len() - 4]) {
bail!("checksum mismatched for bloom filters");
}
let filter = &buf[..buf.len() - 5];
let k = buf[buf.len() - 5];
Ok(Self {
filter: filter.to_vec().into(),
k,
}
})
}

/// Encode a bloom filter
pub fn encode(&self, buf: &mut Vec<u8>) {
let offset = buf.len();
buf.extend(&self.filter);
buf.put_u8(self.k);
let checksum = crc32fast::hash(&buf[offset..]);
buf.put_u32(checksum);
}

/// Get bloom filter bits per key from entries count and FPR
15 changes: 8 additions & 7 deletions mini-lsm-mvcc/src/wal.rs
@@ -1,6 +1,6 @@
use std::fs::{File, OpenOptions};
use std::hash::Hasher;
use std::io::{Read, Write};
use std::io::{BufWriter, Read, Write};
use std::path::Path;
use std::sync::Arc;

@@ -12,20 +12,20 @@ use parking_lot::Mutex;
use crate::key::{KeyBytes, KeySlice};

pub struct Wal {
file: Arc<Mutex<File>>,
file: Arc<Mutex<BufWriter<File>>>,
}

impl Wal {
pub fn create(path: impl AsRef<Path>) -> Result<Self> {
Ok(Self {
file: Arc::new(Mutex::new(
file: Arc::new(Mutex::new(BufWriter::new(
OpenOptions::new()
.read(true)
.create_new(true)
.write(true)
.open(path)
.context("failed to create WAL")?,
)),
))),
})
}

@@ -60,7 +60,7 @@ impl Wal {
skiplist.insert(KeyBytes::from_bytes_with_ts(key, ts), value);
}
Ok(Self {
file: Arc::new(Mutex::new(file)),
file: Arc::new(Mutex::new(BufWriter::new(file))),
})
}

@@ -86,8 +86,9 @@ impl Wal {
}

pub fn sync(&self) -> Result<()> {
let file = self.file.lock();
file.sync_all()?;
let mut file = self.file.lock();
file.flush()?;
file.get_mut().sync_all()?;
Ok(())
}
}
7 changes: 4 additions & 3 deletions mini-lsm-starter/src/table/bloom.rs
@@ -1,5 +1,6 @@
// Copyright 2021 TiKV Project Authors. Licensed under Apache-2.0.

use anyhow::Result;
use bytes::{BufMut, Bytes, BytesMut};

/// Implements a bloom filter
@@ -45,13 +46,13 @@ impl<T: AsMut<[u8]>> BitSliceMut for T {

impl Bloom {
/// Decode a bloom filter
pub fn decode(buf: &[u8]) -> Self {
pub fn decode(buf: &[u8]) -> Result<Self> {
let filter = &buf[..buf.len() - 1];
let k = buf[buf.len() - 1];
Self {
Ok(Self {
filter: filter.to_vec().into(),
k,
}
})
}

/// Encode a bloom filter
3 changes: 2 additions & 1 deletion mini-lsm-starter/src/wal.rs
@@ -1,6 +1,7 @@
#![allow(dead_code)] // REMOVE THIS LINE after fully implementing this functionality

use std::fs::File;
use std::io::BufWriter;
use std::path::Path;
use std::sync::Arc;

@@ -10,7 +11,7 @@ use crossbeam_skiplist::SkipMap;
use parking_lot::Mutex;

pub struct Wal {
file: Arc<Mutex<File>>,
file: Arc<Mutex<BufWriter<File>>>,
}

impl Wal {
2 changes: 1 addition & 1 deletion mini-lsm/src/table.rs
@@ -146,7 +146,7 @@ impl SsTable {
let raw_bloom_offset = file.read(len - 4, 4)?;
let bloom_offset = (&raw_bloom_offset[..]).get_u32() as u64;
let raw_bloom = file.read(bloom_offset, len - 4 - bloom_offset)?;
let bloom_filter = Bloom::decode(&raw_bloom);
let bloom_filter = Bloom::decode(&raw_bloom)?;
let raw_meta_offset = file.read(bloom_offset - 4, 4)?;
let block_meta_offset = (&raw_meta_offset[..]).get_u32() as u64;
let raw_meta = file.read(block_meta_offset, bloom_offset - 4 - block_meta_offset)?;
20 changes: 14 additions & 6 deletions mini-lsm/src/table/bloom.rs
@@ -1,6 +1,7 @@
// Copyright 2021 TiKV Project Authors. Licensed under Apache-2.0.

use bytes::{BufMut, Bytes, BytesMut};
use anyhow::{bail, Result};
use bytes::{Buf, BufMut, Bytes, BytesMut};

/// Implements a bloom filter
pub struct Bloom {
@@ -45,19 +46,26 @@ impl<T: AsMut<[u8]>> BitSliceMut for T {

impl Bloom {
/// Decode a bloom filter
pub fn decode(buf: &[u8]) -> Self {
let filter = &buf[..buf.len() - 1];
let k = buf[buf.len() - 1];
Self {
pub fn decode(buf: &[u8]) -> Result<Self> {
let checksum = (&buf[buf.len() - 4..buf.len()]).get_u32();
if checksum != crc32fast::hash(&buf[..buf.len() - 4]) {
bail!("checksum mismatched for bloom filters");
}
let filter = &buf[..buf.len() - 5];
let k = buf[buf.len() - 5];
Ok(Self {
filter: filter.to_vec().into(),
k,
}
})
}

/// Encode a bloom filter
pub fn encode(&self, buf: &mut Vec<u8>) {
let offset = buf.len();
buf.extend(&self.filter);
buf.put_u8(self.k);
let checksum = crc32fast::hash(&buf[offset..]);
buf.put_u32(checksum);
}

/// Get bloom filter bits per key from entries count and FPR