Commit: finish 2.7

Signed-off-by: Alex Chi <iskyzh@gmail.com>

skyzh committed Jan 28, 2024
1 parent b964793 commit b4485f4
Showing 15 changed files with 165 additions and 36 deletions.
2 changes: 1 addition & 1 deletion mini-lsm-book/src/SUMMARY.md
@@ -20,7 +20,7 @@
- [Leveled Compaction Strategy](./week2-04-leveled.md)
- [Manifest](./week2-05-manifest.md)
- [Write-Ahead Log (WAL)](./week2-06-wal.md)
- [Snack Time: Batch Write and Checksums (WIP)](./week2-07-snacks.md)
- [Snack Time: Batch Write and Checksums](./week2-07-snacks.md)

- [Week 3 Overview: MVCC (WIP)](./week3-overview.md)
- [Timestamp Encoding + Refactor](./week3-01-ts-key-refactor.md)
9 changes: 9 additions & 0 deletions mini-lsm-book/src/week1-07-sst-optimizations.md
@@ -89,6 +89,15 @@ src/lsm_storage.rs

For the bloom filter encoding, you can append the bloom filter to the end of your SST file. You will need to store the bloom filter offset at the end of the file, and compute meta offsets accordingly.

```plaintext
-----------------------------------------------------------------------------------------------------
|         Block Section         |                           Meta Section                            |
-----------------------------------------------------------------------------------------------------
| data block | ... | data block | metadata | meta block offset | bloom filter | bloom filter offset |
|                               |  varlen  |        u32        |    varlen    |         u32         |
-----------------------------------------------------------------------------------------------------
```
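
For the read path, the sketch below shows how `SsTable::open` can locate the two footer fields by reading backwards from the end of the file; it mirrors the reference solution in this commit (`mini-lsm/src/table.rs`) and uses the tutorial's `FileObject::read(offset, len)` API, with `bytes::Buf` in scope for `get_u32`. Variable names are otherwise illustrative.

```rust
// Locate the bloom filter and the block metadata from the footer shown above.
let len = file.size();
// the last 4 bytes of the file are the bloom filter offset
let raw_bloom_offset = file.read(len - 4, 4)?;
let bloom_offset = (&raw_bloom_offset[..]).get_u32() as u64;
let raw_bloom = file.read(bloom_offset, len - 4 - bloom_offset)?;
// the 4 bytes right before the bloom filter are the meta block offset
let raw_meta_offset = file.read(bloom_offset - 4, 4)?;
let block_meta_offset = (&raw_meta_offset[..]).get_u32() as u64;
let raw_meta = file.read(block_meta_offset, bloom_offset - 4 - block_meta_offset)?;
```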

We use the `farmhash` crate to compute the hashes of the keys. When building the SST, you will also need to build the bloom filter by computing the key hashes using `farmhash::fingerprint32`. You will need to encode/decode the bloom filter along with the block meta. You can choose a false positive rate of 0.01 for your bloom filter. You may need to add new fields to the structures beyond the ones provided in the starter code, as necessary.
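
A minimal sketch of the build-side changes is below. It assumes your builder gains a `key_hashes: Vec<u32>` field (a new field, not in the starter code) and that your `Bloom` exposes the helpers from the provided `bloom.rs` (`bloom_bits_per_key` and `build_from_key_hashes`; check their exact names in your copy).

```rust
// Fragment: the two pieces you need, shown together for a handful of keys.
fn build_bloom_example() -> Bloom {
    let keys: Vec<&[u8]> = vec![b"key1".as_slice(), b"key2".as_slice(), b"key3".as_slice()];

    // in `SsTableBuilder::add`: remember the hash of every key
    let key_hashes: Vec<u32> = keys.iter().map(|key| farmhash::fingerprint32(key)).collect();

    // in `SsTableBuilder::build`: size the filter for a 1% false positive rate
    let bits_per_key = Bloom::bloom_bits_per_key(key_hashes.len(), 0.01);
    Bloom::build_from_key_hashes(&key_hashes, bits_per_key)
    // then encode the filter after the meta section and record its offset,
    // following the layout above
}
```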

After that, you can modify the `get` read path to filter SSTs based on bloom filters.
10 changes: 10 additions & 0 deletions mini-lsm-book/src/week2-05-manifest.md
@@ -21,8 +21,18 @@ src/manifest.rs

We encode the manifest records using JSON. You may use `serde_json::to_vec` to encode a manifest record to JSON, write it to the manifest file, and do an fsync. When you read from the manifest file, you may use `serde_json::Deserializer::from_slice`, which returns a stream of records. You do not need to store the record length, as `serde_json` can automatically find the boundaries between records.

The manifest format is like:

```
| JSON record | JSON record | JSON record | JSON record |
```

Again, note that we do not record the information of how many bytes each record has.
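
A minimal sketch of both paths is shown below, assuming a `ManifestRecord` enum that derives `Serialize`/`Deserialize`; the enum variants and function names here are illustrative, not the starter code's.

```rust
use std::io::{Read, Write};

use anyhow::Result;
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
enum ManifestRecord {
    Flush(usize),
    NewMemtable(usize),
}

fn add_record(file: &mut std::fs::File, record: &ManifestRecord) -> Result<()> {
    let buf = serde_json::to_vec(record)?; // one JSON object per record
    file.write_all(&buf)?;
    file.sync_all()?; // fsync so the record survives a crash
    Ok(())
}

fn recover(path: &std::path::Path) -> Result<Vec<ManifestRecord>> {
    let mut buf = Vec::new();
    std::fs::File::open(path)?.read_to_end(&mut buf)?;
    // serde_json finds the record boundaries by itself -- no length prefix needed
    let stream = serde_json::Deserializer::from_slice(&buf).into_iter::<ManifestRecord>();
    let mut records = Vec::new();
    for record in stream {
        records.push(record?);
    }
    Ok(records)
}
```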

After the engine runs for several hours, the manifest file might get very large. At that point, you may periodically compact the manifest file to store only the current snapshot and truncate the old records. This is an optimization you may implement as one of the bonus tasks.

## Task 2: Write Manifests

You can now go ahead and modify your LSM engine to write manifests when necessary. In this task, you will need to modify:
2 changes: 2 additions & 0 deletions mini-lsm-book/src/week2-06-wal.md
@@ -25,6 +25,8 @@ The WAL encoding is simply a list of key-value pairs.

You will also need to implement the `recover` function to read the WAL and recover the state of a memtable.

Note that we are using a `BufWriter` for writing the WAL. Using a `BufWriter` reduces the number of syscalls into the OS, which reduces the latency of the write path. The data is not guaranteed to be written to the disk when the user modifies a key. Instead, the engine only guarantees that the data is persisted when `sync` is called. To correctly persist the data to the disk, you will need to first flush the data from the buffer writer to the file object by calling `flush()`, and then do an fsync on the file using `get_mut().sync_all()`.
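
A minimal sketch of the resulting `sync` path is below; it mirrors the reference solution in this commit (`mini-lsm-mvcc/src/wal.rs`), with the struct definition shown only to make the snippet self-contained.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};
use std::sync::Arc;

use anyhow::Result;
use parking_lot::Mutex;

struct Wal {
    file: Arc<Mutex<BufWriter<File>>>,
}

impl Wal {
    fn sync(&self) -> Result<()> {
        let mut file = self.file.lock();
        file.flush()?;              // push buffered bytes into the OS
        file.get_mut().sync_all()?; // fsync: ask the OS to persist them to disk
        Ok(())
    }
}
```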

## Task 2: Integrate WALs

In this task, you will need to modify:
90 changes: 88 additions & 2 deletions mini-lsm-book/src/week2-07-snacks.md
@@ -9,16 +9,102 @@ In this chapter, you will:
* Implement the batch write interface.
* Add checksums to the blocks, SST metadata, manifest, and WALs.

**Note: We do not have unit tests for this chapter. As long as you pass all previous tests and ensure checksums are written to your files, it will be fine.**

## Task 1: Write Batch Interface

In this task, we will prepare for week 3 of this tutorial by adding a write batch API. You will need to modify:

```
src/lsm_storage.rs
```

The user provides `write_batch` with a batch of records to be written to the database. The records are `WriteBatchRecord<T: AsRef<[u8]>>`, so the keys and values can be `Bytes`, `&[u8]`, or `Vec<u8>`. There are two types of records: delete and put. You may handle them in the same way as your `put` and `delete` functions.

After that, you may refactor your original `put` and `delete` functions to call `write_batch`.
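
A sketch of the dispatch loop is below. It assumes `WriteBatchRecord` has variants like `Put(key, value)` and `Del(key)` (check the definition in your starter code); `put_inner` and `delete_inner` are hypothetical helpers that hold your existing single-key logic.

```rust
// Fragment of the `impl` block for the storage engine.
pub fn write_batch<T: AsRef<[u8]>>(&self, batch: &[WriteBatchRecord<T>]) -> Result<()> {
    for record in batch {
        match record {
            // same logic as your current `put`
            WriteBatchRecord::Put(key, value) => self.put_inner(key.as_ref(), value.as_ref())?,
            // same logic as your current `delete` (write an empty value)
            WriteBatchRecord::Del(key) => self.delete_inner(key.as_ref())?,
        }
    }
    Ok(())
}
```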

You should pass all test cases in previous chapters after implementing this functionality.

## Task 2: Block Checksum

In this task, you will need to add a block checksum at the end of each block when encoding the SST. You will need to modify:

```
src/table/builder.rs
src/table.rs
```

The format of the SST will be changed to:

```plaintext
---------------------------------------------------------------------------------------------------------------------------
|                    Block Section                    |                           Meta Section                            |
---------------------------------------------------------------------------------------------------------------------------
| data block | checksum | ... | data block | checksum | metadata | meta block offset | bloom filter | bloom filter offset |
|   varlen   |   u32    |     |   varlen   |   u32    |  varlen  |        u32        |    varlen    |         u32         |
---------------------------------------------------------------------------------------------------------------------------
```

We use crc32 as our checksum algorithm. You can use `crc32fast::hash` to generate the checksum for the block after building a block.

Usually, when the user specifies a target block size in the storage options, that size should include both the block content and the checksum. For example, if the target block size is 4096 bytes and the checksum takes 4 bytes, the actual target size of the block content should be 4092 bytes. However, to avoid breaking previous test cases and for simplicity, this tutorial will **still** use the target block size as the target content size, and simply append the checksum at the end of the block.

When you read a block in `read_block`, you should verify the checksum and generate the slices for the block content accordingly. You should pass all test cases in previous chapters after implementing this functionality.
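
A hedged sketch of both halves is below; the function names are illustrative. You would wire the write half into `SsTableBuilder` and the read half into `read_block`.

```rust
use anyhow::{bail, Result};
use bytes::{Buf, BufMut};

// write path: append the encoded block followed by a crc32 of its contents
fn append_block_with_checksum(sst_data: &mut Vec<u8>, encoded_block: &[u8]) {
    sst_data.extend_from_slice(encoded_block);
    sst_data.put_u32(crc32fast::hash(encoded_block));
}

// read path: split off the trailing 4-byte checksum and verify it before
// decoding the block content
fn verify_block_checksum(block_with_checksum: &[u8]) -> Result<&[u8]> {
    let (block_data, mut raw_checksum) =
        block_with_checksum.split_at(block_with_checksum.len() - 4);
    if raw_checksum.get_u32() != crc32fast::hash(block_data) {
        bail!("block checksum mismatched");
    }
    Ok(block_data)
}
```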

## Task 3: SST Meta Checksum

In this task, you will need to add checksums for the bloom filter and the block metadata. You will need to modify:

```
src/table/builder.rs
src/table.rs
src/bloom.rs
```

```plaintext
----------------------------------------------------------------------------------------------------------
|                                              Meta Section                                               |
----------------------------------------------------------------------------------------------------------
| no. of block | metadata | checksum | meta block offset | bloom filter | checksum | bloom filter offset |
|     u32      |  varlen  |   u32    |        u32        |    varlen    |   u32    |         u32         |
----------------------------------------------------------------------------------------------------------
```

You will need to add a checksum at the end of the bloom filter in `Bloom::encode` and `Bloom::decode`. Note that most of our APIs take an existing buffer that the implementation will write into, for example, `Bloom::encode`. Therefore, you should record the offset of the beginning of the bloom filter before writing the encoded content, and only checksum the bloom filter itself instead of the whole buffer.

After that, you can add a checksum at the end of the block metadata. You might find it helpful to also store the number of block metadata entries at the beginning of the section, so that it is easier to know where to stop when decoding them.
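
One possible framing of the meta section is sketched below. The encoding of individual block metadata entries stays as it was; only the count header and the trailing checksum are new, and the function names are illustrative (here the checksum covers the metadata bytes only, not the count).

```rust
use anyhow::{bail, Result};
use bytes::{Buf, BufMut};

// write path: number of entries, the encoded metadata, then a crc32 over the
// metadata bytes that were just written
fn write_meta_section(buf: &mut Vec<u8>, num_entries: u32, encoded_meta: &[u8]) {
    buf.put_u32(num_entries);
    let meta_start = buf.len();
    buf.extend_from_slice(encoded_meta);
    buf.put_u32(crc32fast::hash(&buf[meta_start..]));
}

// read path: read the count first, then verify the checksum before decoding
// the individual entries
fn read_meta_section(mut section: &[u8]) -> Result<(u32, &[u8])> {
    let num_entries = section.get_u32();
    let (encoded_meta, mut raw_checksum) = section.split_at(section.len() - 4);
    if raw_checksum.get_u32() != crc32fast::hash(encoded_meta) {
        bail!("block meta checksum mismatched");
    }
    Ok((num_entries, encoded_meta))
}
```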

## Task 4: WAL Checksum

In this task, you will need to modify:

```
src/wal.rs
```

We will add a per-record checksum to the write-ahead log. To do this, you have two choices:

* Generate a buffer of the key-value record, and use `crc32fast::hash` to compute the checksum at once.
* Write one field at a time (e.g., key length, key slice), and use a `crc32fast::Hasher` to compute the checksum incrementally on each field.

The choice is up to you; you will need to *choose your own adventure*. The new WAL encoding should look like:

```
| key_len | key | value_len | value | checksum |
```
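
Below is a sketch of the second option, hashing each field as it is written. The `u16` length fields are an assumption; use whatever integer width your WAL encoding already uses.

```rust
use bytes::BufMut;

fn encode_wal_record(buf: &mut Vec<u8>, key: &[u8], value: &[u8]) {
    let mut hasher = crc32fast::Hasher::new();

    buf.put_u16(key.len() as u16);
    hasher.update(&(key.len() as u16).to_be_bytes());

    buf.put_slice(key);
    hasher.update(key);

    buf.put_u16(value.len() as u16);
    hasher.update(&(value.len() as u16).to_be_bytes());

    buf.put_slice(value);
    hasher.update(value);

    buf.put_u32(hasher.finalize()); // checksum covers every field of the record
}
```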

## Task 5: Manifest Checksum

Lastly, let us add a checksum to the manifest file. The manifest is similar to a WAL, except that previously we did not store the length of each record. To make the implementation easier, we now add a length header at the beginning of each record and a checksum at the end of it.

The new manifest format is like:

```
| len | JSON record | checksum | len | JSON record | checksum | len | JSON record | checksum |
```
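
A sketch of writing one record under this framing is below; whether the checksum also covers the length header is up to you, and here it covers only the JSON body. On recovery, you can read the length, slice out the JSON body, verify the checksum, and call `serde_json::from_slice` on the body instead of using the stream deserializer.

```rust
use std::io::Write;

use anyhow::Result;
use bytes::BufMut;
use serde::Serialize;

fn append_manifest_record<T: Serialize>(file: &mut std::fs::File, record: &T) -> Result<()> {
    let body = serde_json::to_vec(record)?;
    let mut buf = Vec::new();
    buf.put_u32(body.len() as u32);      // length header
    buf.extend_from_slice(&body);        // JSON record
    buf.put_u32(crc32fast::hash(&body)); // checksum of the JSON body
    file.write_all(&buf)?;
    file.sync_all()?;
    Ok(())
}
```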

After implementing everything, you should pass all previous test cases. We do not provide new test cases in this chapter.

## Test Your Understanding

* Consider the case where an LSM storage engine only provides `write_batch` as the write interface (instead of single put + delete). Is it possible to implement it as follows: there is a single write thread with an mpsc channel receiver to get the changes, and all other threads send write batches to the write thread. The write thread is the single point that writes to the database. What are the pros/cons of this implementation? (Congrats: if you do this, you get BadgerDB!)
@@ -28,6 +114,6 @@ We do not provide reference answers to the questions, and feel free to discuss a

## Bonus Tasks

* **Try Recovering**. If there is a checksum error, open the database in a safe mode so that no writes can be performed and non-corrupted data can still be retrieved.
* **Recovering on Corruption**. If there is a checksum error, open the database in a safe mode so that no writes can be performed and non-corrupted data can still be retrieved.

{{#include copyright.md}}
2 changes: 2 additions & 0 deletions mini-lsm-book/src/week3-02-snapshot-read-part-1.md
@@ -1 +1,3 @@
# Snapshot Read - Memtables and SSTs

During the refactor, you might need to change the signature of some functions from `&self` to `self: &Arc<Self>`.
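
A hedged, self-contained illustration of why that receiver is needed (the names below are not from the starter code): a method that must hand out an owned handle to the engine, for example when creating a transaction, needs `self: &Arc<Self>` so it can clone the `Arc` it is called through.

```rust
use std::sync::Arc;

struct Engine;

struct Txn {
    _engine: Arc<Engine>,
}

impl Engine {
    fn new_txn(self: &Arc<Self>) -> Txn {
        // an `&self` receiver could not produce this owned Arc
        Txn {
            _engine: Arc::clone(self),
        }
    }
}

fn main() {
    let engine = Arc::new(Engine);
    let _txn = engine.new_txn(); // method-call syntax still works through the Arc
}
```
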
2 changes: 1 addition & 1 deletion mini-lsm-mvcc/src/mvcc/txn.rs
@@ -101,7 +101,7 @@ impl Transaction {
let committed_txns = self.inner.mvcc().committed_txns.lock();
for (_, txn_data) in committed_txns.range(self.read_ts..) {
for key_hash in read_set {
if txn_data.key_hashes.contains(&key_hash) {
if txn_data.key_hashes.contains(key_hash) {
bail!("serializable check failed");
}
}
2 changes: 1 addition & 1 deletion mini-lsm-mvcc/src/table.rs
@@ -150,7 +150,7 @@ impl SsTable {
let raw_bloom_offset = file.read(len - 4, 4)?;
let bloom_offset = (&raw_bloom_offset[..]).get_u32() as u64;
let raw_bloom = file.read(bloom_offset, len - 4 - bloom_offset)?;
let bloom_filter = Bloom::decode(&raw_bloom);
let bloom_filter = Bloom::decode(&raw_bloom)?;
let raw_meta_offset = file.read(bloom_offset - 4, 4)?;
let block_meta_offset = (&raw_meta_offset[..]).get_u32() as u64;
let raw_meta = file.read(block_meta_offset, bloom_offset - 4 - block_meta_offset)?;
20 changes: 14 additions & 6 deletions mini-lsm-mvcc/src/table/bloom.rs
@@ -1,6 +1,7 @@
// Copyright 2021 TiKV Project Authors. Licensed under Apache-2.0.

use bytes::{BufMut, Bytes, BytesMut};
use anyhow::{bail, Result};
use bytes::{Buf, BufMut, Bytes, BytesMut};

/// Implements a bloom filter
pub struct Bloom {
@@ -45,19 +46,26 @@ impl<T: AsMut<[u8]>> BitSliceMut for T {

impl Bloom {
/// Decode a bloom filter
pub fn decode(buf: &[u8]) -> Self {
let filter = &buf[..buf.len() - 1];
let k = buf[buf.len() - 1];
Self {
pub fn decode(buf: &[u8]) -> Result<Self> {
let checksum = (&buf[buf.len() - 4..buf.len()]).get_u32();
if checksum != crc32fast::hash(&buf[..buf.len() - 4]) {
bail!("checksum mismatched for bloom filters");
}
let filter = &buf[..buf.len() - 5];
let k = buf[buf.len() - 5];
Ok(Self {
filter: filter.to_vec().into(),
k,
}
})
}

/// Encode a bloom filter
pub fn encode(&self, buf: &mut Vec<u8>) {
let offset = buf.len();
buf.extend(&self.filter);
buf.put_u8(self.k);
let checksum = crc32fast::hash(&buf[offset..]);
buf.put_u32(checksum);
}

/// Get bloom filter bits per key from entries count and FPR
15 changes: 8 additions & 7 deletions mini-lsm-mvcc/src/wal.rs
@@ -1,6 +1,6 @@
use std::fs::{File, OpenOptions};
use std::hash::Hasher;
use std::io::{Read, Write};
use std::io::{BufWriter, Read, Write};
use std::path::Path;
use std::sync::Arc;

@@ -12,20 +12,20 @@ use parking_lot::Mutex;
use crate::key::{KeyBytes, KeySlice};

pub struct Wal {
file: Arc<Mutex<File>>,
file: Arc<Mutex<BufWriter<File>>>,
}

impl Wal {
pub fn create(path: impl AsRef<Path>) -> Result<Self> {
Ok(Self {
file: Arc::new(Mutex::new(
file: Arc::new(Mutex::new(BufWriter::new(
OpenOptions::new()
.read(true)
.create_new(true)
.write(true)
.open(path)
.context("failed to create WAL")?,
)),
))),
})
}

@@ -60,7 +60,7 @@ impl Wal {
skiplist.insert(KeyBytes::from_bytes_with_ts(key, ts), value);
}
Ok(Self {
file: Arc::new(Mutex::new(file)),
file: Arc::new(Mutex::new(BufWriter::new(file))),
})
}

@@ -86,8 +86,9 @@ impl Wal {
}

pub fn sync(&self) -> Result<()> {
let file = self.file.lock();
file.sync_all()?;
let mut file = self.file.lock();
file.flush()?;
file.get_mut().sync_all()?;
Ok(())
}
}
7 changes: 4 additions & 3 deletions mini-lsm-starter/src/table/bloom.rs
@@ -1,5 +1,6 @@
// Copyright 2021 TiKV Project Authors. Licensed under Apache-2.0.

use anyhow::Result;
use bytes::{BufMut, Bytes, BytesMut};

/// Implements a bloom filter
@@ -45,13 +46,13 @@ impl<T: AsMut<[u8]>> BitSliceMut for T {

impl Bloom {
/// Decode a bloom filter
pub fn decode(buf: &[u8]) -> Self {
pub fn decode(buf: &[u8]) -> Result<Self> {
let filter = &buf[..buf.len() - 1];
let k = buf[buf.len() - 1];
Self {
Ok(Self {
filter: filter.to_vec().into(),
k,
}
})
}

/// Encode a bloom filter
3 changes: 2 additions & 1 deletion mini-lsm-starter/src/wal.rs
@@ -1,6 +1,7 @@
#![allow(dead_code)] // REMOVE THIS LINE after fully implementing this functionality

use std::fs::File;
use std::io::BufWriter;
use std::path::Path;
use std::sync::Arc;

@@ -10,7 +11,7 @@ use crossbeam_skiplist::SkipMap;
use parking_lot::Mutex;

pub struct Wal {
file: Arc<Mutex<File>>,
file: Arc<Mutex<BufWriter<File>>>,
}

impl Wal {
2 changes: 1 addition & 1 deletion mini-lsm/src/table.rs
@@ -146,7 +146,7 @@ impl SsTable {
let raw_bloom_offset = file.read(len - 4, 4)?;
let bloom_offset = (&raw_bloom_offset[..]).get_u32() as u64;
let raw_bloom = file.read(bloom_offset, len - 4 - bloom_offset)?;
let bloom_filter = Bloom::decode(&raw_bloom);
let bloom_filter = Bloom::decode(&raw_bloom)?;
let raw_meta_offset = file.read(bloom_offset - 4, 4)?;
let block_meta_offset = (&raw_meta_offset[..]).get_u32() as u64;
let raw_meta = file.read(block_meta_offset, bloom_offset - 4 - block_meta_offset)?;
20 changes: 14 additions & 6 deletions mini-lsm/src/table/bloom.rs
@@ -1,6 +1,7 @@
// Copyright 2021 TiKV Project Authors. Licensed under Apache-2.0.

use bytes::{BufMut, Bytes, BytesMut};
use anyhow::{bail, Result};
use bytes::{Buf, BufMut, Bytes, BytesMut};

/// Implements a bloom filter
pub struct Bloom {
@@ -45,19 +46,26 @@ impl<T: AsMut<[u8]>> BitSliceMut for T {

impl Bloom {
/// Decode a bloom filter
pub fn decode(buf: &[u8]) -> Self {
let filter = &buf[..buf.len() - 1];
let k = buf[buf.len() - 1];
Self {
pub fn decode(buf: &[u8]) -> Result<Self> {
let checksum = (&buf[buf.len() - 4..buf.len()]).get_u32();
if checksum != crc32fast::hash(&buf[..buf.len() - 4]) {
bail!("checksum mismatched for bloom filters");
}
let filter = &buf[..buf.len() - 5];
let k = buf[buf.len() - 5];
Ok(Self {
filter: filter.to_vec().into(),
k,
}
})
}

/// Encode a bloom filter
pub fn encode(&self, buf: &mut Vec<u8>) {
let offset = buf.len();
buf.extend(&self.filter);
buf.put_u8(self.k);
let checksum = crc32fast::hash(&buf[offset..]);
buf.put_u32(checksum);
}

/// Get bloom filter bits per key from entries count and FPR