book: more questions on multi-tenant systems (#87)
Signed-off-by: Alex Chi Z <chi@neon.tech>
skyzh committed Jul 19, 2024
1 parent dd333ca commit 42b94bd
Showing 6 changed files with 22 additions and 4 deletions.
13 changes: 13 additions & 0 deletions mini-lsm-book/src/week1-02-merge-iterator.md
@@ -84,6 +84,19 @@ a->1, b->del, c->4, d->5, e->4

The constructor of the merge iterator takes a vector of iterators. We assume the one with a lower index (i.e., the first one) has the latest data.
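
One way to implement this is to keep the iterators in a binary heap ordered by (current key, iterator index), so that the smallest key comes out first and a tie on the same key is won by the iterator with the lower index (i.e., the latest data). The snippet below is only a toy illustration of this tie-breaking behavior using plain tuples and `Reverse`; it does not use the actual skeleton types.

```rust,no_run
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Toy heap entries: (current key, iterator index). `Reverse` turns the max-heap
// into a min-heap, so the smallest key is popped first; on equal keys, the lower
// index (the iterator with the latest data) wins.
let mut heap = BinaryHeap::new();
heap.push(Reverse((b"b".to_vec(), 1usize)));
heap.push(Reverse((b"a".to_vec(), 2usize)));
heap.push(Reverse((b"a".to_vec(), 0usize)));
let Reverse((key, index)) = heap.pop().unwrap();
assert_eq!(key, b"a"); // smallest key first...
assert_eq!(index, 0); // ...and the lowest index wins the tie
```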

When using the Rust binary heap, you may find the `peek_mut` function useful.

```rust,no_run
use std::collections::binary_heap::PeekMut;

if let Some(mut inner) = heap.peek_mut() {
    *inner += 1; // <- do some modifications to the inner item
}
// When the PeekMut reference gets dropped, the binary heap gets reordered automatically.
if let Some(inner) = heap.peek_mut() {
    PeekMut::pop(inner); // <- pop it out from the heap
}
```

One common pitfall is error handling. For example,

```rust,no_run
1 change: 1 addition & 0 deletions mini-lsm-book/src/week1-04-sst.md
@@ -94,6 +94,7 @@ At this point, you may change your table iterator to use `read_block_cached` ins
* Does using a block cache guarantee that there will be at most a fixed number of blocks in memory? For example, if you have a 4GB `moka` block cache and a block size of 4KB, will there ever be more than 4GB/4KB blocks in memory at the same time?
* Is it possible to store columnar data (i.e., a table of 100 integer columns) in an LSM engine? Is the current SST format still a good choice?
* Consider the case where the LSM engine is built on object store services (e.g., S3). How would you optimize/change the SST format/parameters and the block cache to make it suitable for such services?
* For now, we load the indexes of all SSTs into memory. Assuming you have 16GB of memory reserved for the indexes, can you estimate the maximum size of the database your LSM system can support? (That's why you need an index cache!)

We do not provide reference answers to the questions; feel free to discuss them in the Discord community.

4 changes: 3 additions & 1 deletion mini-lsm-book/src/week1-06-write-path.md
@@ -120,12 +120,14 @@ You can implement helper functions like `range_overlap` and `key_within` to simp
* What happens if a user requests to delete a key twice?
* How much memory will be used (or how many blocks will be loaded) at the same time when the iterator is initialized?
* Some crazy users want to *fork* their LSM tree. They want to start the engine, ingest some data, and then fork it, so that they get two identical datasets and can operate on them separately. An easy but inefficient way to implement this is to simply copy all SSTs and the in-memory structures to a new directory and start the engine. However, note that we never modify the on-disk files, so we can actually reuse the SST files from the parent engine. How do you think you can implement this fork functionality efficiently without copying data? (Check out [Neon Branching](https://neon.tech/docs/introduction/branching)).
* Imagine you are building a multi-tenant LSM system where you host 10k databases on a single machine with 128GB of memory. The memtable size limit is set to 256MB. How much memory would you need for memtables in this setup?
* Obviously, you don't have enough memory for all these memtables. Assuming each user still has their own memtable, how can you design the memtable flush policy to make this work? Does it make sense to have all these users share the same memtable (i.e., by encoding a tenant ID as the key prefix)?

We do not provide reference answers to the questions; feel free to discuss them in the Discord community.

## Bonus Tasks

* **Implement Write Stall.** When the number of memtables exceeds the maximum number by too much, you can stop users from writing to the storage engine. You may also implement a write stall for L0 tables in week 2 after you have implemented compactions.
* **Implement Write/L0 Stall.** When the number of memtables exceeds the maximum number by too much, you can stop users from writing to the storage engine. You may also implement a write stall for L0 tables in week 2 after you have implemented compactions.
* **Prefix Scan.** You may filter more SSTs by implementing the prefix scan interface and using the prefix information.

{{#include copyright.md}}
2 changes: 1 addition & 1 deletion mini-lsm-book/src/week2-03-tiered.md
@@ -133,7 +133,7 @@ As tiered compaction does not use the L0 level of the LSM state, you should dire
* What are the pros/cons of universal compaction compared with simple leveled/tiered compaction?
* How much storage space is required (compared with the user data size) to run universal compaction?
* Can we merge two tiers that are not adjacent in the LSM state?
* What happens if compaction speed cannot keep up with the SST flushes?
* What happens if compaction speed cannot keep up with the SST flushes for tiered compaction?
* What might need to be considered if the system schedules multiple compaction tasks in parallel?
* SSDs also write their own logs (an SSD is basically log-structured storage). If the SSD has a write amplification of 2x, what is the end-to-end write amplification of the whole system? Related: [ZNS: Avoiding the Block Interface Tax for Flash-based SSDs](https://www.usenix.org/conference/atc21/presentation/bjorling).
* Consider the case where the user chooses to keep a large number of sorted runs (e.g., 300) for tiered compaction. To make the read path faster, is it a good idea to keep some data structure that helps reduce the time complexity (e.g., to `O(log n)`) of finding the SSTs to read in each layer for some key ranges? Note that normally, you will need to do a binary search in each sorted run to find the key ranges that you need to read. (Check out Neon's [layer map](https://neon.tech/blog/persistent-structures-in-neons-wal-indexing) implementation!)
2 changes: 1 addition & 1 deletion mini-lsm-book/src/week2-04-leveled.md
@@ -172,7 +172,7 @@ The implementation should be similar to simple leveled compaction. Remember to c
* Could finding a good key split point for compaction potentially reduce the write amplification, or does it not matter at all? (Consider the case where the user writes keys beginning with two prefixes, `00` and `01`. The two prefixes have different numbers of keys and different write patterns. What if we could always split `00` and `01` into different SSTs...)
* Imagine that a user was using tiered (universal) compaction before and wants to migrate to leveled compaction. What might be the challenges of this migration? And how would you do the migration?
* And in the reverse direction: what if the user wants to migrate from leveled compaction to tiered compaction?
* What happens if compaction speed cannot keep up with the SST flushes?
* What happens if compaction speed cannot keep up with the SST flushes for leveled compaction?
* What might need to be considered if the system schedules multiple compaction tasks in parallel?
* What is the peak storage usage for leveled compaction? Compared with universal compaction?
* Is it true that with a lower `level_size_multiplier`, you can always get a lower write amplification?
4 changes: 3 additions & 1 deletion mini-lsm-book/src/week2-06-wal.md
@@ -32,7 +32,7 @@ The WAL encoding is simply a list of key-value pairs.

You will also need to implement the `recover` function to read the WAL and recover the state of a memtable.

Note that we are using a `BufWriter` for writing the WAL. Using a `BufWriter` can reduce the number of syscalls into the OS and thus reduce the latency of the write path. The data is not guaranteed to be written to the disk when the user modifies a key. Instead, the engine only guarantees that the data is persisted when `sync` is called. To correctly persist the data to the disk, you will need to first flush the data from the buffer writer to the file object by calling `flush()`, and then do an fsync on the file by using `get_mut().sync_all()`.
Note that we are using a `BufWriter` for writing the WAL. Using a `BufWriter` can reduce the number of syscalls into the OS and thus reduce the latency of the write path. The data is not guaranteed to be written to the disk when the user modifies a key. Instead, the engine only guarantees that the data is persisted when `sync` is called. To correctly persist the data to the disk, you will need to first flush the data from the buffer writer to the file object by calling `flush()`, and then do an fsync on the file by using `get_mut().sync_all()`. Note that you *only* need to fsync when the engine's `sync` gets called. You *do not* need to fsync on every write.
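
Below is a minimal sketch of what the `sync` path could look like, assuming the WAL keeps its file behind a `Mutex<BufWriter<File>>` (the struct and field names here are illustrative and may differ from the actual skeleton):

```rust,no_run
use std::fs::File;
use std::io::{BufWriter, Write};
use std::sync::Mutex;

struct Wal {
    file: Mutex<BufWriter<File>>,
}

impl Wal {
    fn sync(&self) -> std::io::Result<()> {
        let mut file = self.file.lock().unwrap();
        // Move any bytes buffered by the BufWriter into the OS page cache...
        file.flush()?;
        // ...then ask the OS to durably persist them to the disk (fsync).
        file.get_mut().sync_all()?;
        Ok(())
    }
}
```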

## Task 2: Integrate WALs

@@ -66,6 +66,8 @@ Remember to recover the correct `next_sst_id` from the state, which should be `m

## Test Your Understanding

* When should you call `fsync` in your engine? What happens if you call `fsync` too often (e.g., on every put request)?
* How costly is the `fsync` operation in general on an SSD (solid state drive)?
* When can you tell the user that their modifications (put/delete) have been persisted?
* How can you handle corrupted data in WAL?

