diff --git a/mini-lsm-book/src/week1-05-read-path.md b/mini-lsm-book/src/week1-05-read-path.md index dc5d7190..973e5f0f 100644 --- a/mini-lsm-book/src/week1-05-read-path.md +++ b/mini-lsm-book/src/week1-05-read-path.md @@ -78,6 +78,7 @@ For get requests, it will be processed as lookups in the memtables, and then sca ## Test Your Understanding * Consider the case that a user has an iterator that iterates the whole storage engine, and the storage engine is 1TB large, so that it takes ~1 hour to scan all the data. What would be the problems if the user does so? (This is a good question and we will ask it several times at different points of the tutorial...) +* Another popular interface provided by some LSM-tree storage engines is multi-get (or vectored get). The user can pass a list of keys that they want to retrieve. The interface returns the value of each of the key. For example, `multi_get(vec!["a", "b", "c", "d"]) -> a=1,b=2,c=3,d=4`. Obviously, an easy implementation is to simply doing a single get for each of the key. How will you implement the multi-get interface, and what optimizations you can do to make it more efficient? (Hint: some operations during the get process will only need to be done once for all keys, and besides that, you can think of an improved disk I/O interface to better support this multi-get interface). We do not provide reference answers to the questions, and feel free to discuss about them in the Discord community. diff --git a/mini-lsm-book/src/week1-06-write-path.md b/mini-lsm-book/src/week1-06-write-path.md index 76f88877..a9a80032 100644 --- a/mini-lsm-book/src/week1-06-write-path.md +++ b/mini-lsm-book/src/week1-06-write-path.md @@ -119,6 +119,7 @@ You can implement helper functions like `range_overlap` and `key_within` to simp * What happens if a user requests to delete a key twice? * How much memory (or number of blocks) will be loaded into memory at the same time when the iterator is initialized? +* Some crazy users want to *fork* their LSM tree. They want to start the engine to ingest some data, and then fork it, so that they get two identical dataset and then operate on them separately. An easy but not efficient way to implement is to simply copy all SSTs and the in-memory structures to a new directory and start the engine. However, note that we never modify the on-disk files, and we can actually reuse the SST files from the parent engine. How do you think you can implement this fork functionality efficiently without copying data? (Check out [Neon Branching](https://neon.tech/docs/introduction/branching)). We do not provide reference answers to the questions, and feel free to discuss about them in the Discord community. diff --git a/mini-lsm-book/src/week2-05-manifest.md b/mini-lsm-book/src/week2-05-manifest.md index cd55deee..a4c4b7ac 100644 --- a/mini-lsm-book/src/week2-05-manifest.md +++ b/mini-lsm-book/src/week2-05-manifest.md @@ -90,6 +90,8 @@ get 1500 * When do you need to call `fsync`? Why do you need to fsync the directory? * What are the places you will need to write to the manifest? * Consider an alternative implementation of an LSM engine that does not use a manifest file. Instead, it records the level/tier information in the header of each file, scans the storage directory every time it restarts, and recover the LSM state solely from the files present in the directory. Is it possible to correctly maintain the LSM state in this implementation and what might be the problems/challenges with that? +* Currently, we create all SST/concat iterators before creating the merge iterator, which means that we have to load the first block of the first SST in all levels into memory before starting the scanning process. We have start/end key in the manifest, and is it possible to leverage this information to delay the loading of the data blocks and make the time to return the first key-value pair faster? +* Is it possible not to store the tier/level information in the manifest? i.e., we only store the list of SSTs we have in the manifest without the level information, and rebuild the tier/level using the key range and timestamp information (SST metadata). ## Bonus Tasks diff --git a/mini-lsm-book/src/week3-04-watermark.md b/mini-lsm-book/src/week3-04-watermark.md index cc95c615..3092147d 100644 --- a/mini-lsm-book/src/week3-04-watermark.md +++ b/mini-lsm-book/src/week3-04-watermark.md @@ -76,6 +76,7 @@ Assume these are all keys in the engine. If we do a scan at ts=3, we will get `a * Why do we need to store an `Arc` of `Transaction` inside a transaction iterator? * What is the condition to fully remove a key from the SST file? * For now, we only remove a key when compacting to the bottom-most level. Is there any other prior time that we can remove the key? (Hint: you know the start/end key of each SST in all levels.) +* Consider the case that the user creates a long-running transaction and we could not garbage collect anything. The user keeps updating a single key. Eventually, there could be a key with thousands of versions in a single SST file. How would it affect performance, and how would you deal with it? ## Bonus Tasks