Commit

docs: add compaction tradeoff figure
Signed-off-by: Alex Chi <iskyzh@gmail.com>
skyzh committed Mar 13, 2024
1 parent cb55a7f commit f840dc5
Showing 2 changed files with 94 additions and 0 deletions.
92 changes: 92 additions & 0 deletions mini-lsm-book/src/lsm-tutorial/week2-00-triangle.svg
2 changes: 2 additions & 0 deletions mini-lsm-book/src/week2-overview.md
@@ -57,6 +57,8 @@ The ratio of memtables flushed to the disk versus total data written to the disk

A good compaction strategy balances read amplification, write amplification, and space amplification (we will talk about them soon). In a general-purpose LSM storage engine, it is generally impossible to find a strategy that achieves the lowest amplification in all three of these factors, unless there is some specific data pattern that the engine can exploit. The good thing about LSM is that we can theoretically analyze the amplifications of a compaction strategy, and all these things happen in the background. We can choose compaction strategies and dynamically change some of their parameters to move our storage engine toward the optimal state. Compaction strategies are all about tradeoffs, and an LSM-based storage engine lets us choose what to trade off at runtime.

![compaction tradeoffs](./lsm-tutorial/week2-00-triangle.svg)
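
To make the amplification factors concrete, here is a minimal sketch of how two of them could be computed from I/O counters. The `IoStats` struct and its field names are hypothetical and are not part of mini-lsm. Write amplification follows the ratio described above (total data written to the disk over memtables flushed to the disk); the space amplification formula is one common definition and is an assumption here.

```rust
/// Hypothetical I/O counters for an LSM engine (illustrative names only,
/// not taken from mini-lsm).
#[derive(Debug, Clone, Copy)]
struct IoStats {
    /// Bytes of memtables flushed to disk.
    bytes_flushed: u64,
    /// Bytes rewritten by background compaction.
    bytes_compacted: u64,
    /// Bytes the engine currently occupies on disk.
    bytes_on_disk: u64,
    /// Bytes of live user data (latest version of each key).
    bytes_live: u64,
}

impl IoStats {
    /// Total data written to disk divided by data flushed from memtables.
    /// The result is not finite when nothing has been flushed yet.
    fn write_amplification(&self) -> f64 {
        (self.bytes_flushed + self.bytes_compacted) as f64 / self.bytes_flushed as f64
    }

    /// Space used on disk divided by the size of live user data
    /// (assumed definition).
    fn space_amplification(&self) -> f64 {
        self.bytes_on_disk as f64 / self.bytes_live as f64
    }
}
```

For example, with 100 bytes flushed and 300 bytes rewritten by compaction, `write_amplification()` is 4.0. Lowering that number usually means compacting less, which tends to raise read or space amplification.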

One typical workload in the industry looks like this: when launching a product, the user first batch-ingests data into the storage engine, usually at gigabytes per second. Then the system goes live and users start running small transactions on it. In the first phase, the engine should ingest data quickly, so we can use a compaction strategy that minimizes write amplification to accelerate the process. Then we adjust the parameters of the compaction algorithm to optimize for read amplification and run a full compaction to reorder the existing data, so that the system runs stably once it goes live.
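
The two-phase workflow above can be sketched as a small decision function. All names here (`Phase`, `Goal`, `Plan`, `plan_transition`) are hypothetical and only illustrate the idea; they are not mini-lsm APIs, and a real engine would map each `Goal` to concrete compaction parameters.

```rust
/// Workload phases from the example (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    BulkIngest,
    Live,
}

/// Which amplification factor the compaction settings should favor.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Goal {
    LowWriteAmplification,
    LowReadAmplification,
}

/// What the operator should do when moving between phases.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Plan {
    goal: Goal,
    run_full_compaction: bool,
}

/// Pick a tuning goal for the target phase, and request a one-off full
/// compaction only when leaving bulk ingestion for live traffic.
fn plan_transition(from: Phase, to: Phase) -> Plan {
    let goal = match to {
        Phase::BulkIngest => Goal::LowWriteAmplification,
        Phase::Live => Goal::LowReadAmplification,
    };
    let run_full_compaction = from == Phase::BulkIngest && to == Phase::Live;
    Plan { goal, run_full_compaction }
}
```

Calling `plan_transition(Phase::BulkIngest, Phase::Live)` yields a plan that favors low read amplification and asks for a full compaction, matching the order of steps in the paragraph.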

If the workload resembles a time-series database, the user may always populate and truncate data by time. In that case, even without compaction, this append-only data can still have low amplification on disk. In real life, therefore, you should watch for patterns or specific requirements from your users and use this information to optimize your system.