From ac335c91cc94aca0ad331dd0d52fcc9a54d2c395 Mon Sep 17 00:00:00 2001
From: Thomas Winant
Date: Thu, 31 Dec 2020 09:49:53 +0100
Subject: [PATCH 1/5] report: Immutable Database

---
 .../report/chapters/storage/immutabledb.tex | 922 ++++++++++++++++++
 ouroboros-consensus/docs/report/report.tex  |   1 +
 2 files changed, 923 insertions(+)

diff --git a/ouroboros-consensus/docs/report/chapters/storage/immutabledb.tex b/ouroboros-consensus/docs/report/chapters/storage/immutabledb.tex
index 54e0174c45b..61ee9eb3e73 100644
--- a/ouroboros-consensus/docs/report/chapters/storage/immutabledb.tex
+++ b/ouroboros-consensus/docs/report/chapters/storage/immutabledb.tex
@@ -1,2 +1,924 @@
+\newcommand{\chunkNumber}[1]{\ensuremath{\mathsf{chunkNumber}(#1)}}
+\newcommand{\relativeSlot}[2]{\ensuremath{\mathsf{relativeSlot}(#1, #2)}}
+
 \chapter{Immutable Database}
 \label{immutable}
+
+The Immutable DB is tasked with storing the blocks that are part of the
+\emph{immutable} part of the chain. Because of the nature of this task, the
+requirements and \emph{non-requirements} of this component are fairly specific:
+
+\begin{itemize}
+\item \textbf{Append-only}: as it represents the immutable chain, blocks will
+  only be appended, in the same order as they occur in the chain. Blocks will
+  never be \emph{modified} or \emph{deleted}.
+\item \textbf{Reading}: the database should be able to return the block or
+  header stored at a given \emph{point} (combination of slot number and hash)
+  efficiently.\todo{define point somewhere?}
+\item \textbf{Efficient streaming}: when serving blocks or headers to other
+  nodes, we need to be able to stream ranges of \emph{consecutive} blocks or
+  headers efficiently. As described in \cref{serialisation:network:serialised},
+  it should be possible to stream \emph{raw} blocks and headers, without
+  serialising them.
+\item \textbf{Recoverability}: it must be possible to validate the blocks stored
+  in the database. When a block in the database is corrupt or missing, it is
+  sufficient to truncate the database, representing an immutable chain, to the
+  last valid block before the corrupt or missing block. The truncated blocks can
+  simply be downloaded again. It is therefore not necessary to be able to
+  recover the full database when blocks are missing.
+\end{itemize}
+
+While we already touched upon some of the non-requirements above, it is useful
+to highlight the following ones explicitly.
+
+\begin{itemize}
+\item \textbf{Queries}: besides looking up a single block by its point and
+  streaming ranges of consecutive blocks, the database does \emph{not} have to
+  be able to answer queries about blocks. No searching or filtering is needed.
+\item \textbf{Durability}: the system does not require the durability guarantee
+  that traditional database systems provide (the D in ACID). If the system
+  crashes right after appending a block, it is acceptable that the block in
+  question is truncated when recovering the database. Because of the overlap
+  with the Volatile DB\todo{link}, such a truncation is even likely to go
+  unnoticed. In the worst case, the truncated blocks can simply be downloaded
+  again.
+\end{itemize}
+
+Because of the specific requirements and non-requirements listed above, we
+decided to write our own implementation, the \lstinline!ImmutableDB!, instead of
+using an existing off-the-shelf database system. Traditional database systems
+provide guarantees that are not needed and, conversely, do not take advantage of
+the requirements to optimise certain operations.
For example, there is no need +for a journal or flushing (\lstinline!fsync!) the buffers after each write +because of our unique durability and recoverability (non-)requirements. + +\section{API} +\label{immutable:api} + +Before we describe the implementation of the Immutable DB, we first describe its +functionality. The Immutable DB has the following API: + +\begin{lstlisting} +data ImmutableDB m blk = ImmutableDB { + closeDB :: m () + + , getTip :: STM m (WithOrigin (Tip blk)) + + , appendBlock :: blk -> m () + + , getBlockComponent :: + forall b. + BlockComponent blk b + -> RealPoint blk + -> m (Either (MissingBlock blk) b) + + , stream :: + forall b. + ResourceRegistry m + -> BlockComponent blk b + -> StreamFrom blk + -> StreamTo blk + -> m (Either (MissingBlock blk) (Iterator m blk b)) + } +\end{lstlisting} + +The database is parameterised over the block type \lstinline!blk! and the monad +\lstinline!m!, like most of the consensus layer.\todo{mention io-sim} +\todo{TODO} Mention our use of records for components? + +The \lstinline!closeDB! operation closes the database, allowing all opened +resources, including open file handles, to be released. This is typically only +used when shutting down the entire system. Calling any other operation on an +already-closed database should result in an exception. + +The \lstinline!getTip! operation returns the current tip of the Immutable DB. +The \lstinline!Tip! type contains information about the block at the tip like +the slot number, the block number, the hash, etc. The \lstinline!WithOrigin! +type is isomorphic to \lstinline!Maybe! and is used to account for the +possibility of an empty database, i.e., when the tip is at the ``origin'' of the +chain. This operation is an \lstinline!STM! operation, which allows it to be +combined with other \lstinline!STM! operations in a single transaction, to +obtain a consistent view on them. This also implies that no IO or disk access is +needed to obtain the current tip. + +The \lstinline!appendBlock! operation appends a block to the Immutable DB. As +slot numbers increase monotonically in the blockchain, the block's slot must be +greater than the current tip's slot (or equal when the tip points at an EBB, see +\cref{ebbs}). It is not required that each slot is filled,\todo{link?} so there +can certainly be gaps in the slot numbers. + +The \lstinline!getBlockComponent! operation allows reading one or more +components of the block in the database at the given point. We discuss what +block components are in \cref{immutable:api:block-component}. The +\lstinline!RealPoint! type represents a point that can only refer to a block, +not to genesis (the empty chain), which the larger \lstinline!Point! type +allows. As the given point might not be in the Immutable DB, this operation can +also return a \lstinline!MissingBlock! error instead of the requested block +component. + +The \lstinline!stream! operation returns an iterator to efficiently stream the +blocks between the two given bounds. The bounds are defined as such: +\begin{lstlisting} +data StreamFrom blk = + StreamFromInclusive !(RealPoint blk) + | StreamFromExclusive !(Point blk) + +newtype StreamTo blk = + StreamToInclusive (RealPoint blk) +\end{lstlisting} +Lower bounds can be either inclusive or exclusive, but exclusive upper bounds +were omitted because they were not needed in practice. An inclusive bound must +refer to a block, not genesis, hence the use of \lstinline!RealPoint!. 
The +exclusive lower bound \emph{can} refer to genesis, hence the use of +\lstinline!Point!, in particular to begin streaming from the start of the chain. +As one or both of the bounds might not be in the Immutable DB, this operation +can return a \lstinline!MissingBlock! error. We discuss what block components +are in \cref{immutable:api:block-component}. The \lstinline!ResourceRegistry! +will be used to allocate all the resources the iterator opens during its +lifetime, e.g., file handles. By closing the registry in case of an exception +(using \lstinline!bracket!), all open resources are released and nothing is +leaked.\todo{link} More discussion about iterators follows in +\cref{immutable:api:iterators}. + +\subsection{Block Component} +\label{immutable:api:block-component} + +\todo{TODO} move to ChainDB? + +Besides reading or streaming blocks from the Immutable DB, it must be possible +to read or stream headers, raw blocks (see +\cref{serialisation:network:serialised}), but in some cases also \emph{nested +contexts} (see \cref{serialisation:storage:nested-contents}) or even block +sizes. Adding an operation to the API for each of these would result in too much +duplication. We handle this with the \lstinline!BlockComponent! abstraction: +when reading or streaming, one can choose which \emph{components} of a block +should be returned, e.g., the block itself, the header of the block, the size of +the block, the raw block, the raw header, etc. We model this with the following +GADT: + +\begin{lstlisting} +data BlockComponent blk a where + GetVerifiedBlock :: BlockComponent blk blk + GetBlock :: BlockComponent blk blk + GetRawBlock :: BlockComponent blk ByteString + GetHeader :: BlockComponent blk (Header blk) + GetRawHeader :: BlockComponent blk ByteString + GetHash :: BlockComponent blk (HeaderHash blk) + GetSlot :: BlockComponent blk SlotNo + GetIsEBB :: BlockComponent blk IsEBB + GetBlockSize :: BlockComponent blk Word32 + GetHeaderSize :: BlockComponent blk Word16 + GetNestedCtxt :: BlockComponent blk (SomeSecond (NestedCtxt Header) blk) + .. +\end{lstlisting} +The \lstinline!a! type index determines the type of the block component. +Additionally, we have \lstinline!Functor! and \lstinline!Applicative! instances. +The latter allows combining multiple \lstinline!BlockComponent!s into one. This +can be considered a small DSL for querying components of a block. + +\subsection{Iterators} +\label{immutable:api:iterators} + +The following API can be used to interact with an iterator: + +\begin{lstlisting} +data Iterator m blk b = Iterator { + iteratorNext :: m (IteratorResult b) + , iteratorHasNext :: STM m (Maybe (RealPoint blk)) + , iteratorClose :: m () + } + +data IteratorResult b = + IteratorExhausted + | IteratorResult b +\end{lstlisting} + +The \lstinline!iteratorNext! operation returns the current +\lstinline!IteratorResult! and advances the iterator to the next block in the +stream. When the iterator has reached its upper bound, it is exhausted. Remember +that the \lstinline!b! type argument corresponds to the requested block +component. + +The \lstinline!iteratorHasNext! operation returns the point corresponding to the +block the next call to \lstinline!iteratorNext! will return. When exhausted, +\lstinline!Nothing! is returned. + +As an open iterator can hold onto resources, e.g., open file handles, it should +be explicitly closed using the \lstinline!iteratorClose! operation. 
Interacting with a closed iterator should result in an exception, except for
+calling \lstinline!iteratorClose!, which is idempotent.
+
+\section{Implementation}
+\label{immutable:implementation}
+
+We will now give a high-level overview of our custom implementation of the
+Immutable DB that satisfies the requirements and the API.
+
+\begin{itemize}
+\item We store blocks sequentially in a file, called a \emph{chunk file}. We
+  append each raw block, without any extra information before or after it, to
+  the chunk file. This will facilitate efficient binary streaming of blocks. In
+  principle, it is a matter of copying bytes from one buffer to another, without
+  any additional processing needed.
+
+\item Every $x$ \emph{slots}, where $x$ is the configurable chunk size, we start
+  a new chunk file to avoid storing all blocks in a single file.
+
+\item To facilitate looking up a block by a point, which consists of the hash
+  and the slot number, we ``index'' our database by slot numbers. One can then
+  look up the block in the given slot and compare its hash against the point's
+  hash. No searching will be needed.
+
+  Blocks are stored sequentially in chunk files, but slot numbers do \emph{not}
+  increase one-by-one; they are \emph{sparse}. This means we need a mapping from
+  the slot number to the offset and size of the block in the chunk file. We
+  store this mapping in the on-disk \emph{primary index}, one per chunk file,
+  which we discuss in more detail in \cref{immutable:implementation:indices}.
+
+\item As mentioned above, when looking up a block by a point, we will compare
+  the hash of the block at the point's slot in the Immutable DB with the point's
+  hash. We should be able to do this without first having to read and
+  deserialise the entire block in order to know its hash.
+
+  Moreover, it should be possible to read just the header of the block without
+  first having to read the entire block. As described in
+  \cref{serialisation:storage:nested-contents}, we can do this if we have access
+  to the header offset, header size, and nested context of the block.
+
+  For these reasons, we store the aforementioned extra information, which should
+  be available without having to read and deserialise the entire block,
+  separately in the on-disk \emph{secondary index}, one per chunk file
+  (\cref{immutable:implementation:indices}).
+
+\item All the information stored in the primary and secondary indices can be
+  recovered from the blocks in the chunk files. This is described in
+  \cref{immutable:implementation:recovery}.
+
+\item Whenever a file-system operation fails, or a file is missing or corrupted,
+  we shut down the Immutable DB and consequently the whole system. When this
+  happens, either the system's file system is no longer reliable (e.g., disk
+  corruption), manual intervention (e.g., disk is full) is required, or there is
+  a bug in the system. In all cases, there is no point in trying to continue
+  operating. We shut down the system and flag the shutdown as \emph{dirty},
+  triggering a full validation on the next start-up; see
+  \cref{immutable:implementation:recovery}.
+
+  Not all forms of disk corruption can easily be detected. For example, when
+  some bytes in a block stored in a chunk file have been flipped on disk, this
+  can easily go unnoticed. Deserialising the block might fail if the
+  serialisation format is no longer valid, but the bitflip could also happen in,
+  e.g., the amount of a transaction, which will not be detected by the
+  deserialiser.
+  In fact, the majority of blocks read will not even be deserialised, as blocks
+  served to other nodes are read and sent in their raw, still serialised format.
+  However, sending a corrupted block must be avoided, as nodes receiving it will
+  consider it invalid and can blacklist us, mistaking us for an adversary.
+
+  To detect such forms of silent corruption, we store CRC32 checksums in the
+  secondary index (\cref{immutable:implementation:indices}), which we verify
+  when reading the blocks; this verification can be done even without
+  deserialising them. Note that we could use the block's own hash for this
+  purpose,\footnote{To be precise: we would have to check the block body against
+  the body hash stored in the header, and verify the signature of the header.}
+  but because computing such a cryptographic hash is much more expensive, we
+  opted for a separate CRC32 checksum, which is much more efficient to compute
+  and designed for exactly this purpose.
+
+\item We store the state of the current chunk, including its indices, in memory.
+  We store this state, a pure data type, in a \lstinline!StrictMVar!. Besides
+  avoiding space leaks by forcing its contents to WHNF, this
+  \lstinline!StrictMVar! type has another useful ability that its standard
+  non-strict variant lacks: while it is locked when being modified, the
+  previous, \emph{stale} value can still be read.
+
+  This is convenient for the Immutable DB: we can support multiple concurrent
+  reads even while an append operation is in progress (there is at most one
+  append at a time), as it is safe to read a block based on the stale state
+  because data will only be appended, not modified.
+
+  To append a block to the Immutable DB, we lock the state to avoid concurrent
+  append operations. We append the block to the chunk file, and append the
+  necessary information to the primary and secondary indices. Finally, we unlock
+  the state, updated with the information about the newly appended block.
+
+\item We \emph{do not flush} any writes to disk, as discussed in the
+  introduction of this chapter. This makes appending a block quite cheap: the
+  serialised block is copied to an OS buffer, which is then asynchronously
+  flushed in the background.
+
+\item To avoid repeatedly reading and deserialising the same primary and
+  secondary indices of older chunks, we cache them in an LRU cache that is
+  bounded in size.
+
+\item To open an iterator, we check its bounds using the (cached) indices. The
+  bounds are valid when both correspond to blocks present in the Immutable DB.
+  Next, a file handle is opened for the chunk file containing the first block to
+  stream. The same chunk file's indices are read (from the cache) and the
+  iterator will maintain a list of secondary index entries, one for each block
+  to stream from the chunk file. By having this list of entries in memory, the
+  indices will not have to be accessed for each streamed block.
+
+  When a block component is requested from the iterator, it is read from the
+  chunk file and/or extracted from the corresponding in-memory entry.
+  Afterwards, the entry is dropped from the in-memory list so that the next
+  entry is ready to be read. When the list of entries is exhausted without
+  reaching the end bound, we move on to the next chunk file. This process
+  repeats until the end bound is reached.
+\end{itemize}
+
+\subsection{Chunk layout}
+\label{immutable:implementation:chunk-layout}
+
+Each block in the block chain has a unique slot number (except for EBBs, which
+we discuss below).
Slot numbers increase in the blockchain, but not all slots
+have to be filled. For example, in the Byron era (using the Permissive BFT
+consensus algorithm), nearly every slot will be filled, but in the Shelley era
+(using the Praos consensus algorithm), on average only one in twenty slots will
+be filled.
+
+As mentioned above, we want to group blocks into chunk files. Because we need to
+be able to look up blocks in the Immutable DB based on their slot number, we
+group blocks into chunk files based on their slot numbers so that the chunk file
+containing a block can be determined by looking at the slot number of the block.
+
+Internally, we translate \emph{absolute} slot numbers into \emph{chunk numbers}
+and \emph{relative slot numbers} (relative w.r.t.\ the chunk). As EBBs
+(\cref{ebbs}) have the same slot number as their successor, this translation is
+not injective. To restore injectivity, we include ``whether the block is an EBB
+or not'' as an input to the translation.
+
+\todo{TODO} how should this be formatted?
+
+\begin{definition}[Chunk number]
+  Let $s$ be the absolute slot number of a block. Using a chunk size of
+  $\mathit{sz}$:
+
+  \[
+  \chunkNumber{s} = \lfloor s / \mathit{sz} \rfloor
+  \]
+  Naturally, chunks are zero-indexed.
+
+\end{definition}
+
+\begin{definition}[Relative slot number]
+  Let $s$ be the absolute slot number of a block. Using a chunk size of
+  $\mathit{sz}$:
+
+  \[
+  \relativeSlot{s}{\mathit{isEBB}} =
+  \begin{cases}
+    0 & \text{if}\,\mathit{isEBB} \\
+    (s \bmod \mathit{sz}) + 1 & \text{otherwise}
+  \end{cases}
+  \]
+  We reserve the very first relative slot for an EBB, hence the need to make
+  room for it by incrementing by one in the non-EBB case.
+\end{definition}
+
+In the example below, we show a chunk with chunk number 1 using a chunk size of
+100:
+
+\begin{center}
+\begin{tikzpicture}
+\draw (0, 0) -- (10, 0);
+\draw (0, 1) -- (10, 1);
+
+\draw ( 0, 0) -- ( 0, 1);
+\draw ( 1, 0) -- ( 1, 1);
+\draw ( 2, 0) -- ( 2, 1);
+\draw ( 3, 0) -- ( 3, 1);
+\draw ( 4, 0) -- ( 4, 1);
+\draw ( 8, 0) -- ( 8, 1);
+\draw ( 9, 0) -- ( 9, 1);
+\draw (10, 0) -- (10, 1);
+
+\draw (0.5, 0.5) node {\small EBB};
+\draw (1.5, 0.5) node {\small Block};
+\draw (2.5, 0.5) node {\small Block};
+\draw (3.5, 0.5) node {\small Block};
+\draw (6.0, 0.5) node {\small \ldots};
+\draw (8.5, 0.5) node {\small Block};
+\draw (9.5, 0.5) node {\small Block};
+
+\draw (-2.0, -0.5) node {\small Absolute slot numbers};
+\draw ( 0.5, -0.5) node {\small 100};
+\draw ( 1.5, -0.5) node {\small 100};
+\draw ( 2.5, -0.5) node {\small 101};
+\draw ( 3.5, -0.5) node {\small 103};
+\draw ( 8.5, -0.5) node {\small 197};
+\draw ( 9.5, -0.5) node {\small 199};
+
+\draw (-2.0, -1.2) node {\small Relative slot numbers};
+\draw ( 0.5, -1.2) node {\small 0};
+\draw ( 1.5, -1.2) node {\small 1};
+\draw ( 2.5, -1.2) node {\small 2};
+\draw ( 3.5, -1.2) node {\small 4};
+\draw ( 8.5, -1.2) node {\small 98};
+\draw ( 9.5, -1.2) node {\small 100};
+\end{tikzpicture}
+\end{center}
+Note that some slots are empty, e.g., 102 and 198 are missing. The first and
+last slots can be empty too. In practice, it will never be the case that an
+entire chunk is empty, but the implementation allows for it.
+
+If we were to pick a chunk size of 1 and store each block in its own file, we
+would need millions of files, as there are millions of blocks. When serving
+blocks to peers, we would constantly open and close individual block files,
+which is very inefficient.
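+
+As an aside, the slot translation defined above is easy to express in code. The
+following is a minimal Haskell sketch, under the assumption of a fixed chunk
+size and with a plain \lstinline!Bool! standing in for the EBB flag; the type
+and function names are illustrative only and do not match the actual
+implementation.
+
+\begin{lstlisting}
+import Data.Word (Word64)
+
+newtype SlotNo       = SlotNo Word64       -- absolute slot number
+newtype ChunkSize    = ChunkSize Word64    -- number of slots per chunk
+newtype ChunkNumber  = ChunkNumber Word64
+newtype RelativeSlot = RelativeSlot Word64
+
+chunkNumber :: ChunkSize -> SlotNo -> ChunkNumber
+chunkNumber (ChunkSize sz) (SlotNo s) = ChunkNumber (s `div` sz)
+
+relativeSlot :: ChunkSize -> Bool -> SlotNo -> RelativeSlot
+relativeSlot (ChunkSize sz) isEBB (SlotNo s)
+  | isEBB     = RelativeSlot 0                  -- the reserved first slot
+  | otherwise = RelativeSlot (s `mod` sz + 1)   -- shifted to make room for an EBB
+\end{lstlisting}
+
+For example, with a chunk size of 100, slot 103 maps to chunk 1 and relative
+slot 4, matching the example above.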
+
+If we pick a very large or even unbounded chunk size, the resulting chunk file
+would be several gigabytes in size and keep growing. This would make the
+recovery process (\cref{immutable:implementation:recovery}) more complicated and
+potentially much slower, as more data might have to be read and
+validated.\todo{Other arguments?} Moreover, our current approach of caching
+indices per chunk would have to be revised.
+
+In practice, a chunk size of \num{21600} is used, which matches the \emph{epoch
+size} of Byron. It is no coincidence that there is (at most) one EBB at the
+start of each Byron epoch, fitting nicely in the first relative slot that we
+reserve for it. Originally, the Immutable DB called these chunk files
+\emph{epoch files}. With the advent of Shelley, which has a different epoch size
+than Byron, we decoupled the two and introduced the name ``chunk''.
+
+\paragraph{Dynamic chunk size}
+
+The \emph{chunking} scheme was designed with the possibility of a non-uniform
+chunk size in mind. Originally, the goal was to make the chunk size configurable
+such that the number of slots per chunk could change after a certain slot.
+Similarly, reserving an extra slot for an EBB would be optional and could
+stop after a certain slot, i.e., when the production of EBBs stopped. The
+reasoning behind this was to allow the chunk size to change near the transition
+from Byron to Shelley. As the slot density goes down by a factor of twenty when
+transitioning to Shelley, the number of blocks per chunk file and, consequently,
+the size of each chunk file would go down by the same factor, leading to too
+many, too small chunk files. The intention was to configure the chunk size to
+increase by the same factor at the start of the Shelley era.
+
+The transition to another era, e.g., Shelley, is dynamic: the slot at which it
+happens is determined by on-chain voting and is only known for certain a number
+of hours in advance. Making the mapping from slot number to chunk and relative
+slot number rely on the actual slot at which the transition happened would
+complicate things significantly. It would make the mapping depend on the ledger
+state, which determines the current era. This would introduce an unwanted
+coupling between the Immutable DB, which stores \emph{blocks}, and the ledger
+state obtained by applying these blocks. A reasonable compromise would be to
+hard-code the change in chunk size to the estimated transition slot. When the
+estimate is incorrect, only a few Byron chunks would contain more blocks than
+intended or only a few Shelley chunks would contain fewer blocks than intended.
+
+Unfortunately, due to lack of time, dynamic chunk sizes were not implemented in
+time for the transition to Shelley. This means the same chunk size is being used
+for the \emph{entire chain}, resulting in fewer blocks per Shelley chunk file
+than ideal, and, consequently, more chunk files than ideal.
+
+\paragraph{Indexing by block number}
+
+The problem of too many, too small chunk files described in the paragraph above
+is caused by the fact that slot numbers can be sparse and do not have to be
+consecutive. \emph{Block numbers} do not have the same problem: they are
+consecutive and thus dense, regardless of the era of the blockchain. If instead
+of indexing the Immutable DB by slot numbers, we indexed it by \emph{block
+numbers}, we would not have this problem.
Unfortunately, the point type, which is
+used throughout the network layer and the consensus layer to identify and look
+up blocks, consists of a hash and \emph{slot number}, not a block number.
+
+We would either have to maintain another index from slot number to block number,
+which would itself require either a chunking scheme based on slot numbers or
+just one big file, each of which has its own downsides; or points would have to
+be based on block numbers instead of slot numbers. As points are omnipresent,
+the latter change would be very far-reaching. We would prefer that approach, but
+it is currently out of the question. The former is more localised, but the
+uncertain benefits do not outweigh the complexity and risks involved in
+migrating deployed on-disk databases to the new format.
+
+\subsection{Indices}
+\label{immutable:implementation:indices}
+
+As mentioned before, we have on-disk indices for the chunk files for two
+purposes:
+\begin{enumerate}
+\item To map the sparse slot numbers to the blocks that are densely stored in
+  the chunk files.
+\item To store the information about a block that should be available without
+  having to read and deserialise the actual block, e.g., the header offset, the
+  header size, the CRC32 checksum, etc.
+\end{enumerate}
+We use a separate index for each task: the \emph{primary index} for the first
+task and the \emph{secondary index} for the second task. Each chunk file has a
+corresponding primary index file and secondary index file. Because of a
+dependency of the primary index on the secondary index, we first discuss the
+latter.
+
+\paragraph{Secondary index}
+
+In the secondary index, we store the information about a block that is needed
+before, or without, having to read and deserialise the block. The secondary
+index is an append-only file, like the chunk file, and contains a
+\emph{secondary index entry} for each block. For simplicity and robustness, a
+secondary index merely contains a series of densely stored secondary index
+entries with no extra information between, before, or after them. This avoids
+needing to initialise or finalise such a file, which also makes the recovery
+process simpler (\cref{immutable:implementation:recovery}). A secondary index
+entry consists of the following fields:
+
+\begin{center}
+\begin{tabular}{l r}
+  field & size [bytes] \\
+  \hline
+  block offset  & 8 \\
+  header offset & 2 \\
+  header size   & 2 \\
+  checksum      & 4 \\
+  header hash   & X \\
+  block or EBB  & 8 \\
+\end{tabular}
+\end{center}
+
+\begin{itemize}
+\item The block offset is used to determine at which offset in the corresponding
+  chunk file the raw block can be read.
+
+  As blocks are variable-sized, the size of the block also needs to be known in
+  order to read it. Instead of spending another 8 bytes to store the block size
+  as an additional field, we read the block offset of the \emph{next entry} in
+  the secondary index, which corresponds to the block after it. The block size
+  can then be computed by subtracting the latter's block offset from the
+  former's.
+
+  In case the block is the final block in the chunk file, there is no next
+  entry. Instead, the final block's size can be derived from the chunk file's
+  size. When reading the final block $B_n$ of the current chunk file, it is
+  important to obtain the chunk file size at the right time, before any more
+  blocks ($B_{n+1}, B_{n+2}, \ldots$) are appended to the same file, increasing
+  the chunk file size.
+  Otherwise, we risk reading the bytes corresponding to not
+  just the previously final block $B_n$, but also $B_{n+1}, B_{n+2},
+  \ldots$\footnote{In hindsight, storing the block size as a separate field would
+  have simplified the implementation.}
+
+  The reasoning behind using 8 bytes for the block offset is the following. The
+  maximum block header and block body sizes permitted by the blockchain itself
+  are dynamic parameters that can change through on-chain voting. At the time of
+  writing, the maximum header size is \num{1100} bytes and the maximum body size
+  is \num{65536} bytes. By multiplying this theoretical maximum block size of
+  $\num{1100} + \num{65536} = \num{66636}$ bytes by the chunk size used in
+  practice, i.e., \num{21600}, assuming a maximal density of 1.0 in the Byron
+  era, we get \num{1439337600} as the maximal file size for a chunk file. An
+  offset into a file of that size fits tightly in 4 bytes, but this would not
+  support any future maximum block size increases, hence the decision to use 8
+  bytes.
+
+\item The header offset and header size are needed to extract the header from a
+  block without first having to read and deserialise the entire block, as
+  discussed in \cref{serialisation:storage:nested-contents}. These are stored
+  per block, as the header size can differ from block to block. The nested
+  context is reconstructed by reading bytes from the start of the block, as
+  explained in our discussion of the \lstinline!ReconstructNestedCtxt! class in
+  \cref{serialisation:storage:nested-contents}.
+
+  Using 2 bytes for the header offset and header size is enough when taking the
+  following into account: (so far all types of) blocks start with their header,
+  the current maximum header size is \num{1100} bytes, and the header offset is
+  relative to the start of the block.
+
+\item As discussed before, to detect silent corruption of blocks, we store a CRC
+  checksum of each block, against which the block is verified after reading it.
+  This verification can be done without deserialising the block.
+
+  Note that we do not store a checksum of the raw header, which means we do not
+  check for silent corruption when streaming headers.\todo{Maybe we should?}
+
+\item The header hash is used for lookups and bounds checking, i.e., to check
+  whether the given point's hash matches the hash of the block at the same slot
+  in the Immutable DB. By storing it separately, we do not have to read the
+  block's header and compute its hash just to check whether it is the right one.
+
+  The header hash field's size depends on the concrete instantiation of the
+  \lstinline!HeaderHash blk! type family. In practice, a 32-byte hash is used.
+
+\item The ``block or EBB'' field is represented in memory as follows:
+
+  \begin{lstlisting}
+  data BlockOrEBB =
+      Block !SlotNo
+    | EBB !EpochNo
+  \end{lstlisting}
+
+  The former constructor represents a regular block with an absolute slot number
+  and the latter an EBB (\cref{ebbs}) with an epoch number (since there is only
+  a single EBB per epoch). The main reason this field is part of the secondary
+  index entry is to implement the \lstinline!iteratorHasNext! method of the
+  iterator API (see \cref{immutable:api:iterators}) without having to read the
+  next block from disk, as the iterator will keep these secondary index entries
+  in memory.
+
+  Both the \lstinline!SlotNo! and \lstinline!EpochNo! types are newtypes around
+  a \lstinline!Word64!, hence the 8 on-disk bytes.
+  We omit the tag
+  distinguishing between the two constructors in the serialisation because in
+  nearly all cases, this information has already been retrieved from the primary
+  index, i.e., whether the first filled slot in a chunk is an EBB or
+  not.\footnote{In hindsight, having the tag in the serialisation would have
+  simplified the implementation.}
+
+\item Because of the fixed size of each field, it was originally decided to
+  (de)serialise the corresponding data type using the \lstinline!Binary! class.
+  Using CBOR would be more flexible with respect to future changes. This would
+  make the encoding variable-sized, which is not necessarily an issue, as will
+  become clear in our description of the primary index.
+
+\end{itemize}
+
+\paragraph{Primary index}
+
+The primary index maps the sparse slot numbers to the secondary index entries of
+the corresponding blocks in the dense secondary index. As discussed above, the
+secondary index entry of a block tells us the offset in the chunk file of the
+corresponding block.
+
+The format of the primary index is as follows. The primary index starts with a
+byte indicating its version number. Next, for each slot, empty or not, we store
+the offset at which its secondary index entry starts in the secondary index.
+This same offset will correspond to the \emph{end} of the previous secondary
+index entry. When a slot is empty, its offset will be the same as the offset of
+the slot before it, indicating that the corresponding secondary index entry is
+empty.
+
+When appending a new block, we append the previous offset once for every slot
+that was skipped, indicating that they are empty. Next, we append the offset
+after the newly appended secondary index entry corresponding to the new block.
+
+We use a fixed size of 4 bytes to store each offset. As this is an offset in the
+secondary index, it should be at least large enough to address the maximal size
+of a secondary index file. We can compute this by multiplying the used chunk
+size by the size of a secondary index entry: $\num{21600} \times (8 + 2 + 2 + 4
++ 32 + 8) = \num{1209600}$, which requires more than 2 bytes to address.
+
+To look up the secondary index entry for a certain slot, we compute the
+corresponding chunk number and relative slot number using $\mathsf{chunkNumber}$
+and $\mathsf{relativeSlot}$ (we discuss how we deal with EBBs later). Because we
+use a fixed size for each offset, based on the relative slot number, we can
+compute exactly which bytes to read at which offset in the primary index, i.e.,
+the $4 + 4$ bytes corresponding to the offset at the relative slot and the
+offset after it. When both offsets are equal, the slot is empty. When they are
+not equal, we know which bytes to read from the secondary index to obtain the
+secondary index entry corresponding to the block in question.
+
+However, as mentioned in \cref{immutable:implementation}, we maintain a cache of
+primary indices, which means that they are always read from disk in their
+entirety. After a cache hit, looking up a relative slot in the cached primary
+index corresponds to a constant-time lookup in a vector.
+
+We illustrate this format with an example primary index below, which matches the
+chunk from the example in \cref{immutable:implementation:chunk-layout}. The
+offsets correspond to the blocks on the line below them, where $\emptyset$
+indicates an empty slot. We assume a fixed size of 10 bytes for each secondary
+index entry.
+The offset $X$ corresponds to the final size of the secondary index.
+
+\begin{center}
+\begin{tikzpicture}
+\draw (0, 0) -- (12, 0);
+\draw (0, 1) -- (12, 1);
+
+\draw ( 0, 0) -- ( 0, 1);
+\draw ( 1, 0) -- ( 1, 1);
+\draw ( 2, 0) -- ( 2, 1);
+\draw ( 3, 0) -- ( 3, 1);
+\draw ( 4, 0) -- ( 4, 1);
+\draw ( 5, 0) -- ( 5, 1);
+\draw ( 6, 0) -- ( 6, 1);
+\draw ( 8, 0) -- ( 8, 1);
+\draw ( 9, 0) -- ( 9, 1);
+\draw (10, 0) -- (10, 1);
+\draw (11, 0) -- (11, 1);
+\draw (12, 0) -- (12, 1);
+
+\draw (-2.0, 0.5) node {\small Offsets};
+\draw ( 0.5, 0.5) node {\scriptsize 0};
+\draw ( 1.5, 0.5) node {\scriptsize 10};
+\draw ( 2.5, 0.5) node {\scriptsize 20};
+\draw ( 3.5, 0.5) node {\scriptsize 30};
+\draw ( 4.5, 0.5) node {\scriptsize 30};
+\draw ( 5.5, 0.5) node {\scriptsize 40};
+\draw ( 7.0, 0.5) node {\scriptsize \ldots};
+\draw ( 8.5, 0.5) node {\scriptsize $X - 20$};
+\draw ( 9.5, 0.5) node {\scriptsize $X - 10$};
+\draw (10.5, 0.5) node {\scriptsize $X - 10$};
+\draw (11.5, 0.5) node {\scriptsize $X$};
+
+\draw[dashed] (0.5, -1) -- (11.5, -1);
+
+\draw[dashed] ( 0.5, -1) -- ( 0.5, 0);
+\draw[dashed] ( 1.5, -1) -- ( 1.5, 0);
+\draw[dashed] ( 2.5, -1) -- ( 2.5, 0);
+\draw[dashed] ( 3.5, -1) -- ( 3.5, 0);
+\draw[dashed] ( 4.5, -1) -- ( 4.5, 0);
+\draw[dashed] ( 5.5, -1) -- ( 5.5, 0);
+\draw[dashed] ( 8.5, -1) -- ( 8.5, 0);
+\draw[dashed] ( 9.5, -1) -- ( 9.5, 0);
+\draw[dashed] (10.5, -1) -- (10.5, 0);
+\draw[dashed] (11.5, -1) -- (11.5, 0);
+
+\draw (-2.0, -0.5) node {\small Blocks};
+\draw ( 1.0, -0.5) node {\scriptsize EBB};
+\draw ( 2.0, -0.5) node {\scriptsize Block};
+\draw ( 3.0, -0.5) node {\scriptsize Block};
+\draw ( 4.0, -0.5) node {\scriptsize $\emptyset$};
+\draw ( 5.0, -0.5) node {\scriptsize Block};
+\draw ( 7.0, -0.5) node {\scriptsize \ldots};
+\draw ( 9.0, -0.5) node {\scriptsize Block};
+\draw (10.0, -0.5) node {\scriptsize $\emptyset$};
+\draw (11.0, -0.5) node {\scriptsize Block};
+
+\draw (-2.0, -1.5) node {\small Absolute slot numbers};
+\draw ( 1.0, -1.5) node {\scriptsize 100};
+\draw ( 2.0, -1.5) node {\scriptsize 100};
+\draw ( 3.0, -1.5) node {\scriptsize 101};
+\draw ( 4.0, -1.5) node {\scriptsize 102};
+\draw ( 5.0, -1.5) node {\scriptsize 103};
+\draw ( 7.0, -1.5) node {\scriptsize \ldots};
+\draw ( 9.0, -1.5) node {\scriptsize 197};
+\draw (10.0, -1.5) node {\scriptsize 198};
+\draw (11.0, -1.5) node {\scriptsize 199};
+
+\draw (-2.0, -2.2) node {\small Relative slot numbers};
+\draw ( 1.0, -2.2) node {\scriptsize 0};
+\draw ( 2.0, -2.2) node {\scriptsize 1};
+\draw ( 3.0, -2.2) node {\scriptsize 2};
+\draw ( 4.0, -2.2) node {\scriptsize 3};
+\draw ( 5.0, -2.2) node {\scriptsize 4};
+\draw ( 7.0, -2.2) node {\scriptsize \ldots};
+\draw ( 9.0, -2.2) node {\scriptsize 98};
+\draw (10.0, -2.2) node {\scriptsize 99};
+\draw (11.0, -2.2) node {\scriptsize 100};
+
+\end{tikzpicture}
+\end{center}
+
+The version number we mentioned above can be used to migrate indices in the old
+format to a newer format, should the need arise in the future. We do not
+include a version number in the secondary index, as both index formats are
+tightly coupled, which means that both index files should be migrated together.
+
+One might realise that because the size of a secondary index entry is static,
+the primary index could be represented more compactly using a bitmap. This is
+indeed the case and the reason for it not being a bitmap is mostly a historical
+accident.
However, this accident has the upside that migrating to variable-sized
+secondary index entries, e.g., serialised using CBOR instead of
+\lstinline!Binary!, is straightforward.
+
+\paragraph{Lookup}
+
+Having discussed both index formats, we can now finally detail the process of
+looking up a block by a point. Given a point with slot $s$ and hash $h$, we need
+to go through the following steps to read the corresponding block:
+
+\begin{enumerate}
+\item Determine the chunk number $c = \chunkNumber{s}$.
+\item Determine the relative slot $\mathit{rs}$ within chunk $c$ corresponding
+  to $s$: $\mathit{rs} = \relativeSlot{s}{\mathit{isEBB}}$.
+
+  Note the $\mathit{isEBB}$ argument, which is unknown at this point. Just by
+  looking at the slot and the static chunk size, we can tell whether the block
+  \emph{could} be an EBB or not: only the very first slot in a chunk (which has
+  the same size as a Byron epoch) could correspond to an EBB \emph{or} the
+  regular block after it. For all other slots, we are certain that they cannot
+  correspond to an EBB.
+
+  In case the slot $s$ corresponds to the very first slot in the chunk, we will
+  have to use the hash $h$ to determine whether the point corresponds to the EBB
+  or the regular block in slot $s$.
+\item We look up the offset at $\mathit{rs}$ and the offset after it in the
+  primary index of chunk $c$. As discussed, these lookups go through a cache and
+  are cheap. We now have the offsets in the secondary index file corresponding
+  to the start and end of the secondary index entry we are interested in. If
+  both offsets are equal, the slot is empty, and the lookup process terminates.
+
+  In case of a potential EBB, we have to do two such lookups: one for relative
+  slot 0 and one for relative slot 1.
+
+\item We read the secondary index entry from the secondary index file. The
+  secondary indices are also cached on a per-chunk basis. The secondary index
+  entry contains the header hash, which we can now compare against $h$. In case
+  of a match, we can read the block from the chunk file using the block offset
+  contained in the secondary index entry. When the hash does not match, the
+  lookup process terminates.
+
+  In case of a potential EBB, the hash comparisons finally tell us whether the
+  point corresponds to the EBB or the regular block in slot $s$, or
+  \emph{neither}, in case neither hash matches $h$.
+\end{enumerate}
+
+\subsection{Recovery}
+\label{immutable:implementation:recovery}
+
+Because of the specific requirements of the Immutable DB and the expected write
+patterns, we can use a much simpler recovery scheme than traditional database
+systems. Only the immutable, append-only part of the chain is stored, which
+means that data inconsistencies (e.g., because of a hard shutdown) are most
+likely to happen at the end of the chain, i.e., in the last chunk and its
+indices. We can simply truncate the chain in such cases. As we maintain some
+overlap with the Volatile DB\todo{link}, blocks truncated from the end of the
+chain are likely to still be in the Volatile DB, making the recovery process
+unnoticeable. If the overlap is not enough and the truncated blocks are not in
+the Volatile DB, they can simply be downloaded again.
+
+There are two modes of recovery:
+\begin{enumerate}
+\item Validate the last chunk: this is the default mode when opening the
+  Immutable DB. The last chunk file and its indices are validated.
This will
+  detect and truncate append operations that did not go through entirely, e.g.,
+  a block that was only partially appended to a chunk file, or a block that was
+  appended to one or both of the indices, but not to the chunk file.
+
+  When a chunk file ends up empty after truncation, we validate the
+  chunk file before it. In the unlikely case that that chunk file has to be
+  truncated and ends up empty too, we validate the chunk file before it and so
+  on, until we reach a valid block or the database is empty.
+
+\item Validate all chunks: this is the full recovery mode that is triggered by a
+  dirty shutdown, caused by a missing or corrupted file (e.g., a checksum
+  mismatch while reading), or because the node itself was not shut down
+  properly.\todo{In the latter case, validating the last would be enough} We
+  validate all chunk files and their indices, from oldest to newest. When a
+  corrupt or missing block is encountered, we truncate the chain to the last
+  valid block before it. Trying to recover from a chain with holes in it would
+  be terribly complex; we therefore do not even try it.
+
+\end{enumerate}
+In both recovery modes, chunks are validated the same way, which we describe
+shortly. When in full recovery mode, we also check whether the last block in a
+chunk is the predecessor of the first block in the next chunk, by comparing the
+hashes. This helps sniff out a truncated chunk file that is not the final one,
+which would cause a gap in the chain.
+
+Validating a chunk proceeds as follows:
+\begin{itemize}
+\item In the common case, the chunk file and the corresponding primary and
+  secondary index files will be present and all valid. We optimise for this
+  case.\footnote{Unlike in other areas, where we try to maintain that the
+  average case is equal to the worst case.}
+
+\item The secondary index contains a CRC32 checksum of each block in the
+  corresponding chunk (see \cref{immutable:implementation:indices}); we extract
+  these checksums and pass them to the \emph{chunk file parser}.
+
+\item The chunk file parser will try to deserialise all blocks in a chunk file.
+  When a block fails to deserialise, it is treated as corrupt and we truncate
+  the chain to the last valid block before it. Each raw block is also checked
+  against the CRC32 checksum from the secondary index, to detect corruptions
+  that are not caught by deserialising, e.g., flipping a bit in a
+  \lstinline!Word64!, which can remain a valid, yet corrupt
+  \lstinline!Word64!.\footnote{One might think that deserialising the blocks is
+  not necessary if the checksums all match. However, the chunk file parser also
+  constructs the corresponding secondary index, which is used to validate the
+  on-disk one. For this process, deserialisation is required.}
+
+  When the CRC32 checksum is not available, because of a missing or partial
+  secondary index file, we fall back to the more expensive validation of the
+  block based on its cryptographic hashes to detect silent corruption. This type
+  of validation is block-dependent and provided in the form of the
+  \lstinline!nodeCheckIntegrity! method of the \lstinline!NodeInitStorage!
+  class. This validation is implemented by hashing the body of the block and
+  comparing it against the body hash stored in the header, and by verifying the
+  signature of the header.
+
+  When the CRC32 checksum \emph{is} available, but does not match the one
+  computed from the raw block, we also fall back to this validation, as we do
+  not know whether the checksum or the block was corrupted (although the latter
+  is far more likely).
+
+  The chunk file parser also verifies that the hashes line up within a chunk, to
+  detect missing blocks. It does this by comparing the ``previous hash'' of each
+  block with the previous block's hash.
+
+  The chunk file parser returns a list of secondary index entries, which
+  together form the corresponding secondary index.
+
+\item The chunk file containing the blocks is our source of truth. To check the
+  validity of the secondary index, we check whether it matches the secondary
+  index returned by the chunk file parser. If there is a mismatch, we overwrite
+  the entire secondary index file using the secondary index returned by the
+  chunk file parser.
+
+\item We can reconstruct the primary index from the secondary index returned by
+  the chunk file parser. When the on-disk primary index is missing or it is not
+  equal to the reconstructed one, we (over)write it using the reconstructed one.
+
+\item When truncating the chain, we always make sure that the resulting chain
+  ends with a block, i.e., a filled slot, not an empty slot, even if this means
+  going back to the previous chunk.
+
+\end{itemize}
+
+We test the recovery process in our \lstinline!quickcheck-state-machine! tests
+of the Immutable DB. In various states of the Immutable DB, we generate one or
+more random corruptions for any of the on-disk files: either a simple deletion
+of the file, a truncation, or a random bitflip. We verify that after restarting
+the Immutable DB, it has recovered to the last valid block before the
+corruption.
+
+Additionally, in those same tests, we simulate file-system errors during
+operations. For example, while appending a block, we let the second disk write
+fail. This is another way of testing whether we can correctly recover from a
+write that was aborted half-way through.
diff --git a/ouroboros-consensus/docs/report/report.tex b/ouroboros-consensus/docs/report/report.tex
index 625736b3e0b..a614ebb5e84 100644
--- a/ouroboros-consensus/docs/report/report.tex
+++ b/ouroboros-consensus/docs/report/report.tex
@@ -8,6 +8,7 @@ \usepackage{listings}
 \usepackage[nameinlink]{cleveref}
 \usepackage{microtype}
+\usepackage[group-separator={,}]{siunitx}
 \hypersetup{
   pdftitle={The Cardano Consensus and Storage Layer},

From ccadc7a9d903138e84e0de284b00be4c61a31c78 Mon Sep 17 00:00:00 2001
From: Thomas Winant
Date: Wed, 6 Jan 2021 11:16:55 +0100
Subject: [PATCH 2/5] report: Volatile Database

---
 .../report/chapters/storage/volatiledb.tex | 398 ++++++++++++++++++
 1 file changed, 398 insertions(+)

diff --git a/ouroboros-consensus/docs/report/chapters/storage/volatiledb.tex b/ouroboros-consensus/docs/report/chapters/storage/volatiledb.tex
index 46aa1786c53..e54c34ee9c2 100644
--- a/ouroboros-consensus/docs/report/chapters/storage/volatiledb.tex
+++ b/ouroboros-consensus/docs/report/chapters/storage/volatiledb.tex
@@ -1,2 +1,400 @@
 \chapter{Volatile Database}
 \label{volatile}
+
+The Volatile DB is tasked with storing the blocks that are part of the
+\emph{volatile} part of the chain. Do not be misled by its name: the Volatile DB
+\emph{should persist} the blocks it stores to disk.
The volatile part of the chain +consists of the last $k$ (the security parameter, see +\cref{consensus:overview:k}) blocks of the chain, which can still be rolled back +when switching to a fork. This means that unlike the Immutable DB, which stores +the immutable prefix of \emph{the} chosen chain, the Volatile DB can store +potentially multiple chains, one of which will be the current chain. It will +also store forks that we have switched away from, or will still switch to, when +they grow longer and become preferable to our current chain. Moreover, the +Volatile DB can contain disconnected blocks, as the block fetch +client\todo{link} might download or receive blocks out of order. + +We list the requirements and non-requirements of this component in no particular +order. Note that some of these requirements were defined in response to the +requirements of the Immutable DB (see \cref{immutable}), and vice versa. + +\begin{itemize} +\item \textbf{Add-only}: new blocks are always added, never modified. +\item \textbf{Out-of-order}: new blocks can be added in any order, i.e., + consecutive blocks on a chain are not necessarily added consecutively. They + can arrive in any order and can be interspersed with blocks from other chains. +\item \textbf{Garbage-collected}: blocks in the current chain that become older + than $k$, i.e., there are at least $k$ more recent blocks in the current chain + after them, are copied from the Volatile DB to the Immutable DB, as they move + from the volatile to the immutable part of the chain. After copying them to + the Immutable DB, they can be \emph{garbage collected} from the Volatile DB. + + Blocks that are not part of the chain but are too old to switch to, should + also be garbage collected. +\item \textbf{Overlap}: by allowing an \emph{overlap} of blocks between the + Immutable DB and the Volatile DB, i.e., by delaying garbage collection so that + it does not happen right after copying the block to the Immutable DB, we can + weaken the durability requirement on the Immutable DB. Blocks truncated from + the end of the Immutable DB will likely still be in the Volatile DB, and can + simply be copied again.\todo{done by ChainDB} +\item \textbf{Durability}: similar to the Immutable DB's durability + \emph{non-requirement}, losing a block because of a crash in the middle or + right after appending a block is inconsequential. The block can be downloaded + again. +\item \textbf{Size}: because of garbage collection, there is a bound on the size + of the Volatile DB in terms of blocks: in the order of $k$, which is 2160 for + mainnet (we give a more detailed estimate of the size in + \cref{volatile:implementation:gc}). This makes the size of the Volatile DB + relatively small, allowing for some information to be kept in memory instead + of on disk. +\item \textbf{Reading}: the database should be able to return the block or + header corresponding to the given hash efficiently. Unlike the Immutable DB, + we do not index by slot numbers, as multiple blocks, from different forks, can + have the same slot number. Instead, we use header hashes. +\item \textbf{Queries}: it should be possible to query information about blocks. 
+ For example, we need to be able to efficiently tell which blocks are stored in + the Volatile DB, or construct a path through the Volatile DB connecting a + block to another one by chasing its predecessors.\footnote{Note that + implementing this efficiently using SQL is not straightforward.} Such + operations should produce consistent results, even while blocks are being + added and garbage collected concurrently. +\item \textbf{Recoverability}: because of its small size and it being acceptable + to download missing blocks again, it is not of paramount importance to be able + to recover as many blocks as possible in case of a corruption. + + However, corrupted blocks should be detected and deleted from the Volatile DB. +\item \textbf{Efficient streaming}: while blocks will be streamed from the + Volatile DB, this requirement is not as important as it is for the Immutable + DB. Only a small number of blocks will reside in the Volatile DB, hence fewer + blocks will be streamed. Most commonly, the block at the tip of the chain will + be streamed from the Volatile DB (and possibly some of its predecessors). In + this case, efficiently being able to read a single block will suffice. +\end{itemize} + +\section{API} +\label{volatile:api} + +Before we describe the implementation of the Volatile DB, we first describe its +functionality. The Volatile DB has the following API: + +\begin{lstlisting} +data VolatileDB m blk = VolatileDB { + closeDB :: m () + + , putBlock :: blk -> m () + + , getBlockComponent :: + forall b. + BlockComponent blk b + -> HeaderHash blk + -> m (Maybe b) + + , garbageCollect :: SlotNo -> m () + + , getBlockInfo :: STM m (HeaderHash blk -> Maybe (BlockInfo blk)) + + , filterByPredecessor :: STM m (ChainHash blk -> Set (HeaderHash blk)) + + , getMaxSlotNo :: STM m MaxSlotNo + } +\end{lstlisting} + +The database is parameterised over the block type \lstinline!blk! and the monad +\lstinline!m!, like most of the consensus layer.\todo{mention io-sim} +\todo{TODO} Mention our use of records for components? + +The \lstinline!closeDB! operation closes the database, allowing all opened +resources, including open file handles, to be released. This is typically only +used when shutting down the entire system. Calling any other operation on an +already-closed database should result in an exception. + +The \lstinline!putBlock! operation adds a block to the Volatile DB. There are no +requirements on this block. This operation is idempotent, as duplicate blocks +are ignored. + +The \lstinline!getBlockComponent! operation allows reading one or more +components of the block in the database with the given hash. See +\cref{immutable:api:block-component} for a discussion about block components. As +no block with the given hash might be in the Volatile DB, this operation returns +a \lstinline!Maybe!. + +The \lstinline!garbageCollect! operation will try to garbage collect all blocks +with a slot number less than the given one. This will be called after copying a +block with the given slot number to the Immutable DB. Note that the condition is +``less than'', not ``less than or equal to'', even though after a block with +slot $s$ has become immutable, any other blocks produced in the same slot $s$ +can never be adopted again and can thus safely be garbage collected. 
Moreover,
+the block we have just copied to the Immutable DB will not even be garbage
+collected from the Volatile DB (that will be done after copying its successor
+and triggering a garbage collection for the successor's slot number).
+
+The reason for ``less than'' is because of EBBs (\cref{ebbs}). An EBB has the
+same slot number as its successor. This means that if an EBB has become
+immutable, and we were to garbage collect all blocks with a slot less than or
+\emph{equal} to its slot number, we would garbage collect its successor block
+too, before having copied it to the Immutable DB.
+
+The next two operations, \lstinline!getBlockInfo! and
+\lstinline!filterByPredecessor!, allow querying the Volatile DB. Both operations
+are \lstinline!STM! transactions that return a function. This means that they
+can both be called in the same transaction to ensure they produce results that
+are consistent w.r.t.\ each other.
+
+The \lstinline!getBlockInfo! operation returns a function to look up the
+\lstinline!BlockInfo! corresponding to a block's hash. The \lstinline!BlockInfo!
+data type is defined as follows:
+\begin{lstlisting}
+data BlockInfo blk = BlockInfo {
+      biHash         :: !(HeaderHash blk)
+    , biSlotNo       :: !SlotNo
+    , biBlockNo      :: !BlockNo
+    , biPrevHash     :: !(ChainHash blk)
+    , biIsEBB        :: !IsEBB
+    , biHeaderOffset :: !Word16
+    , biHeaderSize   :: !Word16
+    }
+\end{lstlisting}
+This is similar to the information stored in the Immutable DB's on-disk indices;
+see \cref{immutable:implementation:indices}. However, in this case, the
+information has to be retrieved from an in-memory index, as the function
+returned from the \lstinline!STM! transaction is pure.
+
+The \lstinline!filterByPredecessor! operation returns a function to look up the
+successors of a given \lstinline!ChainHash!. The \lstinline!ChainHash! data type
+is defined as follows:\todo{Explain somewhere else and link?}
+\begin{lstlisting}
+data ChainHash b =
+    GenesisHash
+  | BlockHash !(HeaderHash b)
+\end{lstlisting}
+This extends the header hash type with a case for genesis, which is needed to
+look up the blocks that fit onto genesis. As the Volatile DB can store multiple
+forks, multiple blocks can have the same predecessor, hence a \emph{set} of
+header hashes is returned. This mapping is derived from the ``previous hash''
+stored in each block's header. Consequently, the set will only contain the
+header hashes of blocks that are currently in the Volatile DB. Hence the choice
+of the name \lstinline!filterByPredecessor! instead of the slightly misleading
+\lstinline!getSuccessors!. This operation can be used to efficiently construct a
+path between two blocks in the Volatile DB. Note that only a single access to
+the Volatile DB is needed to retrieve the function, instead of an access
+\emph{per lookup}.
+
+The final operation, \lstinline!getMaxSlotNo!, is also an STM query, returning
+the highest slot number stored in the Volatile DB so far. The
+\lstinline!MaxSlotNo! data type is defined as follows:
+\begin{lstlisting}
+data MaxSlotNo =
+    NoMaxSlotNo
+  | MaxSlotNo !SlotNo
+\end{lstlisting}
+This is used as an optimisation of fragment filtering in the block fetch
+client\todo{link}; see the \lstinline!filterWithMaxSlotNo! function for more
+information.
+
+\section{Implementation}
+\label{volatile:implementation}
+
+We will now give a high-level overview of our custom implementation of the
+Volatile DB that satisfies the requirements and the API.
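+
+Before doing so, the following hedged sketch illustrates how the query
+operations of the API above are meant to be consumed: both \lstinline!STM!
+queries are run in a single transaction, so that the successor set and the
+block information are consistent with each other. The helper name is ours and
+is not part of the API; we only assume the \lstinline!MonadSTM! interface
+(\lstinline!atomically!) used throughout the consensus layer.
+
+\begin{lstlisting}
+import qualified Data.Set as Set
+
+-- Sketch only: list the immediate successors of the given block that are
+-- currently in the Volatile DB, together with their BlockInfo. Both queries
+-- are read in one STM transaction to obtain a consistent view.
+successorsWithInfo ::
+     MonadSTM m
+  => VolatileDB m blk
+  -> HeaderHash blk
+  -> m [(HeaderHash blk, Maybe (BlockInfo blk))]
+successorsWithInfo db hash = atomically $ do
+    lookupInfo <- getBlockInfo        db
+    successors <- filterByPredecessor db
+    return [ (h, lookupInfo h)
+           | h <- Set.toList (successors (BlockHash hash))
+           ]
+\end{lstlisting}
+
+With that, we turn to the implementation itself.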
+ +\begin{itemize} +\item We append each new block, without any extra information before or after + it, to a file. When $x$ blocks have been appended to the file, the file is + closed and a new file is created. + + The smaller $x$, the more files are created.\todo{mention downsides} The + higher $x$, the longer it will take for a block to be garbage collected, as + explained in \cref{volatile:implementation:gc}. The default value for $x$ is + currently \num{1000}. + + For each file, we track the following information: + \begin{lstlisting} + data FileInfo blk = FileInfo { + maxSlotNo :: !MaxSlotNo + , hashes :: !(Set (HeaderHash blk)) + } + \end{lstlisting} + The \lstinline!maxSlotNo! field caches the highest slot number stored in the + file. To compute the global \lstinline!MaxSlotNo!, we simply take the maximum + of these \lstinline!maxSlotNo! fields. + +\item We \emph{do not flush} any writes to disk, as discussed in the + introduction of this chapter. This makes writing a block quite cheap: the + serialised block is copied to an OS buffer, which is then asynchronously + flushed in the background. + +\item Besides tracking some information per file, we also maintain two in-memory + indices to implement the \lstinline!getBlockInfo! and + \lstinline!filterByPredecessor! operations. + + The first index, called the \lstinline!ReverseIndex!\footnote{In a sense, this + is the reverse of the mapping from file to \lstinline!FileInfo!, hence the + name \lstinline!ReverseIndex!.} is defined as follows: + \begin{lstlisting} + type ReverseIndex blk = Map (HeaderHash blk) (InternalBlockInfo blk) + + data InternalBlockInfo blk = InternalBlockInfo { + ibiFile :: !FsPath + , ibiBlockOffset :: !BlockOffset + , ibiBlockSize :: !BlockSize + , ibiBlockInfo :: !(BlockInfo blk) + , ibiNestedCtxt :: !(SomeSecond (NestedCtxt Header) blk) + } + \end{lstlisting} + In addition to the \lstinline!BlockInfo! that \lstinline!getBlockInfo! should + return, we also store in which file the block is stored, the offset in the + file, the size of the block, and the nested context (see + \cref{serialisation:storage:nested-contents}). + + The second index, called the \lstinline!SuccessorsIndex! is defined as + follows: + \begin{lstlisting} + type SuccessorsIndex blk = Map (ChainHash blk) (Set (HeaderHash blk)) + \end{lstlisting} + + Both indices are updated when new blocks are added and when blocks are removed + due to garbage collection, see \cref{volatile:implementation:gc}. + + The \lstinline!Map! type used is a strict ordered map from the standard + \lstinline!containers! package. As for any data that is stored as long-lived + state, we use strict data types to avoid space leaks. We opt for an ordered + map, i.e., a sized balanced binary tree, instead of a hashing-based map to + avoid hash collisions. If an attacker manages to feed us blocks that are + hashed to the same bucket in the hash map, the performance will deteriorate. + An ordered map is not vulnerable to this type of attack. + +\item Besides the mappings we discussed above, the in-memory state of the + Volatile DB consists of the path, file handle, and offset into the file to + which new blocks will be appended. We store this state, a pure data type, in a + \emph{read-append-write lock}, which we discuss in + \cref{volatile:implementation:rawlock}. + +\item To read a block, header, or any other block component from the Volatile + DB, we obtain read access to the state (see + \cref{volatile:implementation:rawlock}) and look up the + \lstinline!InternalBlockInfo! 
corresponding to the hash in the
+  \lstinline!ReverseIndex!. The found \lstinline!InternalBlockInfo! contains the
+  file path, the block offset, and the block size, which is all that is needed
+  to read the block. To read the header, we can use the file path, the block
+  offset, the nested context (see \cref{serialisation:storage:nested-contents}),
+  the header offset, and the header size. The other block components can also be
+  derived from the \lstinline!InternalBlockInfo!.
+
+\item Note that unlike the Immutable DB, the Volatile DB does not maintain CRC32
+  checksums of the stored blocks to detect corruption. Instead, after reading a
+  block from the Volatile DB and before copying it to the Immutable DB, we
+  validate the block using the \lstinline!nodeCheckIntegrity! method, as
+  described in \cref{immutable:implementation:recovery}.
+
+\end{itemize}
+
+\subsection{Garbage collection}
+\label{volatile:implementation:gc}
+
+\todo{TODO} Sync with \cref{chaindb:gc}.
+
+As mentioned above, when a garbage collection for slot $s$ is triggered, all
+blocks with a slot less than $s$ should be removed from the Volatile DB.
+
+For simplicity and following our robust append-only approach, we do not modify
+files in-place during garbage collection. Either all the blocks in a file have a
+slot number less than $s$ and the file can be deleted atomically, or at least
+one block has a slot number greater than or equal to $s$ and we do \emph{not}
+delete the file. Checking whether a file can be garbage collected is simple and
+happens in constant time: the \lstinline!maxSlotNo! field of
+\lstinline!FileInfo! is compared against $s$.
+
+The default for blocks per file is currently \num{1000}. Let us now calculate
+what the effect of this number is on garbage collection. We will call blocks
+with a slot older than $s$ \emph{garbage}. Garbage blocks that can be deleted
+because they are in a file only containing garbage are \emph{collected garbage}.
+Garbage blocks that cannot yet be deleted because there is a non-garbage block
+in the same file are \emph{uncollected garbage}.
+
+The lower the number of blocks per file, the less uncollected garbage there will
+be, and vice versa. In the extreme case, a single block is stored per file,
+resulting in no uncollected garbage, i.e., a garbage collection rate of 100\%.
+The downside is that for each new block that is added, a new file will have to
+be created, which is less efficient than appending to an already open file. It
+will also result in lots of tiny files.
+
+The other extreme is to have no bound on the number of blocks per file, which
+will result in one single file containing all blocks. This means no garbage will
+ever be collected, i.e., a garbage collection rate of 0\%, which is of course
+not acceptable.
+
+During normal operation, roughly one block will be added every 20
+seconds.\footnote{When using the PBFT consensus protocol (\cref{bft}), exactly
+one block will be produced every 20 seconds. However, when using the Praos
+consensus protocol (\cref{praos}), on average there will be one block every 20
+seconds, but it is natural to have a fork now and then, leading to one or more
+extra blocks. For the purposes of this calculation, the difference is
+negligible.} The security parameter $k$ used for mainnet is \num{2160}. This
+means that if a linear chain of \num{2161} blocks has been added, the oldest
+block has become immutable and can be copied to the Immutable DB, after which it
+can be garbage collected.
If we assume no delay between copying and garbage
+collection, it will take $\num{1000} + \num{2160} = \num{3160}$ blocks before
+the first file containing \num{1000} blocks will be garbage collected.
+
+This means that in the above scenario, starting from a Volatile DB containing
+$k$ blocks, after every $\mathsf{blocksPerFile}$ new blocks and thus
+corresponding garbage collections, $\mathsf{blocksPerFile}$ blocks will be
+garbage collected.\todo{expand calculation}
+
+In practice, we allow for overlap by delaying the garbage collection; this has
+an impact on the effective size of the Volatile DB, which we discuss in
+\todo{link ChainDB}.
+
+\subsection{Read-Append-Write lock}
+\label{volatile:implementation:rawlock}
+
+We use a \emph{read-append-write} (RAW) lock to store the state of the Volatile
+DB. This is an extension of the more common read-write lock. A RAW lock allows
+multiple concurrent readers, at most one appender, which is allowed to run
+concurrently with the readers, and at most one writer, which has exclusive
+access to the lock.
+
+The \lstinline!getBlockComponent! operation corresponds to \emph{reading}, the
+\lstinline!putBlock! operation to \emph{appending}, and the
+\lstinline!garbageCollect! operation to \emph{writing}. Adding a new block can
+safely happen at the same time as blocks are being read. The new block will be
+appended to the current file or a new file will be started. This does not affect
+any concurrent reads of other blocks in the Volatile DB. At most one block can
+be added at a time, as blocks are appended one-by-one to the current file. To
+garbage collect the Volatile DB, we must obtain an exclusive lock on the state,
+as we might be deleting a file while trying to read from it at the same time.
+During garbage collection, we ignore the current file and will thus never try to
+delete it. This means that, strictly speaking, it would be possible to safely
+append blocks and garbage collect blocks concurrently. However, for simplicity
+(how should the concurrent changes to the indices be resolved?), we did not
+pursue this.
+
+As mentioned in \cref{volatile:implementation:gc}, it is often the case that no
+files can be garbage collected. As a (premature) optimisation, we first perform
+a cheap check of whether any files can be garbage collected at all before trying
+to obtain the more expensive exclusive lock on the state.
+
+\subsection{Recovery}
+\label{volatile:implementation:recovery}
+
+Whenever a file-system operation fails, or a file is missing or corrupted, we
+shut down the Volatile DB and consequently the whole system. When this happens,
+either the system's file system is no longer reliable (e.g., disk corruption),
+manual intervention (e.g., the disk is full) is required, or there is a bug in
+the system. In all cases, there is no point in trying to continue operating. We
+shut down the system and flag the shutdown as \emph{dirty}, triggering a full
+validation on the next start-up.
+
+When opening the Volatile DB, the previous in-memory state, including the
+indices, is reconstructed based on the on-disk files. The blocks in each file
+are read and deserialised. There are two validation modes: a standard validation
+and a full validation. The difference between the two is that during a full
+validation, the integrity of each block is verified to detect silent corruption
+using the \lstinline!nodeCheckIntegrity! method, as described in
+\cref{immutable:implementation:recovery}.
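+
+As an illustration, the following sketch shows the difference between the two
+validation modes when reading back the blocks of a single file. The names and
+the list-based representation are hypothetical simplifications, not the actual
+implementation:
+\begin{lstlisting}
+-- Sketch: which checks to perform when reopening the Volatile DB.
+data ValidationPolicy =
+    NoValidation   -- standard validation: blocks must deserialise
+  | ValidateAll    -- full validation: also check each block's integrity
+
+-- Sketch: given the parse results of one file (Nothing marks a block that
+-- failed to deserialise) and an integrity check such as nodeCheckIntegrity,
+-- keep the prefix of blocks considered valid.
+validPrefix :: ValidationPolicy -> (blk -> Bool) -> [Maybe blk] -> [blk]
+validPrefix policy checkIntegrity = go
+  where
+    go (Just blk : rest)
+      | ok blk    = blk : go rest
+    go _          = []   -- stop at the first failure (or the end of the file)
+
+    ok blk = case policy of
+      NoValidation -> True
+      ValidateAll  -> checkIntegrity blk
+\end{lstlisting}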
+ +When a block fails to deserialise or it is detected as a corrupt block when the +full validation mode is enabled, the file is truncated to the last valid block +before it. As mentioned at the start of this chapter, it is not crucial to +recover every single block. Therefore, we do not try to deserialise the blocks +after a corrupt one. From 3c7f02677211de88ec0cc2cb0bef727746c1642e Mon Sep 17 00:00:00 2001 From: Thomas Winant Date: Mon, 11 Jan 2021 17:24:16 +0100 Subject: [PATCH 3/5] report: Ledger Database --- .../report/chapters/consensus/protocol.tex | 12 +- .../chapters/consensus/serialisation.tex | 4 +- .../docs/report/chapters/storage/ledgerdb.tex | 376 +++++++++++++++++- .../docs/report/references.bib | 17 + 4 files changed, 397 insertions(+), 12 deletions(-) diff --git a/ouroboros-consensus/docs/report/chapters/consensus/protocol.tex b/ouroboros-consensus/docs/report/chapters/consensus/protocol.tex index 454dce70f50..f4b94980ed9 100644 --- a/ouroboros-consensus/docs/report/chapters/consensus/protocol.tex +++ b/ouroboros-consensus/docs/report/chapters/consensus/protocol.tex @@ -284,12 +284,12 @@ \subsection{Protocol state management} Re-applying previously-validated blocks happens when we are replaying blocks from the immutable database when initialising the in-memory ledger state -(\cref{ledgerdb:initialisation}). It is also useful during chain selection -(\cref{chainsel}): depending on the consensus protocol, we may end up switching -relatively frequently between short-lived forks; when this happens, skipping -expensive checks can improve the performance of the node. \todo{How does this -relate to the best case == worst case thing? Or to the asymptotic -attacker/defender costs?} +(\cref{ledgerdb:on-disk:initialisation}). It is also useful during chain +selection (\cref{chainsel}): depending on the consensus protocol, we may end up +switching relatively frequently between short-lived forks; when this happens, +skipping expensive checks can improve the performance of the node. \todo{How + does this relate to the best case == worst case thing? 
Or to the asymptotic + attacker/defender costs?} \subsection{Leader selection} \label{consensus:class:leaderselection} diff --git a/ouroboros-consensus/docs/report/chapters/consensus/serialisation.tex b/ouroboros-consensus/docs/report/chapters/consensus/serialisation.tex index 7f0af8845df..3025fe39d1c 100644 --- a/ouroboros-consensus/docs/report/chapters/consensus/serialisation.tex +++ b/ouroboros-consensus/docs/report/chapters/consensus/serialisation.tex @@ -47,8 +47,8 @@ \section{Serialising for storage} \begin{itemize} \item Blocks -\item The extended ledger state (\cref{storage:extledgerstate}) which is the - combination of: +\item The extended ledger state (see \cref{storage:extledgerstate} and + \cref{ledgerdb:on-disk}) which is the combination of: \begin{itemize} \item The header state (\cref{storage:headerstate}) \item The ledger state\todo{link?} diff --git a/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex b/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex index 5841c8df397..69ff5a05ec9 100644 --- a/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex +++ b/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex @@ -1,8 +1,376 @@ \chapter{Ledger Database} \label{ledgerdb} -\section{Initialisation} -\label{ledgerdb:initialisation} +The Ledger DB is responsible for the following tasks: -Describe why it is important that we store a single snapshot and then replay -ledger events to construct the ledger DB. +\begin{enumerate} +\item \textbf{Maintaining the ledger state at the tip}: Maintaining the ledger + state corresponding to the current tip in memory. When we try to extend our + chain with a new block fitting onto our tip, the block must first be validated + using the right ledger state, i.e., the ledger state corresponding to the tip. + The current ledger state is needed for various other purposes. + +\item \textbf{Maintaining the past $k$ ledger states}: As discussed in + \cref{consensus:overview:k}, we might roll back up to $k$ blocks when + switching to a more preferable fork. Consider the example below: + % + \begin{center} + \begin{tikzpicture} + \draw (0, 0) -- (50pt, 0) coordinate (I); + \draw (I) -- ++(20pt, 20pt) coordinate (C1) -- ++(20pt, 0) coordinate (C2); + \draw (I) -- ++(20pt, -20pt) coordinate (F1) -- ++(20pt, 0) coordinate (F2) -- ++(20pt, 0) coordinate (F3); + \node at (I) {$\bullet$}; + \node at (C1) {$\bullet$}; + \node at (C2) {$\bullet$}; + \node at (F1) {$\bullet$}; + \node at (F2) {$\bullet$}; + \node at (F3) {$\bullet$}; + \node at (I) [above left] {$I$}; + \node at (C1) [above] {$C_1$}; + \node at (C2) [above] {$C_2$}; + \node at (F1) [below] {$F_1$}; + \node at (F2) [below] {$F_2$}; + \node at (F3) [below] {$F_3$}; + \draw (60pt, 50pt) node {$\overbrace{\hspace{60pt}}$}; + \draw (60pt, 60pt) node[fill=white] {$k$}; + \draw [dashed] (30pt, -40pt) -- (30pt, 45pt); + \end{tikzpicture} + \end{center} + % + Our current chain's tip is $C_2$, but the fork containing blocks $F_1$, $F_2$, + and $F_3$ is more preferable. We roll back our chain to the intersection point + of the two chains, $I$, which must be not more than $k$ blocks back from our + current tip. Next, we must validate block $F_1$ using the ledger state at + block $I$, after which we can validate $F_2$ using the resulting ledger state, + and so on. 
+
+  This means that we need access to all ledger states of the past $k$ blocks,
+  i.e., the ledger states corresponding to the volatile part of the current
+  chain.\footnote{Applying a block to a ledger state is not an invertible
+  operation, so it is not possible to simply ``unapply'' $C_1$ and $C_2$ to
+  obtain $I$.}
+
+  Access to the last $k$ ledger states is not only needed for validating
+  candidate chains, but also by the following components:
+  \begin{itemize}
+  \item \textbf{Local state query server}: To query any of the past $k$ ledger
+    states (\cref{servers:lsq}).
+  \item \textbf{Chain sync client}: To validate headers of a chain that
+    intersects with any of the past $k$ blocks
+    (\cref{chainsyncclient:validation}).
+  \end{itemize}
+
+\item \textbf{Storing on disk}: To obtain a ledger state for the current tip of
+  the chain, one has to apply \emph{all blocks in the chain} one-by-one to the
+  initial ledger state. When starting up the system with an on-disk chain
+  containing millions of blocks, all of them would have to be read from disk and
+  applied. This process can take tens of minutes, depending on the storage and
+  CPU speed, and is thus too costly to perform on each startup.
+
+  For this reason, a recent snapshot of the ledger state should be periodically
+  written to disk. Upon the next startup, that snapshot can be read and used to
+  restore the current ledger state, as well as the past $k$ ledger states.
+\end{enumerate}
+
+Note that whenever we say ``ledger state'', we mean the
+\lstinline!ExtLedgerState blk! type described in \cref{storage:extledgerstate}.
+
+The above duties are divided across the following modules:
+
+\begin{itemize}
+\item \lstinline!LedgerDB.InMemory!: this module defines a pure data structure,
+  named \lstinline!LedgerDB!, to represent the last $k$ ledger states in memory.
+  Operations to validate and append blocks, to switch to forks, to look up
+  ledger states, \ldots{} are provided.
+\item \lstinline!LedgerDB.OnDisk!: this module contains the functionality to
+  write a snapshot of the \lstinline!LedgerDB! to disk and to restore a
+  \lstinline!LedgerDB! from a snapshot.
+\item \lstinline!LedgerDB.DiskPolicy!: this module contains the policy that
+  determines when a snapshot of the \lstinline!LedgerDB! is written to disk.
+\item \lstinline!ChainDB.Impl.LgrDB!: this module is part of the Chain DB, and
+  is responsible for maintaining the pure \lstinline!LedgerDB! in a
+  \lstinline!StrictTVar!.
+\end{itemize}
+
+We will now discuss the modules listed above.
+
+\section{In-memory representation}
+\label{ledgerdb:in-memory}
+
+The \lstinline!LedgerDB!, capable of representing the last $k$ ledger states, is
+an instantiation of the \lstinline!AnchoredSeq! data type. This data type is
+implemented using the \emph{finger tree} data structure~\cite{fingertrees} and
+has the following time complexities:
+
+\begin{itemize}
+\item Appending a new ledger state to the end in constant time.
+\item Rolling back to a previous ledger state in logarithmic time.
+\item Looking up a past ledger state by its point in logarithmic time.
+\end{itemize}
+
+One can think of an \lstinline!AnchoredSeq! as a \lstinline!Seq! from
+\lstinline!Data.Sequence! with a custom \emph{finger tree measure}, allowing for
+efficient lookups by point, combined with an \emph{anchor}. When fully
+\emph{saturated}, the sequence will contain $k$ ledger states. In case of a
+complete rollback of all $k$ blocks and thus ledger states, the sequence will
+become empty.
Even then, a ledger state is still needed, namely the
+one corresponding to the most recent immutable block, which cannot be rolled
+back. The ledger state at the anchor plays this role.
+
+When a new ledger state is appended to a fully saturated \lstinline!LedgerDB!,
+the ledger state at the anchor is dropped and the oldest element in the sequence
+becomes the new anchor, as it has become immutable. This maintains the invariant
+that only the last $k$ ledger states are stored, \emph{excluding} the ledger
+state at the anchor. This means that in practice, $k + 1$ ledger states will be
+kept in memory. When the \lstinline!LedgerDB! contains fewer than $k$ elements,
+new ones are appended without shifting the anchor until it is saturated.
+
+\todo{TODO} figure?
+
+The \lstinline!LedgerDB! is parameterised over the ledger state $l$.
+Conveniently, the \lstinline!LedgerDB! can implement the same abstract interface
+(described in \cref{ledger:api}) that the ledger state itself implements, i.e.,
+the \lstinline!GetTip!, \lstinline!IsLedger!, and \lstinline!ApplyBlock!
+classes. This means that in most places, wherever a ledger state can be used, it
+is also possible to wrap it in a \lstinline!LedgerDB!, causing it to
+automatically maintain a history of the last $k$ ledger states.
+
+\todo{TODO} discuss \lstinline!Ap! and \lstinline!applyBlock!? These are
+actually orthogonal to \lstinline!LedgerDB! and should be separated.
+
+
+\paragraph{Memory usage}
+
+The ledger state is a big data structure that contains, amongst other things,
+the entire UTxO. Recent measurements\footnote{Using the ledger state at the
+block with slot number \num{16976076} and hash \lstinline!af0e6cb8ead39a86!.}
+show that the heap size of an Allegra ledger state is around \num{361}~MB.
+Fortunately, storing $k = \num{2160}$ ledger states in memory does \emph{not}
+require $\num{2160} * \num{361}~\textrm{MB} = \num{779760}~\textrm{MB} =
+\num{761}~\textrm{GB}$. The ledger state is defined using standard Haskell data
+structures, e.g., \lstinline!Data.Map.Strict!, which are \emph{persistent} data
+structures. This means that when we update a ledger state by applying a block to
+it, we only need extra memory for the new and the modified data. The majority of
+the data will stay the same and will be \emph{shared} with the previous ledger
+state.
+
+The memory required for storing the last $k$ ledger states is thus proportional
+to the size of the oldest in-memory ledger state \emph{plus} the changes caused
+by the last $k$ blocks, e.g., the number of transactions in those blocks.
+Compared to the \num{361}~MB required for a single ledger state, keeping the
+last $k$ ledger states in memory requires only \num{375}~MB in total. This is
+only \num{14}~MB or 3.8\% more memory, which is a very small cost.
+
+\paragraph{Past design}
+
+In the past, before measuring this cost, we did not keep all $k$ past ledger
+states because of an ungrounded fear of the extra memory usage. The
+\lstinline!LedgerDB! data structure had a \lstinline!snapEvery! parameter,
+ranging from 1 to $k$, indicating that a snapshot, i.e., a ledger state, should
+be kept every \lstinline!snapEvery! ledger states or blocks. In practice, a
+value of 100 was used for this parameter, resulting in 21--22 ledger states in
+memory.
+
+The representation was much more complex, to account for these missing ledger
+states. More importantly, accessing a past ledger state or rewinding the
+\lstinline!LedgerDB! to a past ledger state had a very different cost model.
As +the requested ledger state might not be in memory, it would have to be +\emph{reconstructed} by reapplying blocks to an older ledger state. + +Consider the example below using \lstinline!snapEvery! = 3. $L_i$ indicate +ledger states and $\emptyset_i$ indicate skipped ledger states. $L_0$ corresponds to the +most recent ledger state, at the tip of the chain. +% +\begin{center} +\begin{tikzpicture} +\draw (0, 0) -- (8, 0); +\draw (0, 1) -- (8, 1); + +\draw (1, 0) -- (1, 1); +\draw (2, 0) -- (2, 1); +\draw (3, 0) -- (3, 1); +\draw (4, 0) -- (4, 1); +\draw (5, 0) -- (5, 1); +\draw (6, 0) -- (6, 1); +\draw (7, 0) -- (7, 1); +\draw (8, 0) -- (8, 1); + +\draw (0.5, 0.5) node {\small \ldots}; +\draw (1.5, 0.5) node {\small $L_6$}; +\draw (2.5, 0.5) node {\small $\emptyset_5$}; +\draw (3.5, 0.5) node {\small $\emptyset_4$}; +\draw (4.5, 0.5) node {\small $L_3$}; +\draw (5.5, 0.5) node {\small $\emptyset_2$}; +\draw (6.5, 0.5) node {\small $\emptyset_1$}; +\draw (7.5, 0.5) node {\small $L_0$}; + +\end{tikzpicture} +\end{center} +% +When we need access to the ledger state at position $3$, we are in luck and can +use the available $L_3$. However, when we need access to the skipped ledger +state at position $1$, we have to do the following: we look for the most recent +ledger state before $\emptyset_1$, i.e., $L_3$. Next, we need to reapply blocks $B_2$ +and $B_1$ to it, which means we have to read those from disk, deserialise them, +and apply them again. + +This means that accessing a past ledger state is not a pure operation and might +require disk access and extra computation. Consequently, switching to a fork +might require reading and revalidating blocks that remain part of the chain, in +addition to the new blocks. + +As mentioned at the start of this chapter, the chain sync client also needs +access to past ledger view (\cref{consensus:class:ledgerview}), which it can +obtain from past ledger states. A malicious peer might try to exploit it and +create a chain that intersects with our chain right \emph{before} an in-memory +ledger state snapshot. In the worst case, we have to read and reapply +\lstinline!snapEvery! - 1 = 99 blocks. This is not acceptable as the costs are +asymmetric and in the advantage of the attacker, i.e., creating and serving such +a header is much cheaper than reconstructing the required snapshot. At the time, +we solved this by requiring ledger states to store snapshots of past ledger +views. The right past ledger view could then be obtained from the current ledger +state, which was always available in memory. However, storing snapshots of +ledger views within a single ledger state is more complex, as we are in fact +storing snapshots \emph{within} snapshots. The switch to keep all $k$ past +ledger states significantly simplified the code and sped up the look-ups. + +\paragraph{Future design} + +It is important to note that in the future, this design will have to change +again. The UTxO and, consequently, the ledger state are expected to grow in size +organically. This growth will be accelerated by new features added to the +ledger, e.g., smart contracts. At some point, the ledger state will be so large +that keeping it in its entirety in memory will no longer be feasible. Moreover, +the cost of generating enough transactions to grow the current UTxO beyond the +expected memory limit might be within reach for some attackers. 
Such an attack +might cause a part of the network to be shut down because the nodes in question +are no longer able to load the ledger state in memory without running against +the memory limit. + +For these reasons, we plan to revise our design in the future, and start storing +parts of the ledger state on disk again. + +\section{On-disk} +\label{ledgerdb:on-disk} + +The \lstinline!LedgerDB.OnDisk! module provides functions to write a ledger +state to disk and read a ledger state from disk. The \lstinline!EncodeDisk! and +\lstinline!DecodeDisk! classes from \cref{serialisation:storage} are used to +(de)serialise the ledger state to or from CBOR. Because of its large size, we +read and deserialise the snapshot incrementally. + +\todo{TODO} which ledger state to take a snapshot from is determined by the +Chain DB. I.e., the background thread that copies blocks from the Volatile DB to +the Immutable DB will call the \lstinline!onDiskShouldTakeSnapshot! function, +and if it returns \lstinline!True!, a snapshot will be taken. \todo{TODO} +double-check whether we're actually taking a snapshot of the right ledger state. + +\subsection{Disk policy} +\label{ledgerdb:on-disk:disk-policy} + +The disk policy determines how many snapshots should be stored on disk and when +a new snapshot of the ledger state should be written to disk. + +\todo{TODO} worth discussing? We would just be duplicating the existing +documentation. + +\subsection{Initialisation} +\label{ledgerdb:on-disk:initialisation} + +During initialisation, the goal is to construct an initial \lstinline!LedgerDB! +that is empty except for the ledger state at the anchor, which has to correspond +to the immutable tip, i.e., the block at the tip of the Immutable DB +(\cref{immutable}). + +Ideally, we can construct the initial \lstinline!LedgerDB! from a snapshot of +the ledger state that we wrote to disk. Remember that updating a ledger state +with a block is not inversible: we can apply a block to a ledger state, but we +cannot ``unapply'' a block to a ledger state. This means the snapshot has to be +at least as old as the anchor. A snapshot matching the anchor can be used as is. +A snapshot older than the anchor can be used after reapplying the necessary +blocks. A snapshot newer than the anchor can \emph{not} be used, as we cannot +unapply blocks to get the ledger state corresponding to the anchor. This is the +reason why we only take snapshots of an immutable ledger state, i.e., of the +anchor of the \lstinline!LedgerDB! (or older). + +Constructing the initial \lstinline!LedgerDB! proceeds as follows: +\begin{enumerate} +\item If any on-disk snapshots are available, we try them from new to old. The + newer the snapshot, the fewer blocks will need to be reapplied. +\item We deserialise the snapshot. If this fails, we try the next one. +\item If the snapshot is of the ledger state corresponding to the immutable tip, + we can use the snapshot for the anchor of the \lstinline!LedgerDB! and are + done. +\item If the snapshot is newer than the immutable tip, we cannot use it and try + the next one. This situation can arise not because we took a snapshot of a + ledger state newer than the immutable tip, but because the Immutable DB was + truncated. +\item If the snapshot is older than the immutable tip, we will have to reapply + the blocks after the snapshot to obtain the ledger state at the immutable tip. 
+ If there is no (more) snapshot to try, we will have to reapply \emph{all + blocks} starting from the beginning of the chain to obtain the ledger state at + the immutable tip, i.e., the entire immutable chain. The blocks to reapply are + streamed from the Immutable DB, using an iterator + (\cref{immutable:api:iterators}). + + Note that we can \emph{reapply} these blocks, which is quicker than applying + them (see \cref{ledgerdb:lgrdb}), as the existence of a snapshot newer than + these blocks proves\footnote{Unless the on-disk database has been tampered + with, but this is not an attack we intend to protect against, as this would + mean the machine has already been compromised.} that they have been + successfully applied in the past. +\end{enumerate} +% +Reading and applying blocks is costly. Typically, very few blocks need to be +reapplied in practice. However, there is one exception: when the serialisation +format of the ledger state changes, all snapshots (written using the old +serialisation format) will fail to deserialise, and all blocks starting from +genesis will have to be reapplied. To mitigate this, the ledger state decoder is +typically written in a backwards-compatible way, i.e., it accepts both the old +and new serialisation format. + +\section{Maintained by the Chain DB} +\label{ledgerdb:lgrdb} + +The \lstinline!LedgerDB! is a pure data structure. The Chain DB (see +\cref{chaindb}) maintains the current \lstinline!LedgerDB! in a +\lstinline!StrictTVar!. The most recent element in the \lstinline!LedgerDB! is +the current ledger state. Because it is stored in a \lstinline!StrictTVar!, the +current ledger state can be read and updated in the same \lstinline!STM! +transaction as the current chain, which is also stored in a +\lstinline!StrictTVar!. + +The \lstinline!ChainDB.Impl.LgrDB!\footnote{In the past, we had similar modules +for the \lstinline!VolatileDB! and \lstinline!ImmutableDB!, i.e., +\lstinline!VolDB! and \lstinline!ImmDB!. The former were agnostic of the +\lstinline!blk! type and the latter instantiated the former with the +\lstinline!blk! type. However, in hindsight, unifying the two proved to be +simpler and was thus done. The reason why a separate \lstinline!LgrDB! still +exists is mainly because it needs to wrap the pure \lstinline!LedgerDB! in a +\lstinline!StrictTVar!.} is responsible for maintaining the current ledger +state. Besides this responsibility, it also integrates the Ledger DB with other +parts of the Chain DB. + +Moreover, it remembers which blocks have been successfully applied in the past. +When such a block needs to be validated again, e.g., because we switch again to +the same fork containing the block, we can \emph{reapply} the block instead of +\emph{applying} it (see \cref{ledger:api:ApplyBlock}). Because the block has +been successfully applied in the past, we know the block is valid, which means +we can skip some of the more expensive checks, e.g., checking the hashes, +speeding up the process of validating the block. Note that a block can only be +applied to a single ledger state, i.e., the ledger state corresponding to the +predecessor of the block. Consequently, it suffices to remember whether a block +was valid or not, there is no need to remember with respect to which ledger +state it was valid. + +To remember which blocks have been successfully applied in the past, we store +the points of the respective blocks in a set. 
Before validating a block, we look +up its point in the set, when present, we can reapply the block instead of +applying it. To stop this set from growing without bound, we garbage collect it +the same way the Volatile DB is garbage collected, see \cref{chaindb:gc}. When a +block has a slot older than the slot number of the most recent immutable block, +either the block is already immutable or it is part of a fork that we will never +consider again, as it forks off before the immutable block.\todo{slot number vs + block number} The block in question will never have to be validated again, and +so it is not necessary to remember whether we have already applied it or not. diff --git a/ouroboros-consensus/docs/report/references.bib b/ouroboros-consensus/docs/report/references.bib index adec80cb4f8..c010a1325f8 100644 --- a/ouroboros-consensus/docs/report/references.bib +++ b/ouroboros-consensus/docs/report/references.bib @@ -105,3 +105,20 @@ @misc{buterin2020combining archivePrefix={arXiv}, primaryClass={cs.CR} } + +@article{fingertrees, + author = {Hinze, Ralf and Paterson, Ross}, + title = {Finger Trees: A Simple General-Purpose Data Structure}, + year = {2006}, + issue_date = {March 2006}, + publisher = {Cambridge University Press}, + address = {USA}, + volume = {16}, + number = {2}, + issn = {0956-7968}, + doi = {10.1017/S0956796805005769}, + journal = {J. Funct. Program.}, + month = mar, + pages = {197–217}, + numpages = {21} +} From ea72d29cf0063934e096a684970ff5875914dcd4 Mon Sep 17 00:00:00 2001 From: Thomas Winant Date: Thu, 14 Jan 2021 10:32:23 +0100 Subject: [PATCH 4/5] report: add some sections to the Chain Database chapter --- .../docs/report/chapters/storage/chaindb.tex | 191 ++++++++++++++++++ .../chapters/storage/chainselection.tex | 5 +- .../docs/report/chapters/storage/ledgerdb.tex | 2 + 3 files changed, 196 insertions(+), 2 deletions(-) diff --git a/ouroboros-consensus/docs/report/chapters/storage/chaindb.tex b/ouroboros-consensus/docs/report/chapters/storage/chaindb.tex index 25af08d8592..7bbb0ab215f 100644 --- a/ouroboros-consensus/docs/report/chapters/storage/chaindb.tex +++ b/ouroboros-consensus/docs/report/chapters/storage/chaindb.tex @@ -3,6 +3,196 @@ \chapter{Chain Database} TODO\todo{TODO}: This is currently a disjoint collection of snippets. +\section{Union of the Volatile DB and the Immutable DB} +\label{chaindb:union} + +As discussed in \cref{storage:components}, the blocks in the Chain DB are +divided between the Volatile DB (\cref{volatile}) and the Immutable DB +(\cref{immutable}). Yet, it presents a unified view of the two databases. +Whereas the Immutable DB only contains the immutable chain and the Volatile DB +the volatile \emph{parts} of multiple forks, by combining the two, the Chain DB +contains multiple forks. + +\subsection{Looking up blocks} +\label{chaindb:union:lookup} + +Just like the two underlying databases the Chain DB allows looking up a +\lstinline!BlockComponent! of a block by its point. By comparing the slot number +of the point to the slot of the immutable tip, we could decide in which database +to look up the block. However, this would not be correct: the point might have a +slot older than the immutable tip, but refer to a block not in the Immutable DB, +i.e., a block on an older fork. 
More importantly, there is a potential race
+condition: between the time at which the immutable tip was retrieved and the
+time the block is retrieved from the Volatile DB, the block might have been
+copied to the Immutable DB and garbage collected from the Volatile DB, resulting
+in a false negative. Nevertheless, the overlap between the two makes this
+scenario very unlikely.
+
+For these reasons, we look up a block in the Chain DB as follows. We first look
+up the given point in the Volatile DB. If the block is not in the Volatile DB,
+we fall back to the Immutable DB. This means that if, at the same time, a block
+is copied from the Volatile DB to the Immutable DB and garbage collected from
+the Volatile DB, we will still find it in the Immutable DB. Note that failed
+lookups in the Volatile DB are cheap, as no disk access is required.
+
+\subsection{Iterators}
+\label{chaindb:union:iterators}
+
+Similar to the Immutable DB (\cref{immutable:api:iterators}), the Chain DB
+allows streaming blocks using iterators. We only support streaming blocks from
+the current chain or from a recent fork. We \emph{do not} support streaming from
+a fork that starts before the current immutable tip, as these blocks are likely
+to be garbage collected soon. Moreover, it is of no use to us to serve another
+node blocks from a fork we discarded.
+
+We might have to stream blocks from the Immutable DB, the Volatile DB, or from
+both. If the end bound is older than or equal to the immutable tip, we simply
+try to open an Immutable DB iterator with the given bounds. If the end bound is
+newer than the immutable tip, we construct a path of points (see
+\lstinline!filterByPredecessor! in \cref{volatile:api}) connecting the end bound
+to the start bound. This path is either entirely in the Volatile DB or it is
+partial because a block is missing from the Volatile DB. If the missing block is
+the tip of the Immutable DB, we will have to stream from the Immutable DB in
+addition to the Volatile DB. If the missing block is not the tip of the
+Immutable DB, we consider the range to be invalid. In other words, we allow
+streaming from both databases, but only if the immutable tip is the transition
+point between the two; it cannot be a block before the tip, as that would mean
+the fork is too old.
+
+\todo{TODO} Image?
+
+To stream blocks from the Volatile DB, we maintain the constructed path of
+points as a list in memory and look up the corresponding block (component) in
+the Volatile DB one by one.
+
+Consider the following scenario: we open a Chain DB iterator to stream the
+beginning of the current volatile chain, i.e., the blocks in the Volatile DB
+right after the immutable tip. However, before streaming the iterator's first
+block, we switch to a long fork that forks off all the way back at our immutable
+tip. If that fork is longer than the previous chain, blocks from the start of
+our chain will be copied from the Volatile DB to the Immutable DB,\todo{link}
+advancing the immutable tip. This means the blocks the iterator will stream are
+now part of a fork older than $k$. In this new situation, we would not allow
+opening an iterator with the same range as the already-opened iterator. However,
+we do allow streaming these blocks using the already opened iterator, as the
+blocks to stream are unlikely to have already been garbage collected.
+Nevertheless, it is still theoretically possible\footnote{This is unlikely, as
+there is a delay between copying and garbage collection (see
+\cref{chaindb:gc:delay}) and there are network time-outs on the block fetch
+protocol, the server side of which (see \cref{servers:blockfetch}) is the
+primary user of Chain DB iterators.} that such a block has already been garbage
+collected. For this reason, the Chain DB extends the Immutable DB's
+\lstinline!IteratorResult! type (see \cref{immutable:api:iterators}) with the
+\lstinline!IteratorBlockGCed! constructor:
+%
+\begin{lstlisting}
+data IteratorResult blk b =
+    IteratorExhausted
+  | IteratorResult b
+  | IteratorBlockGCed (RealPoint blk)
+\end{lstlisting}
+
+There is another scenario to consider: we stream the blocks from the start of
+the current volatile chain, just like in the previous scenario. However, in this
+case, we do not switch to a fork, but our chain is extended with new blocks,
+which means blocks from the start of our volatile chain are copied from the
+Volatile DB to the Immutable DB. If these blocks have been copied and garbage
+collected before the iterator is used to stream them from the Volatile DB (which
+is unlikely, as explained in the previous scenario), the iterator will
+incorrectly yield \lstinline!IteratorBlockGCed!. Instead, when a block that was
+planned to be streamed from the Volatile DB is missing, we first look in the
+Immutable DB for the block in case it has been copied there. After the block
+copied to the Immutable DB has been streamed, we continue with the remaining
+blocks to stream from the Volatile DB. It might be the case that the next block
+has also been copied and garbage collected, requiring another switch to the
+Immutable DB. In the theoretical worst case, we have to switch between the two
+databases for each block, but this is extremely unlikely to happen in practice.
+
+\subsection{Followers}
+\label{chaindb:union:followers}
+
+In addition to iterators, the Chain DB also supports \emph{followers}. Unlike an
+iterator, which is used to request a static segment of the current chain or a
+recent fork, a follower is used to follow the \emph{current chain}, either from
+the start or from a suggested more recent point. Unlike iterators, followers are
+dynamic: they will follow the chain when it grows or forks. A follower is
+pull-based, just like its primary user, the chain sync server (see
+\cref{servers:chainsync}). This avoids the need to have a growing queue of
+changes to the chain on the server side in case the client side is slower.
+
+The API of a follower is as follows:
+%
+\begin{lstlisting}
+data Follower m blk a = Follower {
+      followerInstruction         :: m (Maybe (ChainUpdate blk a))
+    , followerInstructionBlocking :: m (ChainUpdate blk a)
+    , followerForward             :: [Point blk] -> m (Maybe (Point blk))
+    , followerClose               :: m ()
+    }
+\end{lstlisting}
+%
+The \lstinline!a! parameter is the same \lstinline!a! as the one in
+\lstinline!BlockComponent! (see \cref{immutable:api:block-component}), as a
+follower for any block component \lstinline!a! can be opened.
+
+A follower always has an implicit position associated with it. The
+\lstinline!followerInstruction! operation and its blocking variant allow
+requesting the next instruction w.r.t.\ the follower's implicit position, i.e.,
+a \lstinline!ChainUpdate!:
+%
+\begin{lstlisting}
+data ChainUpdate block a =
+    AddBlock a
+  | RollBack (Point block)
+\end{lstlisting}
+%
+The \lstinline!AddBlock!
constructor indicates that to follow the current chain, +the follower should extend its chain with the given block (component). Switching +to a fork is represented by first rolling back to a certain point +(\lstinline!RollBack!), followed by at least as many new blocks +(\lstinline!AddBlock!) as blocks that have been rolled back. If we were to +represent switching to a fork using a constructor like: +% +\begin{lstlisting} + | SwitchToFork (Point block) [a] +\end{lstlisting} +% +we would need to have many blocks or block components in memory at the same +time. + +These operations are implemented as follows. In case the follower is looking at +the immutable part of the chain, an Immutable DB iterator is used and no +rollbacks will be encountered. When the follower has advanced into the volatile +part of the chain, the in-memory fragment containing the last $k$ headers is +used (see \cref{storage:inmemory}). Depending on the block component, the +corresponding block might have to be read from the Volatile DB. + +When a new chain has been adopted during chain selection (see +\cref{chainsel:addblock}), all open followers that are looking at the part of +the current chain that was rolled back are updated so that their next +instruction will be the correct \lstinline!RollBack!. By definition, followers +looking at the immutable part of the chain will be unaffected. + +By default, a follower will start from the very start of the chain, i.e., at +genesis. Accordingly, the first instruction will be an \lstinline!AddBlock! with +the very first block of the chain. As mentioned, the primary user of a follower +is the chain sync server, of which the clients in most cases already have large +parts of the chain. The \lstinline!followerForward! operation can be used in +these cases to find a more recent intersection from which the follower can +start. The client will sent a few recent points from its chain and the follower +will try to find the most recent of them that is on our current chain. This is +implemented by looking up blocks by their point in the current chain fragment +and the Immutable DB. + +Followers are affected by garbage collection similarly to how iterators are +(\cref{chaindb:union:iterators}): when the implicit position of the follower is +in the immutable part of the chain, an Immutable DB iterator with a static range +is used. Such an iterator is not aware of blocks appended to the Immutable DB +since the iterator was opened. This means that when the iterator reaches its +end, we first have to check whether more blocks have been appended to the +Immutable DB. If so, a new iterator is opened to stream these blocks. If not, we +switch over to the in-memory fragment. + \section{Block processing queue} \label{chaindb:queue} @@ -100,6 +290,7 @@ \section{Garbage collection} refer here, though, not to the vol DB chapter. \subsection{GC delay} +\label{chaindb:gc:delay} For performance reasons neither the immutable DB nor the volatile DB ever makes explicit \lstinline!fsync! calls to flush data to disk. 
This means that when the diff --git a/ouroboros-consensus/docs/report/chapters/storage/chainselection.tex b/ouroboros-consensus/docs/report/chapters/storage/chainselection.tex index 07656280089..e61504b3289 100644 --- a/ouroboros-consensus/docs/report/chapters/storage/chainselection.tex +++ b/ouroboros-consensus/docs/report/chapters/storage/chainselection.tex @@ -344,8 +344,9 @@ \section{Initialisation} \item \label{chaindb:init:imm} -Initialise the immutable database, determine its tip $I$, and ask the -ledger DB for the corresponding ledger state $L$. +Initialise the immutable database, determine its tip $I$, and ask the ledger DB +for the corresponding ledger state $L$ (see +\cref{ledgerdb:on-disk:initialisation}). \item Compute the set of candidates anchored at the immutable database's tip \label{chaindb:init:compute} diff --git a/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex b/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex index 69ff5a05ec9..e1f47e176d5 100644 --- a/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex +++ b/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex @@ -333,6 +333,8 @@ \subsection{Initialisation} \section{Maintained by the Chain DB} \label{ledgerdb:lgrdb} +\todo{TODO} move to Chain DB chapter? + The \lstinline!LedgerDB! is a pure data structure. The Chain DB (see \cref{chaindb}) maintains the current \lstinline!LedgerDB! in a \lstinline!StrictTVar!. The most recent element in the \lstinline!LedgerDB! is From f2a027da5a8f787d09d6b9be95466ca83f5cfc02 Mon Sep 17 00:00:00 2001 From: Thomas Winant Date: Fri, 15 Jan 2021 12:08:37 +0100 Subject: [PATCH 5/5] report: Start the Mempool chapter --- .../docs/report/chapters/storage/mempool.tex | 79 ++++++++++++++++++- 1 file changed, 78 insertions(+), 1 deletion(-) diff --git a/ouroboros-consensus/docs/report/chapters/storage/mempool.tex b/ouroboros-consensus/docs/report/chapters/storage/mempool.tex index 5feadb7221e..af4c2d7353d 100644 --- a/ouroboros-consensus/docs/report/chapters/storage/mempool.tex +++ b/ouroboros-consensus/docs/report/chapters/storage/mempool.tex @@ -1,7 +1,84 @@ \chapter{Mempool} \label{mempool} +Whenever a block producing node is the leader of a slot +(\cref{consensus:class:leaderselection}), it gets the chance to mint a block. +For the Cardano blockchain to be useful, the minted block in the blockchain +needs to contain \emph{transactions}. The \emph{mempool} is where we buffer +transactions until we are able to mint a block containing those transactions. + +Transactions created by the user using the wallet enter the Mempool via the +local transaction submission protocol (see \cref{servers:txsubmission}). As not +every user will be running a block producing node or stakepool, these +transactions should be broadcast over the network so that other, block +producing, nodes can include these transactions in their next block, in order +for the transactions to ends up in the blockchain as soon as possible. This is +accomplished by the node-to-node transaction submission protocol\todo{link?}, +which exchanges the transactions between the mempool of the nodes in the +network. + +Naturally, we only want to put transactions in a block that are valid +w.r.t.\ the ledger state against which the block will be applied. Putting +invalid transactions in a block will result in an invalid block, which will be +rejected by other nodes. Consequently, the block along with its rewards is lost. 
+Even for a node that is not a block producer, there is no point in flooding the
+network with invalid transactions. For these reasons, we validate the
+transactions in the mempool w.r.t.\ the current ledger state and remove
+transactions that are no longer valid.
+
 \section{Consistency}
 \label{mempool:consistency}
 
-Discuss that we insist on \emph{linear consistency}, and why.
+Transactions themselves affect the ledger state; consequently, the order in
+which transactions are applied matters. For example, two transactions might try
+to consume the same UTxO entries. The first of the two transactions to be
+applied will be valid; the second will be invalid. Transactions can also depend
+on each other, hence the transactions that are depended upon should be applied
+first. Consequently, the mempool needs to decide how transactions are ordered.
+
+We chose a simple approach: we maintain a list of transactions, ordered by the
+time at which they arrived. This has the following advantages:
+
+\begin{itemize}
+\item It's simple to implement and it's efficient. In particular, no search for
+  a valid subset is ever required.
+\item When minting a block, we can simply take the longest possible prefix of
+  transactions that fits in a block.
+\item It supports wallets that submit dependent transactions (where a later
+  transaction depends on outputs from earlier ones).
+\end{itemize}
+
+We call this \emph{linear consistency}: transactions are ordered linearly and
+each transaction is valid w.r.t.\ the transactions before it and the ledger
+state against which the mempool was validated.
+
+The mempool has a background thread that watches the current ledger state
+exposed by the Chain DB (\cref{chaindb}). Whenever it changes, the mempool will
+revalidate its contents w.r.t.\ that ledger state. This ensures that we no
+longer keep broadcasting invalid transactions and that the next time we get to
+mint a block, we do not have to validate a bunch of invalid transactions,
+costing us crucial time.
+
+\section{Caching}
+
+The mempool caches the ledger state resulting from applying all the transactions
+in the mempool to the current ledger state. This makes it quick and easy to
+validate incoming transactions: they can simply be validated against the cached
+ledger state without having to recompute it for each transaction. As discussed
+in \cref{ledgerdb:in-memory}, the memory cost of this is minimal. When the
+incoming transaction is valid w.r.t.\ the cached ledger state, we append the
+transaction to the mempool and cache the resulting ledger state.
+
+\todo{TODO} talk about the slot for which we produce
+
+\section{TxSeq}
+
+\todo{TODO} efficiently get the first $x$ transactions that fit into the given size
+
+\todo{TODO} discuss \lstinline!TicketNo!
+
+\section{Capacity}
+
+\todo{TODO} discuss dynamic capacity, based on twice the max block (body?) size in the protocol parameters in the ledger
+\todo{TODO} add transactions one-by-one for better concurrency and less revalidation in case of retries
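+
+To make the intended \lstinline!TxSeq! operation a bit more concrete, the
+following sketch shows how the longest prefix of transactions fitting within a
+given capacity could be selected. The name and the plain-list representation
+are hypothetical; the real \lstinline!TxSeq! is expected to support this more
+efficiently:
+\begin{lstlisting}
+-- Sketch: split a list of transactions into the longest prefix whose
+-- cumulative size fits within the given capacity, and the remainder.
+takeTxsUpToCapacity :: (tx -> Integer)  -- assumed size measure per transaction
+                    -> Integer          -- capacity, e.g. the max block body size
+                    -> [tx]
+                    -> ([tx], [tx])
+takeTxsUpToCapacity txSize capacity = go 0 []
+  where
+    go _   acc []         = (reverse acc, [])
+    go cur acc (tx : txs)
+      | cur' <= capacity  = go cur' (tx : acc) txs
+      | otherwise         = (reverse acc, tx : txs)
+      where
+        cur' = cur + txSize tx
+\end{lstlisting}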