From ac335c91cc94aca0ad331dd0d52fcc9a54d2c395 Mon Sep 17 00:00:00 2001
From: Thomas Winant
Date: Thu, 31 Dec 2020 09:49:53 +0100
Subject: [PATCH 1/5] report: Immutable Database

---
 .../report/chapters/storage/immutabledb.tex | 922 ++++++++++++++++++
 ouroboros-consensus/docs/report/report.tex  |   1 +
 2 files changed, 923 insertions(+)

diff --git a/ouroboros-consensus/docs/report/chapters/storage/immutabledb.tex b/ouroboros-consensus/docs/report/chapters/storage/immutabledb.tex
index 54e0174c45b..61ee9eb3e73 100644
--- a/ouroboros-consensus/docs/report/chapters/storage/immutabledb.tex
+++ b/ouroboros-consensus/docs/report/chapters/storage/immutabledb.tex
@@ -1,2 +1,924 @@
+\newcommand{\chunkNumber}[1]{\ensuremath{\mathsf{chunkNumber}(#1)}}
+\newcommand{\relativeSlot}[2]{\ensuremath{\mathsf{relativeSlot}(#1, #2)}}
+
 \chapter{Immutable Database}
 \label{immutable}
+
+The Immutable DB is tasked with storing the blocks that are part of the
+\emph{immutable} part of the chain. Because of the nature of this task, the
+requirements and \emph{non-requirements} of this component are fairly specific:
+
+\begin{itemize}
+\item \textbf{Append-only}: as it represents the immutable chain, blocks will
+  only be appended, in the same order as they occur in the chain. Blocks will
+  never be \emph{modified} or \emph{deleted}.
+\item \textbf{Reading}: the database should be able to return the block or
+  header stored at a given \emph{point} (combination of slot number and hash)
+  efficiently.\todo{define point somewhere?}
+\item \textbf{Efficient streaming}: when serving blocks or headers to other
+  nodes, we need to be able to stream ranges of \emph{consecutive} blocks or
+  headers efficiently. As described in \cref{serialisation:network:serialised},
+  it should be possible to stream \emph{raw} blocks and headers, without
+  serialising them.
+\item \textbf{Recoverability}: it must be possible to validate the blocks stored
+  in the database. When a block in the database is corrupt or missing, it is
+  sufficient to truncate the database, representing an immutable chain, to the
+  last valid block before the corrupt or missing block. The truncated blocks can
+  simply be downloaded again. It is therefore not necessary to be able to
+  recover the full database when blocks are missing.
+\end{itemize}
+
+While we already touched upon some of the non-requirements above, it is useful
+to highlight the following ones explicitly.
+
+\begin{itemize}
+\item \textbf{Queries}: besides looking up a single block by its point and
+  streaming ranges of consecutive blocks, the database does \emph{not} have to
+  be able to answer queries about blocks. No searching or filtering is needed.
+\item \textbf{Durability}: the system does not require the durability guarantee
+  that traditional database systems provide (the D in ACID). If the system
+  crashes right after appending a block, it is acceptable that the block in
+  question is truncated when recovering the database. Because of the overlap
+  with the Volatile DB\todo{link}, such a truncation is even likely to go
+  unnoticed. In the worst case, the truncated blocks can simply be downloaded
+  again.
+\end{itemize}
+
+Because of the specific requirements and non-requirements listed above, we
+decided to write our own implementation, the \lstinline!ImmutableDB!, instead of
+using an existing off-the-shelf database system. Traditional database systems
+provide guarantees that are not needed and, conversely, do not take advantage of
+the requirements to optimise certain operations.
For example, there is no need +for a journal or flushing (\lstinline!fsync!) the buffers after each write +because of our unique durability and recoverability (non-)requirements. + +\section{API} +\label{immutable:api} + +Before we describe the implementation of the Immutable DB, we first describe its +functionality. The Immutable DB has the following API: + +\begin{lstlisting} +data ImmutableDB m blk = ImmutableDB { + closeDB :: m () + + , getTip :: STM m (WithOrigin (Tip blk)) + + , appendBlock :: blk -> m () + + , getBlockComponent :: + forall b. + BlockComponent blk b + -> RealPoint blk + -> m (Either (MissingBlock blk) b) + + , stream :: + forall b. + ResourceRegistry m + -> BlockComponent blk b + -> StreamFrom blk + -> StreamTo blk + -> m (Either (MissingBlock blk) (Iterator m blk b)) + } +\end{lstlisting} + +The database is parameterised over the block type \lstinline!blk! and the monad +\lstinline!m!, like most of the consensus layer.\todo{mention io-sim} +\todo{TODO} Mention our use of records for components? + +The \lstinline!closeDB! operation closes the database, allowing all opened +resources, including open file handles, to be released. This is typically only +used when shutting down the entire system. Calling any other operation on an +already-closed database should result in an exception. + +The \lstinline!getTip! operation returns the current tip of the Immutable DB. +The \lstinline!Tip! type contains information about the block at the tip like +the slot number, the block number, the hash, etc. The \lstinline!WithOrigin! +type is isomorphic to \lstinline!Maybe! and is used to account for the +possibility of an empty database, i.e., when the tip is at the ``origin'' of the +chain. This operation is an \lstinline!STM! operation, which allows it to be +combined with other \lstinline!STM! operations in a single transaction, to +obtain a consistent view on them. This also implies that no IO or disk access is +needed to obtain the current tip. + +The \lstinline!appendBlock! operation appends a block to the Immutable DB. As +slot numbers increase monotonically in the blockchain, the block's slot must be +greater than the current tip's slot (or equal when the tip points at an EBB, see +\cref{ebbs}). It is not required that each slot is filled,\todo{link?} so there +can certainly be gaps in the slot numbers. + +The \lstinline!getBlockComponent! operation allows reading one or more +components of the block in the database at the given point. We discuss what +block components are in \cref{immutable:api:block-component}. The +\lstinline!RealPoint! type represents a point that can only refer to a block, +not to genesis (the empty chain), which the larger \lstinline!Point! type +allows. As the given point might not be in the Immutable DB, this operation can +also return a \lstinline!MissingBlock! error instead of the requested block +component. + +The \lstinline!stream! operation returns an iterator to efficiently stream the +blocks between the two given bounds. The bounds are defined as such: +\begin{lstlisting} +data StreamFrom blk = + StreamFromInclusive !(RealPoint blk) + | StreamFromExclusive !(Point blk) + +newtype StreamTo blk = + StreamToInclusive (RealPoint blk) +\end{lstlisting} +Lower bounds can be either inclusive or exclusive, but exclusive upper bounds +were omitted because they were not needed in practice. An inclusive bound must +refer to a block, not genesis, hence the use of \lstinline!RealPoint!. 
The +exclusive lower bound \emph{can} refer to genesis, hence the use of +\lstinline!Point!, in particular to begin streaming from the start of the chain. +As one or both of the bounds might not be in the Immutable DB, this operation +can return a \lstinline!MissingBlock! error. We discuss what block components +are in \cref{immutable:api:block-component}. The \lstinline!ResourceRegistry! +will be used to allocate all the resources the iterator opens during its +lifetime, e.g., file handles. By closing the registry in case of an exception +(using \lstinline!bracket!), all open resources are released and nothing is +leaked.\todo{link} More discussion about iterators follows in +\cref{immutable:api:iterators}. + +\subsection{Block Component} +\label{immutable:api:block-component} + +\todo{TODO} move to ChainDB? + +Besides reading or streaming blocks from the Immutable DB, it must be possible +to read or stream headers, raw blocks (see +\cref{serialisation:network:serialised}), but in some cases also \emph{nested +contexts} (see \cref{serialisation:storage:nested-contents}) or even block +sizes. Adding an operation to the API for each of these would result in too much +duplication. We handle this with the \lstinline!BlockComponent! abstraction: +when reading or streaming, one can choose which \emph{components} of a block +should be returned, e.g., the block itself, the header of the block, the size of +the block, the raw block, the raw header, etc. We model this with the following +GADT: + +\begin{lstlisting} +data BlockComponent blk a where + GetVerifiedBlock :: BlockComponent blk blk + GetBlock :: BlockComponent blk blk + GetRawBlock :: BlockComponent blk ByteString + GetHeader :: BlockComponent blk (Header blk) + GetRawHeader :: BlockComponent blk ByteString + GetHash :: BlockComponent blk (HeaderHash blk) + GetSlot :: BlockComponent blk SlotNo + GetIsEBB :: BlockComponent blk IsEBB + GetBlockSize :: BlockComponent blk Word32 + GetHeaderSize :: BlockComponent blk Word16 + GetNestedCtxt :: BlockComponent blk (SomeSecond (NestedCtxt Header) blk) + .. +\end{lstlisting} +The \lstinline!a! type index determines the type of the block component. +Additionally, we have \lstinline!Functor! and \lstinline!Applicative! instances. +The latter allows combining multiple \lstinline!BlockComponent!s into one. This +can be considered a small DSL for querying components of a block. + +\subsection{Iterators} +\label{immutable:api:iterators} + +The following API can be used to interact with an iterator: + +\begin{lstlisting} +data Iterator m blk b = Iterator { + iteratorNext :: m (IteratorResult b) + , iteratorHasNext :: STM m (Maybe (RealPoint blk)) + , iteratorClose :: m () + } + +data IteratorResult b = + IteratorExhausted + | IteratorResult b +\end{lstlisting} + +The \lstinline!iteratorNext! operation returns the current +\lstinline!IteratorResult! and advances the iterator to the next block in the +stream. When the iterator has reached its upper bound, it is exhausted. Remember +that the \lstinline!b! type argument corresponds to the requested block +component. + +The \lstinline!iteratorHasNext! operation returns the point corresponding to the +block the next call to \lstinline!iteratorNext! will return. When exhausted, +\lstinline!Nothing! is returned. + +As an open iterator can hold onto resources, e.g., open file handles, it should +be explicitly closed using the \lstinline!iteratorClose! operation. 
Interacting with a closed iterator should result in an exception, except for
+calling \lstinline!iteratorClose!, which is idempotent.
+
+\section{Implementation}
+\label{immutable:implementation}
+
+We will now give a high-level overview of our custom implementation of the
+Immutable DB that satisfies the requirements and the API.
+
+\begin{itemize}
+\item We store blocks sequentially in a file, called a \emph{chunk file}. We
+  append each raw block, without any extra information before or after it, to
+  the chunk file. This will facilitate efficient binary streaming of blocks. In
+  principle, it is a matter of copying bytes from one buffer to another, without
+  any additional processing needed.
+
+\item Every $x$ \emph{slots}, where $x$ is the configurable chunk size, we start
+  a new chunk file to avoid storing all blocks in a single file.
+
+\item To facilitate looking up a block by a point, which consists of the hash
+  and the slot number, we ``index'' our database by slot numbers. One can then
+  look up the block in the given slot and compare its hash against the point's
+  hash. No searching will be needed.
+
+  Blocks are stored sequentially in chunk files, but slot numbers do \emph{not}
+  increase one-by-one; they are \emph{sparse}. This means we need a mapping from
+  the slot number to the offset and size of the block in the chunk file. We
+  store this mapping in the on-disk \emph{primary index}, one per chunk file,
+  which we discuss in more detail in \cref{immutable:implementation:indices}.
+
+\item As mentioned above, when looking up a block by a point, we will compare
+  the hash of the block at the point's slot in the Immutable DB with the point's
+  hash. We should be able to do this without first having to read and
+  deserialise the entire block in order to know its hash.
+
+  Moreover, it should be possible to read just the header of the block without
+  first having to read the entire block. As described in
+  \cref{serialisation:storage:nested-contents}, we can do this if we have access
+  to the header offset, header size, and nested context of the block.
+
+  For these reasons, we store the aforementioned extra information, which should
+  be available without having to read and deserialise the entire block,
+  separately in the on-disk \emph{secondary index}, one per chunk file
+  (\cref{immutable:implementation:indices}).
+
+\item All the information stored in the primary and secondary indices can be
+  recovered from the blocks in the chunk files. This is described in
+  \cref{immutable:implementation:recovery}.
+
+\item Whenever a file-system operation fails, or a file is missing or corrupted,
+  we shut down the Immutable DB and consequently the whole system. When this
+  happens, either the system's file system is no longer reliable (e.g., disk
+  corruption), manual intervention (e.g., disk is full) is required, or there is
+  a bug in the system. In all cases, there is no point in trying to continue
+  operating. We shut down the system and flag the shutdown as \emph{dirty},
+  triggering a full validation on the next start-up; see
+  \cref{immutable:implementation:recovery}.
+
+  Not all forms of disk corruption can easily be detected. For example, when
+  some bytes in a block stored in a chunk file have been flipped on disk, this
+  can easily go unnoticed. Deserialising the block might fail if the
+  serialisation format is no longer valid, but the bitflip could also happen in,
+  e.g., the amount of a transaction, which will not be detected by the
+  deserialiser.
+  In fact, the majority of blocks read will not even be deserialised, as blocks
+  served to other nodes are read and sent in their raw, still serialised format.
+  However, sending a corrupted block must be avoided, as nodes receiving it will
+  consider it invalid and can blacklist us, mistaking us for an adversary.
+
+  To detect such forms of silent corruption, we store CRC32 checksums in the
+  secondary index (\cref{immutable:implementation:indices}), which we verify
+  when reading the blocks; this verification can be done even without
+  deserialising them. Note that we could use the block's own hash for this
+  purpose,\footnote{To be precise: we would have to check the block body against
+  the body hash stored in the header, and verify the signature of the header.}
+  but because computing such a cryptographic hash is much more expensive, we
+  opted for a separate CRC32 checksum, which is much more efficient to compute
+  and designed for exactly this purpose.
+
+\item We store the state of the current chunk, including its indices, in memory.
+  We store this state, a pure data type, in a \lstinline!StrictMVar!. Besides
+  avoiding space leaks by forcing its contents to WHNF, this
+  \lstinline!StrictMVar! type has another useful ability that its standard
+  non-strict variant lacks: while it is locked when being modified, the
+  previous, \emph{stale} value can still be read.
+
+  This is convenient for the Immutable DB: we can support multiple concurrent
+  reads even while an append operation is in progress (there is at most one
+  append at a time), as it is safe to read a block based on the stale state
+  because data will only be appended, not modified.
+
+  To append a block to the Immutable DB, we lock the state to avoid concurrent
+  append operations. We append the block to the chunk file, and append the
+  necessary information to the primary and secondary indices. Finally, we unlock
+  the state, updated with the information about the newly appended block.
+
+\item We \emph{do not flush} any writes to disk, as discussed in the
+  introduction of this chapter. This makes appending a block quite cheap: the
+  serialised block is copied to an OS buffer, which is then asynchronously
+  flushed in the background.
+
+\item To avoid repeatedly reading and deserialising the same primary and
+  secondary indices of older chunks, we cache them in an LRU cache that is
+  bounded in size.
+
+\item To open an iterator, we check its bounds using the (cached) indices. The
+  bounds are valid when both correspond to blocks present in the Immutable DB.
+  Next, a file handle is opened for the chunk file containing the first block to
+  stream. The same chunk file's indices are read (from the cache) and the
+  iterator will maintain a list of secondary index entries, one for each block
+  to stream from the chunk file. By having this list of entries in memory, the
+  indices will not have to be accessed for each streamed block.
+
+  When a block component is requested from the iterator, it is read from the
+  chunk file and/or extracted from the corresponding in-memory entry.
+  Afterwards, the entry is dropped from the in-memory list so that the next
+  entry is ready to be read. When the list of entries is exhausted without
+  reaching the end bound, we move on to the next chunk file. This process
+  repeats until the end bound is reached.
+\end{itemize}
+
+\subsection{Chunk layout}
+\label{immutable:implementation:chunk-layout}
+
+Each block in the block chain has a unique slot number (except for EBBs, which
+we discuss below).
Slot numbers increase in the blockchain, but not all slots
+have to be filled. For example, in the Byron era (using the Permissive BFT
+consensus algorithm), nearly every slot will be filled, but in the Shelley era
+(using the Praos consensus algorithm), on average only one in twenty slots will
+be filled.
+
+As mentioned above, we want to group blocks into chunk files. Because we need to
+be able to look up blocks in the Immutable DB based on their slot number, we
+group blocks into chunk files based on their slot numbers so that the chunk file
+containing a block can be determined by looking at the slot number of the block.
+
+Internally, we translate \emph{absolute} slot numbers into \emph{chunk numbers}
+and \emph{relative slot numbers} (relative w.r.t.\ the chunk). As EBBs
+(\cref{ebbs}) have the same slot number as their successor, this translation is
+not injective. To restore injectivity, we include ``whether the block is an EBB
+or not'' as an input to the translation.
+
+\todo{TODO} how should this be formatted?
+
+\begin{definition}[Chunk number]
+  Let $s$ be the absolute slot number of a block. Using a chunk size of
+  $\mathit{sz}$:
+
+  \[
+  \chunkNumber{s} = \lfloor s / \mathit{sz} \rfloor
+  \]
+  Naturally, chunks are zero-indexed.
+
+\end{definition}
+
+\begin{definition}[Relative slot number]
+  Let $s$ be the absolute slot number of a block. Using a chunk size of
+  $\mathit{sz}$:
+
+  \[
+  \relativeSlot{s}{\mathit{isEBB}} =
+  \begin{cases}
+    0 & \text{if}\,\mathit{isEBB} \\
+    (s \bmod \mathit{sz}) + 1 & \text{otherwise}
+  \end{cases}
+  \]
+  We reserve the very first relative slot for an EBB, hence the need to make
+  room for it by incrementing by one in the non-EBB case.
+\end{definition}
+
+In the example below, we show a chunk with chunk number 1 using a chunk size of
+100:
+
+\begin{center}
+\begin{tikzpicture}
+\draw (0, 0) -- (10, 0);
+\draw (0, 1) -- (10, 1);
+
+\draw ( 0, 0) -- ( 0, 1);
+\draw ( 1, 0) -- ( 1, 1);
+\draw ( 2, 0) -- ( 2, 1);
+\draw ( 3, 0) -- ( 3, 1);
+\draw ( 4, 0) -- ( 4, 1);
+\draw ( 8, 0) -- ( 8, 1);
+\draw ( 9, 0) -- ( 9, 1);
+\draw (10, 0) -- (10, 1);
+
+\draw (0.5, 0.5) node {\small EBB};
+\draw (1.5, 0.5) node {\small Block};
+\draw (2.5, 0.5) node {\small Block};
+\draw (3.5, 0.5) node {\small Block};
+\draw (6.0, 0.5) node {\small \ldots};
+\draw (8.5, 0.5) node {\small Block};
+\draw (9.5, 0.5) node {\small Block};
+
+\draw (-2.0, -0.5) node {\small Absolute slot numbers};
+\draw ( 0.5, -0.5) node {\small 100};
+\draw ( 1.5, -0.5) node {\small 100};
+\draw ( 2.5, -0.5) node {\small 101};
+\draw ( 3.5, -0.5) node {\small 103};
+\draw ( 8.5, -0.5) node {\small 197};
+\draw ( 9.5, -0.5) node {\small 199};
+
+\draw (-2.0, -1.2) node {\small Relative slot numbers};
+\draw ( 0.5, -1.2) node {\small 0};
+\draw ( 1.5, -1.2) node {\small 1};
+\draw ( 2.5, -1.2) node {\small 2};
+\draw ( 3.5, -1.2) node {\small 4};
+\draw ( 8.5, -1.2) node {\small 98};
+\draw ( 9.5, -1.2) node {\small 100};
+\end{tikzpicture}
+\end{center}
+Note that some slots are empty, e.g., 102 and 198 are missing. The first and
+last slots can be empty too. In practice, it will never be the case that an
+entire chunk is empty, but the implementation allows for it.
+
+If we were to pick a chunk size of 1 and store each block in its own file, we
+would need millions of files, as there are millions of blocks. When serving
+blocks to peers, we would constantly open and close individual block files,
+which is very inefficient.
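+
+As an aside, the slot translation defined above is easy to express in code. The
+following is a minimal Haskell sketch, under the assumption of a fixed chunk
+size and with a plain \lstinline!Bool! standing in for the EBB flag; the type
+and function names are illustrative only and do not match the actual
+implementation.
+
+\begin{lstlisting}
+import Data.Word (Word64)
+
+newtype SlotNo       = SlotNo Word64       -- absolute slot number
+newtype ChunkSize    = ChunkSize Word64    -- number of slots per chunk
+newtype ChunkNumber  = ChunkNumber Word64
+newtype RelativeSlot = RelativeSlot Word64
+
+chunkNumber :: ChunkSize -> SlotNo -> ChunkNumber
+chunkNumber (ChunkSize sz) (SlotNo s) = ChunkNumber (s `div` sz)
+
+relativeSlot :: ChunkSize -> Bool -> SlotNo -> RelativeSlot
+relativeSlot (ChunkSize sz) isEBB (SlotNo s)
+  | isEBB     = RelativeSlot 0                  -- the reserved first slot
+  | otherwise = RelativeSlot (s `mod` sz + 1)   -- shifted to make room for an EBB
+\end{lstlisting}
+
+For example, with a chunk size of 100, slot 103 maps to chunk 1 and relative
+slot 4, matching the example above.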
+
+If we pick a very large or even unbounded chunk size, the resulting chunk file
+would be several gigabytes in size and keep growing. This would make the
+recovery process (\cref{immutable:implementation:recovery}) more complicated and
+potentially much slower, as more data might have to be read and
+validated.\todo{Other arguments?} Moreover, our current approach of caching
+indices per chunk would have to be revised.
+
+In practice, a chunk size of \num{21600} is used, which matches the \emph{epoch
+size} of Byron. It is no coincidence that there is (at most) one EBB at the
+start of each Byron epoch, fitting nicely in the first relative slot that we
+reserve for it. Originally, the Immutable DB called these chunk files
+\emph{epoch files}. With the advent of Shelley, which has a different epoch size
+than Byron, we decoupled the two and introduced the name ``chunk''.
+
+\paragraph{Dynamic chunk size}
+
+The \emph{chunking} scheme was designed with the possibility of a non-uniform
+chunk size in mind. Originally, the goal was to make the chunk size configurable
+such that the number of slots per chunk could change after a certain slot.
+Similarly, reserving an extra slot for an EBB would be optional and could
+stop after a certain slot, i.e., when the production of EBBs stopped. The
+reasoning behind this was to allow the chunk size to change near the transition
+from Byron to Shelley. As the slot density goes down by a factor of twenty when
+transitioning to Shelley, the number of blocks per chunk file and, consequently,
+the size of each chunk file would go down by the same factor, leading to too
+many, too small chunk files. The intention was to configure the chunk size to
+increase by the same factor at the start of the Shelley era.
+
+The transition to another era, e.g., Shelley, is dynamic: the slot at which it
+happens is determined by on-chain voting and is only known for certain a number
+of hours in advance. Making the mapping from slot number to chunk and relative
+slot number rely on the actual slot at which the transition happened would
+complicate things significantly. It would make the mapping depend on the ledger
+state, which determines the current era. This would introduce an unwanted
+coupling between the Immutable DB, which stores \emph{blocks}, and the ledger
+state obtained by applying these blocks. A reasonable compromise would be to
+hard-code the change in chunk size to the estimated transition slot. When the
+estimate is incorrect, only a few Byron chunks would contain more blocks than
+intended or only a few Shelley chunks would contain fewer blocks than intended.
+
+Unfortunately, due to lack of time, dynamic chunk sizes were not implemented in
+time for the transition to Shelley. This means the same chunk size is being used
+for the \emph{entire chain}, resulting in fewer blocks per Shelley chunk file
+than ideal, and, consequently, more chunk files than ideal.
+
+\paragraph{Indexing by block number}
+
+The problem of too many, too small chunk files described in the paragraph above
+is caused by the fact that slot numbers can be sparse and do not have to be
+consecutive. \emph{Block numbers} do not have the same problem: they are
+consecutive and thus dense, regardless of the era of the blockchain. If instead
+of indexing the Immutable DB by slot numbers, we indexed it by \emph{block
+numbers}, we would not have this problem.
Unfortunately, the point type, which is
+used throughout the network layer and the consensus layer to identify and look
+up blocks, consists of a hash and \emph{slot number}, not a block number.
+
+We would either have to maintain another index from slot number to block number,
+which would itself require either a chunking scheme based on slot numbers or
+just one big file, each of which has its own downsides; or points would have to
+be based on block numbers instead of slot numbers. As points are omnipresent,
+the latter change would be very far-reaching. We would prefer that approach, but
+it is currently out of the question. The former is more localised, but the
+uncertain benefits do not outweigh the complexity and risks involved in
+migrating deployed on-disk databases to the new format.
+
+\subsection{Indices}
+\label{immutable:implementation:indices}
+
+As mentioned before, we have on-disk indices for the chunk files for two
+purposes:
+\begin{enumerate}
+\item To map the sparse slot numbers to the blocks that are densely stored in
+  the chunk files.
+\item To store the information about a block that should be available without
+  having to read and deserialise the actual block, e.g., the header offset, the
+  header size, the CRC32 checksum, etc.
+\end{enumerate}
+We use a separate index for each task: the \emph{primary index} for the first
+task and the \emph{secondary index} for the second task. Each chunk file has a
+corresponding primary index file and secondary index file. Because of a
+dependency of the primary index on the secondary index, we first discuss the
+latter.
+
+\paragraph{Secondary index}
+
+In the secondary index, we store the information about a block that is needed
+before, or without, having to read and deserialise the block. The secondary
+index is an append-only file, like the chunk file, and contains a
+\emph{secondary index entry} for each block. For simplicity and robustness, a
+secondary index merely contains a series of densely stored secondary index
+entries with no extra information between, before, or after them. This avoids
+needing to initialise or finalise such a file, which also makes the recovery
+process simpler (\cref{immutable:implementation:recovery}). A secondary index
+entry consists of the following fields:
+
+\begin{center}
+\begin{tabular}{l r}
+  field & size [bytes] \\
+  \hline
+  block offset  & 8 \\
+  header offset & 2 \\
+  header size   & 2 \\
+  checksum      & 4 \\
+  header hash   & X \\
+  block or EBB  & 8 \\
+\end{tabular}
+\end{center}
+
+\begin{itemize}
+\item The block offset is used to determine at which offset in the corresponding
+  chunk file the raw block can be read.
+
+  As blocks are variable-sized, the size of the block also needs to be known in
+  order to read it. Instead of spending another 8 bytes to store the block size
+  as an additional field, we read the block offset of the \emph{next entry} in
+  the secondary index, which corresponds to the block after it. The block size
+  can then be computed by subtracting the latter's block offset from the
+  former's.
+
+  In case the block is the final block in the chunk file, there is no next
+  entry. Instead, the final block's size can be derived from the chunk file's
+  size. When reading the final block $B_n$ of the current chunk file, it is
+  important to obtain the chunk file size at the right time, before any more
+  blocks ($B_{n+1}, B_{n+2}, \ldots$) are appended to the same file, increasing
+  the chunk file size.
+  Otherwise, we risk reading the bytes corresponding to not
+  just the previously final block $B_n$, but also $B_{n+1}, B_{n+2},
+  \ldots$\footnote{In hindsight, storing the block size as a separate field would
+  have simplified the implementation.}
+
+  The reasoning behind using 8 bytes for the block offset is the following. The
+  maximum block header and block body sizes permitted by the blockchain itself
+  are dynamic parameters that can change through on-chain voting. At the time of
+  writing, the maximum header size is \num{1100} bytes and the maximum body size
+  is \num{65536} bytes. By multiplying this theoretical maximum block size of
+  $\num{1100} + \num{65536} = \num{66636}$ bytes by the chunk size used in
+  practice, i.e., \num{21600}, assuming a maximal density of 1.0 in the Byron
+  era, we get \num{1439337600} as the maximal file size for a chunk file. An
+  offset into a file of that size fits tightly in 4 bytes, but this would not
+  support any future maximum block size increases, hence the decision to use 8
+  bytes.
+
+\item The header offset and header size are needed to extract the header from a
+  block without first having to read and deserialise the entire block, as
+  discussed in \cref{serialisation:storage:nested-contents}. These are stored
+  per block, as the header size can differ from block to block. The nested
+  context is reconstructed by reading bytes from the start of the block, as
+  explained in our discussion of the \lstinline!ReconstructNestedCtxt! class in
+  \cref{serialisation:storage:nested-contents}.
+
+  Using 2 bytes for the header offset and header size is enough when taking the
+  following into account: (so far all types of) blocks start with their header,
+  the current maximum header size is \num{1100} bytes, and the header offset is
+  relative to the start of the block.
+
+\item As discussed before, to detect silent corruption of blocks, we store a CRC
+  checksum of each block, against which the block is verified after reading it.
+  This verification can be done without deserialising the block.
+
+  Note that we do not store a checksum of the raw header, which means we do not
+  check for silent corruption when streaming headers.\todo{Maybe we should?}
+
+\item The header hash is used for lookups and bounds checking, i.e., to check
+  whether the given point's hash matches the hash of the block at the same slot
+  in the Immutable DB. By storing it separately, we do not have to read the
+  block's header and compute its hash just to check whether it is the right one.
+
+  The header hash field's size depends on the concrete instantiation of the
+  \lstinline!HeaderHash blk! type family. In practice, a 32-byte hash is used.
+
+\item The ``block or EBB'' field is represented in memory as follows:
+
+  \begin{lstlisting}
+  data BlockOrEBB =
+      Block !SlotNo
+    | EBB !EpochNo
+  \end{lstlisting}
+
+  The former constructor represents a regular block with an absolute slot number
+  and the latter an EBB (\cref{ebbs}) with an epoch number (since there is only
+  a single EBB per epoch). The main reason this field is part of the secondary
+  index entry is to implement the \lstinline!iteratorHasNext! method of the
+  iterator API (see \cref{immutable:api:iterators}) without having to read the
+  next block from disk, as the iterator will keep these secondary index entries
+  in memory.
+
+  Both the \lstinline!SlotNo! and \lstinline!EpochNo! types are newtypes around
+  a \lstinline!Word64!, hence the 8 on-disk bytes.
+  We omit the tag
+  distinguishing between the two constructors in the serialisation because in
+  nearly all cases, this information has already been retrieved from the primary
+  index, i.e., whether the first filled slot in a chunk is an EBB or
+  not.\footnote{In hindsight, having the tag in the serialisation would have
+  simplified the implementation.}
+
+\item Because of the fixed size of each field, it was originally decided to
+  (de)serialise the corresponding data type using the \lstinline!Binary! class.
+  Using CBOR would be more flexible with respect to future changes. This would
+  make the encoding variable-sized, which is not necessarily an issue, as will
+  become clear in our description of the primary index.
+
+\end{itemize}
+
+\paragraph{Primary index}
+
+The primary index maps the sparse slot numbers to the secondary index entries of
+the corresponding blocks in the dense secondary index. As discussed above, the
+secondary index entry of a block tells us the offset in the chunk file of the
+corresponding block.
+
+The format of the primary index is as follows. The primary index starts with a
+byte indicating its version number. Next, for each slot, empty or not, we store
+the offset at which its secondary index entry starts in the secondary index.
+This same offset will correspond to the \emph{end} of the previous secondary
+index entry. When a slot is empty, its offset will be the same as the offset of
+the slot before it, indicating that the corresponding secondary index entry is
+empty.
+
+When appending a new block, we append the previous offset once for every slot
+that was skipped, indicating that they are empty. Next, we append the offset
+after the newly appended secondary index entry corresponding to the new block.
+
+We use a fixed size of 4 bytes to store each offset. As this is an offset in the
+secondary index, it should be at least large enough to address the maximal size
+of a secondary index file. We can compute this by multiplying the used chunk
+size by the size of a secondary index entry: $\num{21600} \times (8 + 2 + 2 + 4
++ 32 + 8) = \num{1209600}$, which requires more than 2 bytes to address.
+
+To look up the secondary index entry for a certain slot, we compute the
+corresponding chunk number and relative slot number using $\mathsf{chunkNumber}$
+and $\mathsf{relativeSlot}$ (we discuss how we deal with EBBs later). Because we
+use a fixed size for each offset, based on the relative slot number, we can
+compute exactly which bytes to read at which offset in the primary index, i.e.,
+the $4 + 4$ bytes corresponding to the offset at the relative slot and the
+offset after it. When both offsets are equal, the slot is empty. When they are
+not equal, we know which bytes to read from the secondary index to obtain the
+secondary index entry corresponding to the block in question.
+
+However, as mentioned in \cref{immutable:implementation}, we maintain a cache of
+primary indices, which means that they are always read from disk in their
+entirety. After a cache hit, looking up a relative slot in the cached primary
+index corresponds to a constant-time lookup in a vector.
+
+We illustrate this format with an example primary index below, which matches the
+chunk from the example in \cref{immutable:implementation:chunk-layout}. The
+offsets correspond to the blocks on the line below them, where $\emptyset$
+indicates an empty slot. We assume a fixed size of 10 bytes for each secondary
+index entry.
+The offset $X$ corresponds to the final size of the secondary index.
+
+\begin{center}
+\begin{tikzpicture}
+\draw (0, 0) -- (12, 0);
+\draw (0, 1) -- (12, 1);
+
+\draw ( 0, 0) -- ( 0, 1);
+\draw ( 1, 0) -- ( 1, 1);
+\draw ( 2, 0) -- ( 2, 1);
+\draw ( 3, 0) -- ( 3, 1);
+\draw ( 4, 0) -- ( 4, 1);
+\draw ( 5, 0) -- ( 5, 1);
+\draw ( 6, 0) -- ( 6, 1);
+\draw ( 8, 0) -- ( 8, 1);
+\draw ( 9, 0) -- ( 9, 1);
+\draw (10, 0) -- (10, 1);
+\draw (11, 0) -- (11, 1);
+\draw (12, 0) -- (12, 1);
+
+\draw (-2.0, 0.5) node {\small Offsets};
+\draw ( 0.5, 0.5) node {\scriptsize 0};
+\draw ( 1.5, 0.5) node {\scriptsize 10};
+\draw ( 2.5, 0.5) node {\scriptsize 20};
+\draw ( 3.5, 0.5) node {\scriptsize 30};
+\draw ( 4.5, 0.5) node {\scriptsize 30};
+\draw ( 5.5, 0.5) node {\scriptsize 40};
+\draw ( 7.0, 0.5) node {\scriptsize \ldots};
+\draw ( 8.5, 0.5) node {\scriptsize $X - 20$};
+\draw ( 9.5, 0.5) node {\scriptsize $X - 10$};
+\draw (10.5, 0.5) node {\scriptsize $X - 10$};
+\draw (11.5, 0.5) node {\scriptsize $X$};
+
+\draw[dashed] (0.5, -1) -- (11.5, -1);
+
+\draw[dashed] ( 0.5, -1) -- ( 0.5, 0);
+\draw[dashed] ( 1.5, -1) -- ( 1.5, 0);
+\draw[dashed] ( 2.5, -1) -- ( 2.5, 0);
+\draw[dashed] ( 3.5, -1) -- ( 3.5, 0);
+\draw[dashed] ( 4.5, -1) -- ( 4.5, 0);
+\draw[dashed] ( 5.5, -1) -- ( 5.5, 0);
+\draw[dashed] ( 8.5, -1) -- ( 8.5, 0);
+\draw[dashed] ( 9.5, -1) -- ( 9.5, 0);
+\draw[dashed] (10.5, -1) -- (10.5, 0);
+\draw[dashed] (11.5, -1) -- (11.5, 0);
+
+\draw (-2.0, -0.5) node {\small Blocks};
+\draw ( 1.0, -0.5) node {\scriptsize EBB};
+\draw ( 2.0, -0.5) node {\scriptsize Block};
+\draw ( 3.0, -0.5) node {\scriptsize Block};
+\draw ( 4.0, -0.5) node {\scriptsize $\emptyset$};
+\draw ( 5.0, -0.5) node {\scriptsize Block};
+\draw ( 7.0, -0.5) node {\scriptsize \ldots};
+\draw ( 9.0, -0.5) node {\scriptsize Block};
+\draw (10.0, -0.5) node {\scriptsize $\emptyset$};
+\draw (11.0, -0.5) node {\scriptsize Block};
+
+\draw (-2.0, -1.5) node {\small Absolute slot numbers};
+\draw ( 1.0, -1.5) node {\scriptsize 100};
+\draw ( 2.0, -1.5) node {\scriptsize 100};
+\draw ( 3.0, -1.5) node {\scriptsize 101};
+\draw ( 4.0, -1.5) node {\scriptsize 102};
+\draw ( 5.0, -1.5) node {\scriptsize 103};
+\draw ( 7.0, -1.5) node {\scriptsize \ldots};
+\draw ( 9.0, -1.5) node {\scriptsize 197};
+\draw (10.0, -1.5) node {\scriptsize 198};
+\draw (11.0, -1.5) node {\scriptsize 199};
+
+\draw (-2.0, -2.2) node {\small Relative slot numbers};
+\draw ( 1.0, -2.2) node {\scriptsize 0};
+\draw ( 2.0, -2.2) node {\scriptsize 1};
+\draw ( 3.0, -2.2) node {\scriptsize 2};
+\draw ( 4.0, -2.2) node {\scriptsize 3};
+\draw ( 5.0, -2.2) node {\scriptsize 4};
+\draw ( 7.0, -2.2) node {\scriptsize \ldots};
+\draw ( 9.0, -2.2) node {\scriptsize 98};
+\draw (10.0, -2.2) node {\scriptsize 99};
+\draw (11.0, -2.2) node {\scriptsize 100};
+
+\end{tikzpicture}
+\end{center}
+
+The version number we mentioned above can be used to migrate indices in the old
+format to a newer format, should the need arise in the future. We do not
+include a version number in the secondary index, as both index formats are
+tightly coupled, which means that both index files should be migrated together.
+
+One might realise that because the size of a secondary index entry is static,
+the primary index could be represented more compactly using a bitmap. This is
+indeed the case and the reason for it not being a bitmap is mostly a historical
+accident.
However, this accident has the upside that migrating to variable-sized
+secondary index entries, e.g., serialised using CBOR instead of
+\lstinline!Binary!, is straightforward.
+
+\paragraph{Lookup}
+
+Having discussed both index formats, we can now finally detail the process of
+looking up a block by a point. Given a point with slot $s$ and hash $h$, we need
+to go through the following steps to read the corresponding block:
+
+\begin{enumerate}
+\item Determine the chunk number $c = \chunkNumber{s}$.
+\item Determine the relative slot $\mathit{rs}$ within chunk $c$ corresponding
+  to $s$: $\mathit{rs} = \relativeSlot{s}{\mathit{isEBB}}$.
+
+  Note the $\mathit{isEBB}$ argument, which is unknown at this point. Just by
+  looking at the slot and the static chunk size, we can tell whether the block
+  \emph{could} be an EBB or not: only the very first slot in a chunk (which has
+  the same size as a Byron epoch) could correspond to an EBB \emph{or} the
+  regular block after it. For all other slots, we are certain that they cannot
+  correspond to an EBB.
+
+  In case the slot $s$ corresponds to the very first slot in the chunk, we will
+  have to use the hash $h$ to determine whether the point corresponds to the EBB
+  or the regular block in slot $s$.
+\item We look up the offset at $\mathit{rs}$ and the offset after it in the
+  primary index of chunk $c$. As discussed, these lookups go through a cache and
+  are cheap. We now have the offsets in the secondary index file corresponding
+  to the start and end of the secondary index entry we are interested in. If
+  both offsets are equal, the slot is empty, and the lookup process terminates.
+
+  In case of a potential EBB, we have to do two such lookups: one for relative
+  slot 0 and one for relative slot 1.
+
+\item We read the secondary index entry from the secondary index file. The
+  secondary indices are also cached on a per-chunk basis. The secondary index
+  entry contains the header hash, which we can now compare against $h$. In case
+  of a match, we can read the block from the chunk file using the block offset
+  contained in the secondary index entry. When the hash does not match, the
+  lookup process terminates.
+
+  In case of a potential EBB, the hash comparisons finally tell us whether the
+  point corresponds to the EBB or the regular block in slot $s$, or
+  \emph{neither}, in case neither hash matches $h$.
+\end{enumerate}
+
+\subsection{Recovery}
+\label{immutable:implementation:recovery}
+
+Because of the specific requirements of the Immutable DB and the expected write
+patterns, we can use a much simpler recovery scheme than traditional database
+systems. Only the immutable, append-only part of the chain is stored, which
+means that data inconsistencies (e.g., because of a hard shutdown) are most
+likely to happen at the end of the chain, i.e., in the last chunk and its
+indices. We can simply truncate the chain in such cases. As we maintain some
+overlap with the Volatile DB\todo{link}, blocks truncated from the end of the
+chain are likely to still be in the Volatile DB, making the recovery process
+unnoticeable. If the overlap is not enough and the truncated blocks are not in
+the Volatile DB, they can simply be downloaded again.
+
+There are two modes of recovery:
+\begin{enumerate}
+\item Validate the last chunk: this is the default mode when opening the
+  Immutable DB. The last chunk file and its indices are validated.
This will
+  detect and truncate append operations that did not go through entirely, e.g.,
+  a block that was only partially appended to a chunk file, or a block that was
+  appended to one or both of the indices, but not to the chunk file.
+
+  When a chunk file ends up empty after truncation, we validate the
+  chunk file before it. In the unlikely case that that chunk file has to be
+  truncated and ends up empty too, we validate the chunk file before it and so
+  on, until we reach a valid block or the database is empty.
+
+\item Validate all chunks: this is the full recovery mode that is triggered by a
+  dirty shutdown, caused by a missing or corrupted file (e.g., a checksum
+  mismatch while reading), or because the node itself was not shut down
+  properly.\todo{In the latter case, validating the last would be enough} We
+  validate all chunk files and their indices, from oldest to newest. When a
+  corrupt or missing block is encountered, we truncate the chain to the last
+  valid block before it. Trying to recover from a chain with holes in it would
+  be terribly complex; we therefore do not even try it.
+
+\end{enumerate}
+In both recovery modes, chunks are validated the same way, which we describe
+shortly. When in full recovery mode, we also check whether the last block in a
+chunk is the predecessor of the first block in the next chunk, by comparing the
+hashes. This helps sniff out a truncated chunk file that is not the final one,
+which would cause a gap in the chain.
+
+Validating a chunk proceeds as follows:
+\begin{itemize}
+\item In the common case, the chunk file and the corresponding primary and
+  secondary index files will be present and all valid. We optimise for this
+  case.\footnote{Unlike in other areas, where we try to maintain that the
+  average case is equal to the worst case.}
+
+\item The secondary index contains a CRC32 checksum of each block in the
+  corresponding chunk (see \cref{immutable:implementation:indices}); we extract
+  these checksums and pass them to the \emph{chunk file parser}.
+
+\item The chunk file parser will try to deserialise all blocks in a chunk file.
+  When a block fails to deserialise, it is treated as corrupt and we truncate
+  the chain to the last valid block before it. Each raw block is also checked
+  against the CRC32 checksum from the secondary index, to detect corruptions
+  that are not caught by deserialising, e.g., flipping a bit in a
+  \lstinline!Word64!, which can remain a valid, yet corrupt
+  \lstinline!Word64!.\footnote{One might think that deserialising the blocks is
+  not necessary if the checksums all match. However, the chunk file parser also
+  constructs the corresponding secondary index, which is used to validate the
+  on-disk one. For this process, deserialisation is required.}
+
+  When the CRC32 checksum is not available, because of a missing or partial
+  secondary index file, we fall back to the more expensive validation of the
+  block based on its cryptographic hashes to detect silent corruption. This type
+  of validation is block-dependent and provided in the form of the
+  \lstinline!nodeCheckIntegrity! method of the \lstinline!NodeInitStorage!
+  class. This validation is implemented by hashing the body of the block and
+  comparing it against the body hash stored in the header, and by verifying the
+  signature of the header.
+
+  When the CRC32 checksum \emph{is} available, but does not match the one
+  computed from the raw block, we also fall back to this validation, as we do
+  not know whether the checksum or the block was corrupted (although the latter
+  is far more likely).
+
+  The chunk file parser also verifies that the hashes line up within a chunk, to
+  detect missing blocks. It does this by comparing the ``previous hash'' of each
+  block with the previous block's hash.
+
+  The chunk file parser returns a list of secondary index entries, which
+  together form the corresponding secondary index.
+
+\item The chunk file containing the blocks is our source of truth. To check the
+  validity of the secondary index, we check whether it matches the secondary
+  index returned by the chunk file parser. If there is a mismatch, we overwrite
+  the entire secondary index file using the secondary index returned by the
+  chunk file parser.
+
+\item We can reconstruct the primary index from the secondary index returned by
+  the chunk file parser. When the on-disk primary index is missing or it is not
+  equal to the reconstructed one, we (over)write it using the reconstructed one.
+
+\item When truncating the chain, we always make sure that the resulting chain
+  ends with a block, i.e., a filled slot, not an empty slot, even if this means
+  going back to the previous chunk.
+
+\end{itemize}
+
+We test the recovery process in our \lstinline!quickcheck-state-machine! tests
+of the Immutable DB. In various states of the Immutable DB, we generate one or
+more random corruptions for any of the on-disk files: either a simple deletion
+of the file, a truncation, or a random bitflip. We verify that after restarting
+the Immutable DB, it has recovered to the last valid block before the
+corruption.
+
+Additionally, in those same tests, we simulate file-system errors during
+operations. For example, while appending a block, we let the second disk write
+fail. This is another way of testing whether we can correctly recover from a
+write that was aborted half-way through.
diff --git a/ouroboros-consensus/docs/report/report.tex b/ouroboros-consensus/docs/report/report.tex
index 625736b3e0b..a614ebb5e84 100644
--- a/ouroboros-consensus/docs/report/report.tex
+++ b/ouroboros-consensus/docs/report/report.tex
@@ -8,6 +8,7 @@ \usepackage{listings}
 \usepackage[nameinlink]{cleveref}
 \usepackage{microtype}
+\usepackage[group-separator={,}]{siunitx}
 \hypersetup{
   pdftitle={The Cardano Consensus and Storage Layer},

From ccadc7a9d903138e84e0de284b00be4c61a31c78 Mon Sep 17 00:00:00 2001
From: Thomas Winant
Date: Wed, 6 Jan 2021 11:16:55 +0100
Subject: [PATCH 2/5] report: Volatile Database

---
 .../report/chapters/storage/volatiledb.tex | 398 ++++++++++++++++++
 1 file changed, 398 insertions(+)

diff --git a/ouroboros-consensus/docs/report/chapters/storage/volatiledb.tex b/ouroboros-consensus/docs/report/chapters/storage/volatiledb.tex
index 46aa1786c53..e54c34ee9c2 100644
--- a/ouroboros-consensus/docs/report/chapters/storage/volatiledb.tex
+++ b/ouroboros-consensus/docs/report/chapters/storage/volatiledb.tex
@@ -1,2 +1,400 @@
 \chapter{Volatile Database}
 \label{volatile}
+
+The Volatile DB is tasked with storing the blocks that are part of the
+\emph{volatile} part of the chain. Do not be misled by its name: the Volatile DB
+\emph{should persist} the blocks it stores to disk.
The volatile part of the chain +consists of the last $k$ (the security parameter, see +\cref{consensus:overview:k}) blocks of the chain, which can still be rolled back +when switching to a fork. This means that unlike the Immutable DB, which stores +the immutable prefix of \emph{the} chosen chain, the Volatile DB can store +potentially multiple chains, one of which will be the current chain. It will +also store forks that we have switched away from, or will still switch to, when +they grow longer and become preferable to our current chain. Moreover, the +Volatile DB can contain disconnected blocks, as the block fetch +client\todo{link} might download or receive blocks out of order. + +We list the requirements and non-requirements of this component in no particular +order. Note that some of these requirements were defined in response to the +requirements of the Immutable DB (see \cref{immutable}), and vice versa. + +\begin{itemize} +\item \textbf{Add-only}: new blocks are always added, never modified. +\item \textbf{Out-of-order}: new blocks can be added in any order, i.e., + consecutive blocks on a chain are not necessarily added consecutively. They + can arrive in any order and can be interspersed with blocks from other chains. +\item \textbf{Garbage-collected}: blocks in the current chain that become older + than $k$, i.e., there are at least $k$ more recent blocks in the current chain + after them, are copied from the Volatile DB to the Immutable DB, as they move + from the volatile to the immutable part of the chain. After copying them to + the Immutable DB, they can be \emph{garbage collected} from the Volatile DB. + + Blocks that are not part of the chain but are too old to switch to, should + also be garbage collected. +\item \textbf{Overlap}: by allowing an \emph{overlap} of blocks between the + Immutable DB and the Volatile DB, i.e., by delaying garbage collection so that + it does not happen right after copying the block to the Immutable DB, we can + weaken the durability requirement on the Immutable DB. Blocks truncated from + the end of the Immutable DB will likely still be in the Volatile DB, and can + simply be copied again.\todo{done by ChainDB} +\item \textbf{Durability}: similar to the Immutable DB's durability + \emph{non-requirement}, losing a block because of a crash in the middle or + right after appending a block is inconsequential. The block can be downloaded + again. +\item \textbf{Size}: because of garbage collection, there is a bound on the size + of the Volatile DB in terms of blocks: in the order of $k$, which is 2160 for + mainnet (we give a more detailed estimate of the size in + \cref{volatile:implementation:gc}). This makes the size of the Volatile DB + relatively small, allowing for some information to be kept in memory instead + of on disk. +\item \textbf{Reading}: the database should be able to return the block or + header corresponding to the given hash efficiently. Unlike the Immutable DB, + we do not index by slot numbers, as multiple blocks, from different forks, can + have the same slot number. Instead, we use header hashes. +\item \textbf{Queries}: it should be possible to query information about blocks. 
+ For example, we need to be able to efficiently tell which blocks are stored in + the Volatile DB, or construct a path through the Volatile DB connecting a + block to another one by chasing its predecessors.\footnote{Note that + implementing this efficiently using SQL is not straightforward.} Such + operations should produce consistent results, even while blocks are being + added and garbage collected concurrently. +\item \textbf{Recoverability}: because of its small size and it being acceptable + to download missing blocks again, it is not of paramount importance to be able + to recover as many blocks as possible in case of a corruption. + + However, corrupted blocks should be detected and deleted from the Volatile DB. +\item \textbf{Efficient streaming}: while blocks will be streamed from the + Volatile DB, this requirement is not as important as it is for the Immutable + DB. Only a small number of blocks will reside in the Volatile DB, hence fewer + blocks will be streamed. Most commonly, the block at the tip of the chain will + be streamed from the Volatile DB (and possibly some of its predecessors). In + this case, efficiently being able to read a single block will suffice. +\end{itemize} + +\section{API} +\label{volatile:api} + +Before we describe the implementation of the Volatile DB, we first describe its +functionality. The Volatile DB has the following API: + +\begin{lstlisting} +data VolatileDB m blk = VolatileDB { + closeDB :: m () + + , putBlock :: blk -> m () + + , getBlockComponent :: + forall b. + BlockComponent blk b + -> HeaderHash blk + -> m (Maybe b) + + , garbageCollect :: SlotNo -> m () + + , getBlockInfo :: STM m (HeaderHash blk -> Maybe (BlockInfo blk)) + + , filterByPredecessor :: STM m (ChainHash blk -> Set (HeaderHash blk)) + + , getMaxSlotNo :: STM m MaxSlotNo + } +\end{lstlisting} + +The database is parameterised over the block type \lstinline!blk! and the monad +\lstinline!m!, like most of the consensus layer.\todo{mention io-sim} +\todo{TODO} Mention our use of records for components? + +The \lstinline!closeDB! operation closes the database, allowing all opened +resources, including open file handles, to be released. This is typically only +used when shutting down the entire system. Calling any other operation on an +already-closed database should result in an exception. + +The \lstinline!putBlock! operation adds a block to the Volatile DB. There are no +requirements on this block. This operation is idempotent, as duplicate blocks +are ignored. + +The \lstinline!getBlockComponent! operation allows reading one or more +components of the block in the database with the given hash. See +\cref{immutable:api:block-component} for a discussion about block components. As +no block with the given hash might be in the Volatile DB, this operation returns +a \lstinline!Maybe!. + +The \lstinline!garbageCollect! operation will try to garbage collect all blocks +with a slot number less than the given one. This will be called after copying a +block with the given slot number to the Immutable DB. Note that the condition is +``less than'', not ``less than or equal to'', even though after a block with +slot $s$ has become immutable, any other blocks produced in the same slot $s$ +can never be adopted again and can thus safely be garbage collected. 
Moreover,
+the block we have just copied to the Immutable DB will not even be garbage
+collected from the Volatile DB (that will be done after copying its successor
+and triggering a garbage collection for the successor's slot number).
+
+The reason for ``less than'' is because of EBBs (\cref{ebbs}). An EBB has the
+same slot number as its successor. This means that if an EBB has become
+immutable, and we were to garbage collect all blocks with a slot less than or
+\emph{equal} to its slot number, we would garbage collect its successor block
+too, before having copied it to the Immutable DB.
+
+The next two operations, \lstinline!getBlockInfo! and
+\lstinline!filterByPredecessor!, allow querying the Volatile DB. Both operations
+are \lstinline!STM! transactions that return a function. This means that they
+can both be called in the same transaction to ensure they produce results that
+are consistent w.r.t.\ each other.
+
+The \lstinline!getBlockInfo! operation returns a function to look up the
+\lstinline!BlockInfo! corresponding to a block's hash. The \lstinline!BlockInfo!
+data type is defined as follows:
+\begin{lstlisting}
+data BlockInfo blk = BlockInfo {
+      biHash         :: !(HeaderHash blk)
+    , biSlotNo       :: !SlotNo
+    , biBlockNo      :: !BlockNo
+    , biPrevHash     :: !(ChainHash blk)
+    , biIsEBB        :: !IsEBB
+    , biHeaderOffset :: !Word16
+    , biHeaderSize   :: !Word16
+    }
+\end{lstlisting}
+This is similar to the information stored in the Immutable DB's on-disk indices;
+see \cref{immutable:implementation:indices}. However, in this case, the
+information has to be retrieved from an in-memory index, as the function
+returned from the \lstinline!STM! transaction is pure.
+
+The \lstinline!filterByPredecessor! operation returns a function to look up the
+successors of a given \lstinline!ChainHash!. The \lstinline!ChainHash! data type
+is defined as follows:\todo{Explain somewhere else and link?}
+\begin{lstlisting}
+data ChainHash b =
+    GenesisHash
+  | BlockHash !(HeaderHash b)
+\end{lstlisting}
+This extends the header hash type with a case for genesis, which is needed to
+look up the blocks that fit onto genesis. As the Volatile DB can store multiple
+forks, multiple blocks can have the same predecessor, hence a \emph{set} of
+header hashes is returned. This mapping is derived from the ``previous hash''
+stored in each block's header. Consequently, the set will only contain the
+header hashes of blocks that are currently in the Volatile DB. Hence the choice
+of the name \lstinline!filterByPredecessor! instead of the slightly misleading
+\lstinline!getSuccessors!. This operation can be used to efficiently construct a
+path between two blocks in the Volatile DB. Note that only a single access to
+the Volatile DB is needed to retrieve the function, instead of an access
+\emph{per lookup}.
+
+The final operation, \lstinline!getMaxSlotNo!, is also an STM query, returning
+the highest slot number stored in the Volatile DB so far. The
+\lstinline!MaxSlotNo! data type is defined as follows:
+\begin{lstlisting}
+data MaxSlotNo =
+    NoMaxSlotNo
+  | MaxSlotNo !SlotNo
+\end{lstlisting}
+This is used as an optimisation of fragment filtering in the block fetch
+client\todo{link}; see the \lstinline!filterWithMaxSlotNo! function for more
+information.
+
+\section{Implementation}
+\label{volatile:implementation}
+
+We will now give a high-level overview of our custom implementation of the
+Volatile DB that satisfies the requirements and the API.
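+
+Before doing so, the following hedged sketch illustrates how the query
+operations of the API above are meant to be consumed: both \lstinline!STM!
+queries are run in a single transaction, so that the successor set and the
+block information are consistent with each other. The helper name is ours and
+is not part of the API; we only assume the \lstinline!MonadSTM! interface
+(\lstinline!atomically!) used throughout the consensus layer.
+
+\begin{lstlisting}
+import qualified Data.Set as Set
+
+-- Sketch only: list the immediate successors of the given block that are
+-- currently in the Volatile DB, together with their BlockInfo. Both queries
+-- are read in one STM transaction to obtain a consistent view.
+successorsWithInfo ::
+     MonadSTM m
+  => VolatileDB m blk
+  -> HeaderHash blk
+  -> m [(HeaderHash blk, Maybe (BlockInfo blk))]
+successorsWithInfo db hash = atomically $ do
+    lookupInfo <- getBlockInfo        db
+    successors <- filterByPredecessor db
+    return [ (h, lookupInfo h)
+           | h <- Set.toList (successors (BlockHash hash))
+           ]
+\end{lstlisting}
+
+With that, we turn to the implementation itself.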
+ +\begin{itemize} +\item We append each new block, without any extra information before or after + it, to a file. When $x$ blocks have been appended to the file, the file is + closed and a new file is created. + + The smaller $x$, the more files are created.\todo{mention downsides} The + higher $x$, the longer it will take for a block to be garbage collected, as + explained in \cref{volatile:implementation:gc}. The default value for $x$ is + currently \num{1000}. + + For each file, we track the following information: + \begin{lstlisting} + data FileInfo blk = FileInfo { + maxSlotNo :: !MaxSlotNo + , hashes :: !(Set (HeaderHash blk)) + } + \end{lstlisting} + The \lstinline!maxSlotNo! field caches the highest slot number stored in the + file. To compute the global \lstinline!MaxSlotNo!, we simply take the maximum + of these \lstinline!maxSlotNo! fields. + +\item We \emph{do not flush} any writes to disk, as discussed in the + introduction of this chapter. This makes writing a block quite cheap: the + serialised block is copied to an OS buffer, which is then asynchronously + flushed in the background. + +\item Besides tracking some information per file, we also maintain two in-memory + indices to implement the \lstinline!getBlockInfo! and + \lstinline!filterByPredecessor! operations. + + The first index, called the \lstinline!ReverseIndex!\footnote{In a sense, this + is the reverse of the mapping from file to \lstinline!FileInfo!, hence the + name \lstinline!ReverseIndex!.} is defined as follows: + \begin{lstlisting} + type ReverseIndex blk = Map (HeaderHash blk) (InternalBlockInfo blk) + + data InternalBlockInfo blk = InternalBlockInfo { + ibiFile :: !FsPath + , ibiBlockOffset :: !BlockOffset + , ibiBlockSize :: !BlockSize + , ibiBlockInfo :: !(BlockInfo blk) + , ibiNestedCtxt :: !(SomeSecond (NestedCtxt Header) blk) + } + \end{lstlisting} + In addition to the \lstinline!BlockInfo! that \lstinline!getBlockInfo! should + return, we also store in which file the block is stored, the offset in the + file, the size of the block, and the nested context (see + \cref{serialisation:storage:nested-contents}). + + The second index, called the \lstinline!SuccessorsIndex! is defined as + follows: + \begin{lstlisting} + type SuccessorsIndex blk = Map (ChainHash blk) (Set (HeaderHash blk)) + \end{lstlisting} + + Both indices are updated when new blocks are added and when blocks are removed + due to garbage collection, see \cref{volatile:implementation:gc}. + + The \lstinline!Map! type used is a strict ordered map from the standard + \lstinline!containers! package. As for any data that is stored as long-lived + state, we use strict data types to avoid space leaks. We opt for an ordered + map, i.e., a sized balanced binary tree, instead of a hashing-based map to + avoid hash collisions. If an attacker manages to feed us blocks that are + hashed to the same bucket in the hash map, the performance will deteriorate. + An ordered map is not vulnerable to this type of attack. + +\item Besides the mappings we discussed above, the in-memory state of the + Volatile DB consists of the path, file handle, and offset into the file to + which new blocks will be appended. We store this state, a pure data type, in a + \emph{read-append-write lock}, which we discuss in + \cref{volatile:implementation:rawlock}. + +\item To read a block, header, or any other block component from the Volatile + DB, we obtain read access to the state (see + \cref{volatile:implementation:rawlock}) and look up the + \lstinline!InternalBlockInfo! 
corresponding to the hash in the
+  \lstinline!ReverseIndex!. The found \lstinline!InternalBlockInfo! contains the
+  file path, the block offset, and the block size, which is all that is needed
+  to read the block. To read the header, we can use the file path, the block
+  offset, the nested context (see \cref{serialisation:storage:nested-contents}),
+  the header offset, and the header size. The other block components can also be
+  derived from the \lstinline!InternalBlockInfo!.
+
+\item Note that unlike the Immutable DB, the Volatile DB does not maintain CRC32
+  checksums of the stored blocks to detect corruption. Instead, after reading a
+  block from the Volatile DB and before copying it to the Immutable DB, we
+  validate the block using the \lstinline!nodeCheckIntegrity! method, as
+  described in \cref{immutable:implementation:recovery}.
+
+\end{itemize}
+
+\subsection{Garbage collection}
+\label{volatile:implementation:gc}
+
+\todo{TODO} Sync with \cref{chaindb:gc}.
+
+As mentioned above, when a garbage collection for slot $s$ is triggered, all
+blocks with a slot less than $s$ should be removed from the Volatile DB.
+
+For simplicity and following our robust append-only approach, we do not modify
+files in-place during garbage collection. Either all the blocks in a file have a
+slot number less than $s$ and the file can be deleted atomically, or at least
+one block has a slot number greater than or equal to $s$ and we do \emph{not}
+delete the file. Checking whether a file can be garbage collected is simple and
+happens in constant time: the \lstinline!maxSlotNo! field of
+\lstinline!FileInfo! is compared against $s$.
+
+The default for blocks per file is currently \num{1000}. Let us now calculate
+what the effect of this number is on garbage collection. We will call blocks
+with a slot older than $s$ \emph{garbage}. Garbage blocks that can be deleted
+because they are in a file only containing garbage are \emph{collected garbage}.
+Garbage blocks that cannot yet be deleted because there is a non-garbage block
+in the same file are \emph{uncollected garbage}.
+
+The lower the number of blocks per file, the less uncollected garbage there will
+be, and vice versa. In the extreme case, a single block is stored per file,
+resulting in no uncollected garbage, i.e., a garbage collection rate of 100\%.
+The downside is that for each new block that is added, a new file will have to
+be created, which is less efficient than appending to an already open file. It
+will also result in lots of tiny files.
+
+The other extreme is to have no bound on the number of blocks per file, which
+will result in one single file containing all blocks. This means no garbage will
+ever be collected, i.e., a garbage collection rate of 0\%, which is of course
+not acceptable.
+
+During normal operation, roughly one block will be added every 20
+seconds.\footnote{When using the PBFT consensus protocol (\cref{bft}), exactly
+one block will be produced every 20 seconds. However, when using the Praos
+consensus protocol (\cref{praos}), on average there will be one block every 20
+seconds, but it is natural to have a fork now and then, leading to one or more
+extra blocks. For the purposes of this calculation, the difference is
+negligible.} The security parameter $k$ used for mainnet is \num{2160}. This
+means that if a linear chain of \num{2161} blocks has been added, the oldest
+block has become immutable and can be copied to the Immutable DB, after which it
+can be garbage collected.
If we assume no delay between copying and garbage
+collection, it will take $\num{1000} + \num{2160} = \num{3160}$ blocks before
+the first file containing \num{1000} blocks will be garbage collected.
+
+This means that in the above scenario, starting from a Volatile DB containing
+$k$ blocks, after every $\mathsf{blocksPerFile}$ new blocks and thus
+corresponding garbage collections, $\mathsf{blocksPerFile}$ blocks will be
+garbage collected.\todo{expand calculation}
+
+In practice, we allow for overlap by delaying the garbage collection; this has
+an impact on the effective size of the Volatile DB, which we discuss in
+\todo{link ChainDB}.
+
+\subsection{Read-Append-Write lock}
+\label{volatile:implementation:rawlock}
+
+We use a \emph{read-append-write} (RAW) lock to store the state of the Volatile
+DB. This is an extension of the more common read-write lock. A RAW lock allows
+multiple concurrent readers, at most one appender, which is allowed to run
+concurrently with the readers, and at most one writer, which has exclusive
+access to the lock.
+
+The \lstinline!getBlockComponent! operation corresponds to \emph{reading}, the
+\lstinline!putBlock! operation to \emph{appending}, and the
+\lstinline!garbageCollect! operation to \emph{writing}. Adding a new block can
+safely happen at the same time as blocks are being read. The new block will be
+appended to the current file or a new file will be started. This does not affect
+any concurrent reads of other blocks in the Volatile DB. At most one block can
+be added at a time, as blocks are appended one-by-one to the current file. To
+garbage collect the Volatile DB, we must obtain an exclusive lock on the state,
+as we might be deleting a file while trying to read from it at the same time.
+During garbage collection, we ignore the current file and will thus never try to
+delete it. This means that, strictly speaking, it would be possible to safely
+append blocks and garbage collect blocks concurrently. However, for simplicity
+(how should the concurrent changes to the indices be resolved?), we did not
+pursue this.
+
+As mentioned in \cref{volatile:implementation:gc}, it is often the case that no
+files can be garbage collected. As a (premature) optimisation, we first perform
+a cheap check of whether any files can be garbage collected at all before trying
+to obtain the more expensive exclusive lock on the state.
+
+\subsection{Recovery}
+\label{volatile:implementation:recovery}
+
+Whenever a file-system operation fails, or a file is missing or corrupted, we
+shut down the Volatile DB and consequently the whole system. When this happens,
+either the system's file system is no longer reliable (e.g., disk corruption),
+manual intervention (e.g., the disk is full) is required, or there is a bug in
+the system. In all cases, there is no point in trying to continue operating. We
+shut down the system and flag the shutdown as \emph{dirty}, triggering a full
+validation on the next start-up.
+
+When opening the Volatile DB, the previous in-memory state, including the
+indices, is reconstructed based on the on-disk files. The blocks in each file
+are read and deserialised. There are two validation modes: a standard validation
+and a full validation. The difference between the two is that during a full
+validation, the integrity of each block is verified to detect silent corruption
+using the \lstinline!nodeCheckIntegrity! method, as described in
+\cref{immutable:implementation:recovery}.
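+
+As an illustration, the following sketch shows the difference between the two
+validation modes when reading back the blocks of a single file. The names and
+the list-based representation are hypothetical simplifications, not the actual
+implementation:
+\begin{lstlisting}
+-- Sketch: which checks to perform when reopening the Volatile DB.
+data ValidationPolicy =
+    NoValidation   -- standard validation: blocks must deserialise
+  | ValidateAll    -- full validation: also check each block's integrity
+
+-- Sketch: given the parse results of one file (Nothing marks a block that
+-- failed to deserialise) and an integrity check such as nodeCheckIntegrity,
+-- keep the prefix of blocks considered valid.
+validPrefix :: ValidationPolicy -> (blk -> Bool) -> [Maybe blk] -> [blk]
+validPrefix policy checkIntegrity = go
+  where
+    go (Just blk : rest)
+      | ok blk    = blk : go rest
+    go _          = []   -- stop at the first failure (or the end of the file)
+
+    ok blk = case policy of
+      NoValidation -> True
+      ValidateAll  -> checkIntegrity blk
+\end{lstlisting}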
+ +When a block fails to deserialise or it is detected as a corrupt block when the +full validation mode is enabled, the file is truncated to the last valid block +before it. As mentioned at the start of this chapter, it is not crucial to +recover every single block. Therefore, we do not try to deserialise the blocks +after a corrupt one. From 3c7f02677211de88ec0cc2cb0bef727746c1642e Mon Sep 17 00:00:00 2001 From: Thomas Winant Date: Mon, 11 Jan 2021 17:24:16 +0100 Subject: [PATCH 3/5] report: Ledger Database --- .../report/chapters/consensus/protocol.tex | 12 +- .../chapters/consensus/serialisation.tex | 4 +- .../docs/report/chapters/storage/ledgerdb.tex | 376 +++++++++++++++++- .../docs/report/references.bib | 17 + 4 files changed, 397 insertions(+), 12 deletions(-) diff --git a/ouroboros-consensus/docs/report/chapters/consensus/protocol.tex b/ouroboros-consensus/docs/report/chapters/consensus/protocol.tex index 454dce70f50..f4b94980ed9 100644 --- a/ouroboros-consensus/docs/report/chapters/consensus/protocol.tex +++ b/ouroboros-consensus/docs/report/chapters/consensus/protocol.tex @@ -284,12 +284,12 @@ \subsection{Protocol state management} Re-applying previously-validated blocks happens when we are replaying blocks from the immutable database when initialising the in-memory ledger state -(\cref{ledgerdb:initialisation}). It is also useful during chain selection -(\cref{chainsel}): depending on the consensus protocol, we may end up switching -relatively frequently between short-lived forks; when this happens, skipping -expensive checks can improve the performance of the node. \todo{How does this -relate to the best case == worst case thing? Or to the asymptotic -attacker/defender costs?} +(\cref{ledgerdb:on-disk:initialisation}). It is also useful during chain +selection (\cref{chainsel}): depending on the consensus protocol, we may end up +switching relatively frequently between short-lived forks; when this happens, +skipping expensive checks can improve the performance of the node. \todo{How + does this relate to the best case == worst case thing? 
Or to the asymptotic + attacker/defender costs?} \subsection{Leader selection} \label{consensus:class:leaderselection} diff --git a/ouroboros-consensus/docs/report/chapters/consensus/serialisation.tex b/ouroboros-consensus/docs/report/chapters/consensus/serialisation.tex index 7f0af8845df..3025fe39d1c 100644 --- a/ouroboros-consensus/docs/report/chapters/consensus/serialisation.tex +++ b/ouroboros-consensus/docs/report/chapters/consensus/serialisation.tex @@ -47,8 +47,8 @@ \section{Serialising for storage} \begin{itemize} \item Blocks -\item The extended ledger state (\cref{storage:extledgerstate}) which is the - combination of: +\item The extended ledger state (see \cref{storage:extledgerstate} and + \cref{ledgerdb:on-disk}) which is the combination of: \begin{itemize} \item The header state (\cref{storage:headerstate}) \item The ledger state\todo{link?} diff --git a/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex b/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex index 5841c8df397..69ff5a05ec9 100644 --- a/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex +++ b/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex @@ -1,8 +1,376 @@ \chapter{Ledger Database} \label{ledgerdb} -\section{Initialisation} -\label{ledgerdb:initialisation} +The Ledger DB is responsible for the following tasks: -Describe why it is important that we store a single snapshot and then replay -ledger events to construct the ledger DB. +\begin{enumerate} +\item \textbf{Maintaining the ledger state at the tip}: Maintaining the ledger + state corresponding to the current tip in memory. When we try to extend our + chain with a new block fitting onto our tip, the block must first be validated + using the right ledger state, i.e., the ledger state corresponding to the tip. + The current ledger state is needed for various other purposes. + +\item \textbf{Maintaining the past $k$ ledger states}: As discussed in + \cref{consensus:overview:k}, we might roll back up to $k$ blocks when + switching to a more preferable fork. Consider the example below: + % + \begin{center} + \begin{tikzpicture} + \draw (0, 0) -- (50pt, 0) coordinate (I); + \draw (I) -- ++(20pt, 20pt) coordinate (C1) -- ++(20pt, 0) coordinate (C2); + \draw (I) -- ++(20pt, -20pt) coordinate (F1) -- ++(20pt, 0) coordinate (F2) -- ++(20pt, 0) coordinate (F3); + \node at (I) {$\bullet$}; + \node at (C1) {$\bullet$}; + \node at (C2) {$\bullet$}; + \node at (F1) {$\bullet$}; + \node at (F2) {$\bullet$}; + \node at (F3) {$\bullet$}; + \node at (I) [above left] {$I$}; + \node at (C1) [above] {$C_1$}; + \node at (C2) [above] {$C_2$}; + \node at (F1) [below] {$F_1$}; + \node at (F2) [below] {$F_2$}; + \node at (F3) [below] {$F_3$}; + \draw (60pt, 50pt) node {$\overbrace{\hspace{60pt}}$}; + \draw (60pt, 60pt) node[fill=white] {$k$}; + \draw [dashed] (30pt, -40pt) -- (30pt, 45pt); + \end{tikzpicture} + \end{center} + % + Our current chain's tip is $C_2$, but the fork containing blocks $F_1$, $F_2$, + and $F_3$ is more preferable. We roll back our chain to the intersection point + of the two chains, $I$, which must be not more than $k$ blocks back from our + current tip. Next, we must validate block $F_1$ using the ledger state at + block $I$, after which we can validate $F_2$ using the resulting ledger state, + and so on. 
+
+  This means that we need access to all ledger states of the past $k$ blocks,
+  i.e., the ledger states corresponding to the volatile part of the current
+  chain.\footnote{Applying a block to a ledger state is not an invertible
+  operation, so it is not possible to simply ``unapply'' $C_1$ and $C_2$ to
+  obtain $I$.}
+
+  Access to the last $k$ ledger states is not only needed for validating
+  candidate chains, but also by the following components:
+  \begin{itemize}
+  \item \textbf{Local state query server}: To query any of the past $k$ ledger
+    states (\cref{servers:lsq}).
+  \item \textbf{Chain sync client}: To validate headers of a chain that
+    intersects with any of the past $k$ blocks
+    (\cref{chainsyncclient:validation}).
+  \end{itemize}
+
+\item \textbf{Storing on disk}: To obtain a ledger state for the current tip of
+  the chain, one has to apply \emph{all blocks in the chain} one-by-one to the
+  initial ledger state. When starting up the system with an on-disk chain
+  containing millions of blocks, all of them would have to be read from disk and
+  applied. This process can take tens of minutes, depending on the storage and
+  CPU speed, and is thus too costly to perform on each startup.
+
+  For this reason, a recent snapshot of the ledger state should be periodically
+  written to disk. Upon the next startup, that snapshot can be read and used to
+  restore the current ledger state, as well as the past $k$ ledger states.
+\end{enumerate}
+
+Note that whenever we say ``ledger state'', we mean the
+\lstinline!ExtLedgerState blk! type described in \cref{storage:extledgerstate}.
+
+The above duties are divided across the following modules:
+
+\begin{itemize}
+\item \lstinline!LedgerDB.InMemory!: this module defines a pure data structure,
+  named \lstinline!LedgerDB!, to represent the last $k$ ledger states in memory.
+  Operations to validate and append blocks, to switch to forks, to look up
+  ledger states, \ldots{} are provided.
+\item \lstinline!LedgerDB.OnDisk!: this module contains the functionality to
+  write a snapshot of the \lstinline!LedgerDB! to disk and to restore a
+  \lstinline!LedgerDB! from a snapshot.
+\item \lstinline!LedgerDB.DiskPolicy!: this module contains the policy that
+  determines when a snapshot of the \lstinline!LedgerDB! is written to disk.
+\item \lstinline!ChainDB.Impl.LgrDB!: this module is part of the Chain DB, and
+  is responsible for maintaining the pure \lstinline!LedgerDB! in a
+  \lstinline!StrictTVar!.
+\end{itemize}
+
+We will now discuss the modules listed above.
+
+\section{In-memory representation}
+\label{ledgerdb:in-memory}
+
+The \lstinline!LedgerDB!, capable of representing the last $k$ ledger states, is
+an instantiation of the \lstinline!AnchoredSeq! data type. This data type is
+implemented using the \emph{finger tree} data structure~\cite{fingertrees} and
+has the following time complexities:
+
+\begin{itemize}
+\item Appending a new ledger state to the end in constant time.
+\item Rolling back to a previous ledger state in logarithmic time.
+\item Looking up a past ledger state by its point in logarithmic time.
+\end{itemize}
+
+One can think of an \lstinline!AnchoredSeq! as a \lstinline!Seq! from
+\lstinline!Data.Sequence! with a custom \emph{finger tree measure}, allowing for
+efficient lookups by point, combined with an \emph{anchor}. When fully
+\emph{saturated}, the sequence will contain $k$ ledger states. In case of a
+complete rollback of all $k$ blocks and thus ledger states, the sequence will
+become empty.
Even then, a ledger state is still needed, namely the
+one corresponding to the most recent immutable block, which cannot be rolled
+back. The ledger state at the anchor plays this role.
+
+When a new ledger state is appended to a fully saturated \lstinline!LedgerDB!,
+the ledger state at the anchor is dropped and the oldest element in the sequence
+becomes the new anchor, as it has become immutable. This maintains the invariant
+that only the last $k$ ledger states are stored, \emph{excluding} the ledger
+state at the anchor. This means that in practice, $k + 1$ ledger states will be
+kept in memory. When the \lstinline!LedgerDB! contains fewer than $k$ elements,
+new ones are appended without shifting the anchor until it is saturated.
+
+\todo{TODO} figure?
+
+The \lstinline!LedgerDB! is parameterised over the ledger state $l$.
+Conveniently, the \lstinline!LedgerDB! can implement the same abstract interface
+(described in \cref{ledger:api}) that the ledger state itself implements, i.e.,
+the \lstinline!GetTip!, \lstinline!IsLedger!, and \lstinline!ApplyBlock!
+classes. This means that in most places, wherever a ledger state can be used, it
+is also possible to wrap it in a \lstinline!LedgerDB!, causing it to
+automatically maintain a history of the last $k$ ledger states.
+
+\todo{TODO} discuss \lstinline!Ap! and \lstinline!applyBlock!? These are
+actually orthogonal to \lstinline!LedgerDB! and should be separated.
+
+
+\paragraph{Memory usage}
+
+The ledger state is a big data structure that contains, amongst other things,
+the entire UTxO. Recent measurements\footnote{Using the ledger state at the
+block with slot number \num{16976076} and hash \lstinline!af0e6cb8ead39a86!.}
+show that the heap size of an Allegra ledger state is around \num{361}~MB.
+Fortunately, storing $k = \num{2160}$ ledger states in memory does \emph{not}
+require $\num{2160} * \num{361}~\textrm{MB} = \num{779760}~\textrm{MB} =
+\num{761}~\textrm{GB}$. The ledger state is defined using standard Haskell data
+structures, e.g., \lstinline!Data.Map.Strict!, which are \emph{persistent} data
+structures. This means that when we update a ledger state by applying a block to
+it, we only need extra memory for the new and the modified data. The majority of
+the data will stay the same and will be \emph{shared} with the previous ledger
+state.
+
+The memory required for storing the last $k$ ledger states is thus proportional
+to the size of the oldest in-memory ledger state \emph{plus} the changes caused
+by the last $k$ blocks, e.g., the number of transactions in those blocks.
+Compared to the \num{361}~MB required for a single ledger state, keeping the
+last $k$ ledger states in memory requires only \num{375}~MB in total. This is
+only \num{14}~MB or 3.8\% more memory, which is a very small cost.
+
+\paragraph{Past design}
+
+In the past, before measuring this cost, we did not keep all $k$ past ledger
+states because of an ungrounded fear of the extra memory usage. The
+\lstinline!LedgerDB! data structure had a \lstinline!snapEvery! parameter,
+ranging from 1 to $k$, indicating that a snapshot, i.e., a ledger state, should
+be kept every \lstinline!snapEvery! ledger states or blocks. In practice, a
+value of 100 was used for this parameter, resulting in 21--22 ledger states in
+memory.
+
+The representation was much more complex, to account for these missing ledger
+states. More importantly, accessing a past ledger state or rewinding the
+\lstinline!LedgerDB! to a past ledger state had a very different cost model.
As +the requested ledger state might not be in memory, it would have to be +\emph{reconstructed} by reapplying blocks to an older ledger state. + +Consider the example below using \lstinline!snapEvery! = 3. $L_i$ indicate +ledger states and $\emptyset_i$ indicate skipped ledger states. $L_0$ corresponds to the +most recent ledger state, at the tip of the chain. +% +\begin{center} +\begin{tikzpicture} +\draw (0, 0) -- (8, 0); +\draw (0, 1) -- (8, 1); + +\draw (1, 0) -- (1, 1); +\draw (2, 0) -- (2, 1); +\draw (3, 0) -- (3, 1); +\draw (4, 0) -- (4, 1); +\draw (5, 0) -- (5, 1); +\draw (6, 0) -- (6, 1); +\draw (7, 0) -- (7, 1); +\draw (8, 0) -- (8, 1); + +\draw (0.5, 0.5) node {\small \ldots}; +\draw (1.5, 0.5) node {\small $L_6$}; +\draw (2.5, 0.5) node {\small $\emptyset_5$}; +\draw (3.5, 0.5) node {\small $\emptyset_4$}; +\draw (4.5, 0.5) node {\small $L_3$}; +\draw (5.5, 0.5) node {\small $\emptyset_2$}; +\draw (6.5, 0.5) node {\small $\emptyset_1$}; +\draw (7.5, 0.5) node {\small $L_0$}; + +\end{tikzpicture} +\end{center} +% +When we need access to the ledger state at position $3$, we are in luck and can +use the available $L_3$. However, when we need access to the skipped ledger +state at position $1$, we have to do the following: we look for the most recent +ledger state before $\emptyset_1$, i.e., $L_3$. Next, we need to reapply blocks $B_2$ +and $B_1$ to it, which means we have to read those from disk, deserialise them, +and apply them again. + +This means that accessing a past ledger state is not a pure operation and might +require disk access and extra computation. Consequently, switching to a fork +might require reading and revalidating blocks that remain part of the chain, in +addition to the new blocks. + +As mentioned at the start of this chapter, the chain sync client also needs +access to past ledger view (\cref{consensus:class:ledgerview}), which it can +obtain from past ledger states. A malicious peer might try to exploit it and +create a chain that intersects with our chain right \emph{before} an in-memory +ledger state snapshot. In the worst case, we have to read and reapply +\lstinline!snapEvery! - 1 = 99 blocks. This is not acceptable as the costs are +asymmetric and in the advantage of the attacker, i.e., creating and serving such +a header is much cheaper than reconstructing the required snapshot. At the time, +we solved this by requiring ledger states to store snapshots of past ledger +views. The right past ledger view could then be obtained from the current ledger +state, which was always available in memory. However, storing snapshots of +ledger views within a single ledger state is more complex, as we are in fact +storing snapshots \emph{within} snapshots. The switch to keep all $k$ past +ledger states significantly simplified the code and sped up the look-ups. + +\paragraph{Future design} + +It is important to note that in the future, this design will have to change +again. The UTxO and, consequently, the ledger state are expected to grow in size +organically. This growth will be accelerated by new features added to the +ledger, e.g., smart contracts. At some point, the ledger state will be so large +that keeping it in its entirety in memory will no longer be feasible. Moreover, +the cost of generating enough transactions to grow the current UTxO beyond the +expected memory limit might be within reach for some attackers. 
Such an attack +might cause a part of the network to be shut down because the nodes in question +are no longer able to load the ledger state in memory without running against +the memory limit. + +For these reasons, we plan to revise our design in the future, and start storing +parts of the ledger state on disk again. + +\section{On-disk} +\label{ledgerdb:on-disk} + +The \lstinline!LedgerDB.OnDisk! module provides functions to write a ledger +state to disk and read a ledger state from disk. The \lstinline!EncodeDisk! and +\lstinline!DecodeDisk! classes from \cref{serialisation:storage} are used to +(de)serialise the ledger state to or from CBOR. Because of its large size, we +read and deserialise the snapshot incrementally. + +\todo{TODO} which ledger state to take a snapshot from is determined by the +Chain DB. I.e., the background thread that copies blocks from the Volatile DB to +the Immutable DB will call the \lstinline!onDiskShouldTakeSnapshot! function, +and if it returns \lstinline!True!, a snapshot will be taken. \todo{TODO} +double-check whether we're actually taking a snapshot of the right ledger state. + +\subsection{Disk policy} +\label{ledgerdb:on-disk:disk-policy} + +The disk policy determines how many snapshots should be stored on disk and when +a new snapshot of the ledger state should be written to disk. + +\todo{TODO} worth discussing? We would just be duplicating the existing +documentation. + +\subsection{Initialisation} +\label{ledgerdb:on-disk:initialisation} + +During initialisation, the goal is to construct an initial \lstinline!LedgerDB! +that is empty except for the ledger state at the anchor, which has to correspond +to the immutable tip, i.e., the block at the tip of the Immutable DB +(\cref{immutable}). + +Ideally, we can construct the initial \lstinline!LedgerDB! from a snapshot of +the ledger state that we wrote to disk. Remember that updating a ledger state +with a block is not inversible: we can apply a block to a ledger state, but we +cannot ``unapply'' a block to a ledger state. This means the snapshot has to be +at least as old as the anchor. A snapshot matching the anchor can be used as is. +A snapshot older than the anchor can be used after reapplying the necessary +blocks. A snapshot newer than the anchor can \emph{not} be used, as we cannot +unapply blocks to get the ledger state corresponding to the anchor. This is the +reason why we only take snapshots of an immutable ledger state, i.e., of the +anchor of the \lstinline!LedgerDB! (or older). + +Constructing the initial \lstinline!LedgerDB! proceeds as follows: +\begin{enumerate} +\item If any on-disk snapshots are available, we try them from new to old. The + newer the snapshot, the fewer blocks will need to be reapplied. +\item We deserialise the snapshot. If this fails, we try the next one. +\item If the snapshot is of the ledger state corresponding to the immutable tip, + we can use the snapshot for the anchor of the \lstinline!LedgerDB! and are + done. +\item If the snapshot is newer than the immutable tip, we cannot use it and try + the next one. This situation can arise not because we took a snapshot of a + ledger state newer than the immutable tip, but because the Immutable DB was + truncated. +\item If the snapshot is older than the immutable tip, we will have to reapply + the blocks after the snapshot to obtain the ledger state at the immutable tip. 
+ If there is no (more) snapshot to try, we will have to reapply \emph{all + blocks} starting from the beginning of the chain to obtain the ledger state at + the immutable tip, i.e., the entire immutable chain. The blocks to reapply are + streamed from the Immutable DB, using an iterator + (\cref{immutable:api:iterators}). + + Note that we can \emph{reapply} these blocks, which is quicker than applying + them (see \cref{ledgerdb:lgrdb}), as the existence of a snapshot newer than + these blocks proves\footnote{Unless the on-disk database has been tampered + with, but this is not an attack we intend to protect against, as this would + mean the machine has already been compromised.} that they have been + successfully applied in the past. +\end{enumerate} +% +Reading and applying blocks is costly. Typically, very few blocks need to be +reapplied in practice. However, there is one exception: when the serialisation +format of the ledger state changes, all snapshots (written using the old +serialisation format) will fail to deserialise, and all blocks starting from +genesis will have to be reapplied. To mitigate this, the ledger state decoder is +typically written in a backwards-compatible way, i.e., it accepts both the old +and new serialisation format. + +\section{Maintained by the Chain DB} +\label{ledgerdb:lgrdb} + +The \lstinline!LedgerDB! is a pure data structure. The Chain DB (see +\cref{chaindb}) maintains the current \lstinline!LedgerDB! in a +\lstinline!StrictTVar!. The most recent element in the \lstinline!LedgerDB! is +the current ledger state. Because it is stored in a \lstinline!StrictTVar!, the +current ledger state can be read and updated in the same \lstinline!STM! +transaction as the current chain, which is also stored in a +\lstinline!StrictTVar!. + +The \lstinline!ChainDB.Impl.LgrDB!\footnote{In the past, we had similar modules +for the \lstinline!VolatileDB! and \lstinline!ImmutableDB!, i.e., +\lstinline!VolDB! and \lstinline!ImmDB!. The former were agnostic of the +\lstinline!blk! type and the latter instantiated the former with the +\lstinline!blk! type. However, in hindsight, unifying the two proved to be +simpler and was thus done. The reason why a separate \lstinline!LgrDB! still +exists is mainly because it needs to wrap the pure \lstinline!LedgerDB! in a +\lstinline!StrictTVar!.} is responsible for maintaining the current ledger +state. Besides this responsibility, it also integrates the Ledger DB with other +parts of the Chain DB. + +Moreover, it remembers which blocks have been successfully applied in the past. +When such a block needs to be validated again, e.g., because we switch again to +the same fork containing the block, we can \emph{reapply} the block instead of +\emph{applying} it (see \cref{ledger:api:ApplyBlock}). Because the block has +been successfully applied in the past, we know the block is valid, which means +we can skip some of the more expensive checks, e.g., checking the hashes, +speeding up the process of validating the block. Note that a block can only be +applied to a single ledger state, i.e., the ledger state corresponding to the +predecessor of the block. Consequently, it suffices to remember whether a block +was valid or not, there is no need to remember with respect to which ledger +state it was valid. + +To remember which blocks have been successfully applied in the past, we store +the points of the respective blocks in a set. 
Before validating a block, we look +up its point in the set, when present, we can reapply the block instead of +applying it. To stop this set from growing without bound, we garbage collect it +the same way the Volatile DB is garbage collected, see \cref{chaindb:gc}. When a +block has a slot older than the slot number of the most recent immutable block, +either the block is already immutable or it is part of a fork that we will never +consider again, as it forks off before the immutable block.\todo{slot number vs + block number} The block in question will never have to be validated again, and +so it is not necessary to remember whether we have already applied it or not. diff --git a/ouroboros-consensus/docs/report/references.bib b/ouroboros-consensus/docs/report/references.bib index adec80cb4f8..c010a1325f8 100644 --- a/ouroboros-consensus/docs/report/references.bib +++ b/ouroboros-consensus/docs/report/references.bib @@ -105,3 +105,20 @@ @misc{buterin2020combining archivePrefix={arXiv}, primaryClass={cs.CR} } + +@article{fingertrees, + author = {Hinze, Ralf and Paterson, Ross}, + title = {Finger Trees: A Simple General-Purpose Data Structure}, + year = {2006}, + issue_date = {March 2006}, + publisher = {Cambridge University Press}, + address = {USA}, + volume = {16}, + number = {2}, + issn = {0956-7968}, + doi = {10.1017/S0956796805005769}, + journal = {J. Funct. Program.}, + month = mar, + pages = {197–217}, + numpages = {21} +} From ea72d29cf0063934e096a684970ff5875914dcd4 Mon Sep 17 00:00:00 2001 From: Thomas Winant Date: Thu, 14 Jan 2021 10:32:23 +0100 Subject: [PATCH 4/5] report: add some sections to the Chain Database chapter --- .../docs/report/chapters/storage/chaindb.tex | 191 ++++++++++++++++++ .../chapters/storage/chainselection.tex | 5 +- .../docs/report/chapters/storage/ledgerdb.tex | 2 + 3 files changed, 196 insertions(+), 2 deletions(-) diff --git a/ouroboros-consensus/docs/report/chapters/storage/chaindb.tex b/ouroboros-consensus/docs/report/chapters/storage/chaindb.tex index 25af08d8592..7bbb0ab215f 100644 --- a/ouroboros-consensus/docs/report/chapters/storage/chaindb.tex +++ b/ouroboros-consensus/docs/report/chapters/storage/chaindb.tex @@ -3,6 +3,196 @@ \chapter{Chain Database} TODO\todo{TODO}: This is currently a disjoint collection of snippets. +\section{Union of the Volatile DB and the Immutable DB} +\label{chaindb:union} + +As discussed in \cref{storage:components}, the blocks in the Chain DB are +divided between the Volatile DB (\cref{volatile}) and the Immutable DB +(\cref{immutable}). Yet, it presents a unified view of the two databases. +Whereas the Immutable DB only contains the immutable chain and the Volatile DB +the volatile \emph{parts} of multiple forks, by combining the two, the Chain DB +contains multiple forks. + +\subsection{Looking up blocks} +\label{chaindb:union:lookup} + +Just like the two underlying databases the Chain DB allows looking up a +\lstinline!BlockComponent! of a block by its point. By comparing the slot number +of the point to the slot of the immutable tip, we could decide in which database +to look up the block. However, this would not be correct: the point might have a +slot older than the immutable tip, but refer to a block not in the Immutable DB, +i.e., a block on an older fork. 
More importantly, there is a potential race
+condition: between the time at which the immutable tip was retrieved and the
+time the block is retrieved from the Volatile DB, the block might have been
+copied to the Immutable DB and garbage collected from the Volatile DB, resulting
+in a false negative. Nevertheless, the overlap between the two makes this
+scenario very unlikely.
+
+For these reasons, we look up a block in the Chain DB as follows. We first look
+up the given point in the Volatile DB. If the block is not in the Volatile DB,
+we fall back to the Immutable DB. This means that if, at the same time, a block
+is copied from the Volatile DB to the Immutable DB and garbage collected from
+the Volatile DB, we will still find it in the Immutable DB. Note that failed
+lookups in the Volatile DB are cheap, as no disk access is required.
+
+\subsection{Iterators}
+\label{chaindb:union:iterators}
+
+Similar to the Immutable DB (\cref{immutable:api:iterators}), the Chain DB
+allows streaming blocks using iterators. We only support streaming blocks from
+the current chain or from a recent fork. We \emph{do not} support streaming from
+a fork that starts before the current immutable tip, as these blocks are likely
+to be garbage collected soon. Moreover, it is of no use to us to serve another
+node blocks from a fork we discarded.
+
+We might have to stream blocks from the Immutable DB, the Volatile DB, or from
+both. If the end bound is older than or equal to the immutable tip, we simply
+try to open an Immutable DB iterator with the given bounds. If the end bound is
+newer than the immutable tip, we construct a path of points (see
+\lstinline!filterByPredecessor! in \cref{volatile:api}) connecting the end bound
+to the start bound. This path is either entirely in the Volatile DB or it is
+partial because a block is missing from the Volatile DB. If the missing block is
+the tip of the Immutable DB, we will have to stream from the Immutable DB in
+addition to the Volatile DB. If the missing block is not the tip of the
+Immutable DB, we consider the range to be invalid. In other words, we allow
+streaming from both databases, but only if the immutable tip is the transition
+point between the two; it cannot be a block before the tip, as that would mean
+the fork is too old.
+
+\todo{TODO} Image?
+
+To stream blocks from the Volatile DB, we maintain the constructed path of
+points as a list in memory and look up the corresponding block (component) in
+the Volatile DB one by one.
+
+Consider the following scenario: we open a Chain DB iterator to stream the
+beginning of the current volatile chain, i.e., the blocks in the Volatile DB
+right after the immutable tip. However, before streaming the iterator's first
+block, we switch to a long fork that forks off all the way back at our immutable
+tip. If that fork is longer than the previous chain, blocks from the start of
+our chain will be copied from the Volatile DB to the Immutable DB,\todo{link}
+advancing the immutable tip. This means the blocks the iterator will stream are
+now part of a fork older than $k$. In this new situation, we would not allow
+opening an iterator with the same range as the already-opened iterator. However,
+we do allow streaming these blocks using the already opened iterator, as the
+blocks to stream are unlikely to have already been garbage collected.
+Nevertheless, it is still theoretically possible\footnote{This is unlikely, as
+there is a delay between copying and garbage collection (see
+\cref{chaindb:gc:delay}) and there are network time-outs on the block fetch
+protocol, the server side of which (see \cref{servers:blockfetch}) is the
+primary user of Chain DB iterators.} that such a block has already been garbage
+collected. For this reason, the Chain DB extends the Immutable DB's
+\lstinline!IteratorResult! type (see \cref{immutable:api:iterators}) with the
+\lstinline!IteratorBlockGCed! constructor:
+%
+\begin{lstlisting}
+data IteratorResult blk b =
+    IteratorExhausted
+  | IteratorResult b
+  | IteratorBlockGCed (RealPoint blk)
+\end{lstlisting}
+
+There is another scenario to consider: we stream the blocks from the start of
+the current volatile chain, just like in the previous scenario. However, in this
+case, we do not switch to a fork, but our chain is extended with new blocks,
+which means blocks from the start of our volatile chain are copied from the
+Volatile DB to the Immutable DB. If these blocks have been copied and garbage
+collected before the iterator is used to stream them from the Volatile DB (which
+is unlikely, as explained in the previous scenario), the iterator will
+incorrectly yield \lstinline!IteratorBlockGCed!. Instead, when a block that was
+planned to be streamed from the Volatile DB is missing, we first look in the
+Immutable DB for the block in case it has been copied there. After the block
+copied to the Immutable DB has been streamed, we continue with the remaining
+blocks to stream from the Volatile DB. It might be the case that the next block
+has also been copied and garbage collected, requiring another switch to the
+Immutable DB. In the theoretical worst case, we have to switch between the two
+databases for each block, but this is extremely unlikely to happen in practice.
+
+\subsection{Followers}
+\label{chaindb:union:followers}
+
+In addition to iterators, the Chain DB also supports \emph{followers}. Unlike an
+iterator, which is used to request a static segment of the current chain or a
+recent fork, a follower is used to follow the \emph{current chain}, either from
+the start or from a suggested more recent point. Unlike iterators, followers are
+dynamic: they will follow the chain when it grows or forks. A follower is
+pull-based, just like its primary user, the chain sync server (see
+\cref{servers:chainsync}). This avoids the need to have a growing queue of
+changes to the chain on the server side in case the client side is slower.
+
+The API of a follower is as follows:
+%
+\begin{lstlisting}
+data Follower m blk a = Follower {
+      followerInstruction         :: m (Maybe (ChainUpdate blk a))
+    , followerInstructionBlocking :: m (ChainUpdate blk a)
+    , followerForward             :: [Point blk] -> m (Maybe (Point blk))
+    , followerClose               :: m ()
+    }
+\end{lstlisting}
+%
+The \lstinline!a! parameter is the same \lstinline!a! as the one in
+\lstinline!BlockComponent! (see \cref{immutable:api:block-component}), as a
+follower for any block component \lstinline!a! can be opened.
+
+A follower always has an implicit position associated with it. The
+\lstinline!followerInstruction! operation and its blocking variant allow
+requesting the next instruction w.r.t.\ the follower's implicit position, i.e.,
+a \lstinline!ChainUpdate!:
+%
+\begin{lstlisting}
+data ChainUpdate block a =
+    AddBlock a
+  | RollBack (Point block)
+\end{lstlisting}
+%
+The \lstinline!AddBlock!
constructor indicates that to follow the current chain, +the follower should extend its chain with the given block (component). Switching +to a fork is represented by first rolling back to a certain point +(\lstinline!RollBack!), followed by at least as many new blocks +(\lstinline!AddBlock!) as blocks that have been rolled back. If we were to +represent switching to a fork using a constructor like: +% +\begin{lstlisting} + | SwitchToFork (Point block) [a] +\end{lstlisting} +% +we would need to have many blocks or block components in memory at the same +time. + +These operations are implemented as follows. In case the follower is looking at +the immutable part of the chain, an Immutable DB iterator is used and no +rollbacks will be encountered. When the follower has advanced into the volatile +part of the chain, the in-memory fragment containing the last $k$ headers is +used (see \cref{storage:inmemory}). Depending on the block component, the +corresponding block might have to be read from the Volatile DB. + +When a new chain has been adopted during chain selection (see +\cref{chainsel:addblock}), all open followers that are looking at the part of +the current chain that was rolled back are updated so that their next +instruction will be the correct \lstinline!RollBack!. By definition, followers +looking at the immutable part of the chain will be unaffected. + +By default, a follower will start from the very start of the chain, i.e., at +genesis. Accordingly, the first instruction will be an \lstinline!AddBlock! with +the very first block of the chain. As mentioned, the primary user of a follower +is the chain sync server, of which the clients in most cases already have large +parts of the chain. The \lstinline!followerForward! operation can be used in +these cases to find a more recent intersection from which the follower can +start. The client will sent a few recent points from its chain and the follower +will try to find the most recent of them that is on our current chain. This is +implemented by looking up blocks by their point in the current chain fragment +and the Immutable DB. + +Followers are affected by garbage collection similarly to how iterators are +(\cref{chaindb:union:iterators}): when the implicit position of the follower is +in the immutable part of the chain, an Immutable DB iterator with a static range +is used. Such an iterator is not aware of blocks appended to the Immutable DB +since the iterator was opened. This means that when the iterator reaches its +end, we first have to check whether more blocks have been appended to the +Immutable DB. If so, a new iterator is opened to stream these blocks. If not, we +switch over to the in-memory fragment. + \section{Block processing queue} \label{chaindb:queue} @@ -100,6 +290,7 @@ \section{Garbage collection} refer here, though, not to the vol DB chapter. \subsection{GC delay} +\label{chaindb:gc:delay} For performance reasons neither the immutable DB nor the volatile DB ever makes explicit \lstinline!fsync! calls to flush data to disk. 
This means that when the diff --git a/ouroboros-consensus/docs/report/chapters/storage/chainselection.tex b/ouroboros-consensus/docs/report/chapters/storage/chainselection.tex index 07656280089..e61504b3289 100644 --- a/ouroboros-consensus/docs/report/chapters/storage/chainselection.tex +++ b/ouroboros-consensus/docs/report/chapters/storage/chainselection.tex @@ -344,8 +344,9 @@ \section{Initialisation} \item \label{chaindb:init:imm} -Initialise the immutable database, determine its tip $I$, and ask the -ledger DB for the corresponding ledger state $L$. +Initialise the immutable database, determine its tip $I$, and ask the ledger DB +for the corresponding ledger state $L$ (see +\cref{ledgerdb:on-disk:initialisation}). \item Compute the set of candidates anchored at the immutable database's tip \label{chaindb:init:compute} diff --git a/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex b/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex index 69ff5a05ec9..e1f47e176d5 100644 --- a/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex +++ b/ouroboros-consensus/docs/report/chapters/storage/ledgerdb.tex @@ -333,6 +333,8 @@ \subsection{Initialisation} \section{Maintained by the Chain DB} \label{ledgerdb:lgrdb} +\todo{TODO} move to Chain DB chapter? + The \lstinline!LedgerDB! is a pure data structure. The Chain DB (see \cref{chaindb}) maintains the current \lstinline!LedgerDB! in a \lstinline!StrictTVar!. The most recent element in the \lstinline!LedgerDB! is From f2a027da5a8f787d09d6b9be95466ca83f5cfc02 Mon Sep 17 00:00:00 2001 From: Thomas Winant Date: Fri, 15 Jan 2021 12:08:37 +0100 Subject: [PATCH 5/5] report: Start the Mempool chapter --- .../docs/report/chapters/storage/mempool.tex | 79 ++++++++++++++++++- 1 file changed, 78 insertions(+), 1 deletion(-) diff --git a/ouroboros-consensus/docs/report/chapters/storage/mempool.tex b/ouroboros-consensus/docs/report/chapters/storage/mempool.tex index 5feadb7221e..af4c2d7353d 100644 --- a/ouroboros-consensus/docs/report/chapters/storage/mempool.tex +++ b/ouroboros-consensus/docs/report/chapters/storage/mempool.tex @@ -1,7 +1,84 @@ \chapter{Mempool} \label{mempool} +Whenever a block producing node is the leader of a slot +(\cref{consensus:class:leaderselection}), it gets the chance to mint a block. +For the Cardano blockchain to be useful, the minted block in the blockchain +needs to contain \emph{transactions}. The \emph{mempool} is where we buffer +transactions until we are able to mint a block containing those transactions. + +Transactions created by the user using the wallet enter the Mempool via the +local transaction submission protocol (see \cref{servers:txsubmission}). As not +every user will be running a block producing node or stakepool, these +transactions should be broadcast over the network so that other, block +producing, nodes can include these transactions in their next block, in order +for the transactions to ends up in the blockchain as soon as possible. This is +accomplished by the node-to-node transaction submission protocol\todo{link?}, +which exchanges the transactions between the mempool of the nodes in the +network. + +Naturally, we only want to put transactions in a block that are valid +w.r.t.\ the ledger state against which the block will be applied. Putting +invalid transactions in a block will result in an invalid block, which will be +rejected by other nodes. Consequently, the block along with its rewards is lost. 
+Even for a node that is not a block producer, there is no point in flooding the
+network with invalid transactions. For these reasons, we validate the
+transactions in the mempool w.r.t.\ the current ledger state and remove
+transactions that are no longer valid.
+
 \section{Consistency}
 \label{mempool:consistency}
 
-Discuss that we insist on \emph{linear consistency}, and why.
+Transactions themselves affect the ledger state; consequently, the order in
+which transactions are applied matters. For example, two transactions might try
+to consume the same UTxO entries. The first of the two transactions to be
+applied will be valid; the second will be invalid. Transactions can also depend
+on each other, hence the transactions that are depended upon should be applied
+first. Consequently, the mempool needs to decide how transactions are ordered.
+
+We chose a simple approach: we maintain a list of transactions, ordered by the
+time at which they arrived. This has the following advantages:
+
+\begin{itemize}
+\item It's simple to implement and it's efficient. In particular, no search for
+  a valid subset is ever required.
+\item When minting a block, we can simply take the longest possible prefix of
+  transactions that fits in a block.
+\item It supports wallets that submit dependent transactions (where a later
+  transaction depends on outputs from earlier ones).
+\end{itemize}
+
+We call this \emph{linear consistency}: transactions are ordered linearly and
+each transaction is valid w.r.t.\ the transactions before it and the ledger
+state against which the mempool was validated.
+
+The mempool has a background thread that watches the current ledger state
+exposed by the Chain DB (\cref{chaindb}). Whenever it changes, the mempool will
+revalidate its contents w.r.t.\ that ledger state. This ensures that we no
+longer keep broadcasting invalid transactions and that the next time we get to
+mint a block, we do not have to validate a bunch of invalid transactions,
+costing us crucial time.
+
+\section{Caching}
+
+The mempool caches the ledger state resulting from applying all the transactions
+in the mempool to the current ledger state. This makes it quick and easy to
+validate incoming transactions: they can simply be validated against the cached
+ledger state without having to recompute it for each transaction. As discussed
+in \cref{ledgerdb:in-memory}, the memory cost of this is minimal. When the
+incoming transaction is valid w.r.t.\ the cached ledger state, we append the
+transaction to the mempool and cache the resulting ledger state.
+
+\todo{TODO} talk about the slot for which we produce
+
+\section{TxSeq}
+
+\todo{TODO} efficiently get the first $x$ transactions that fit into the given size
+
+\todo{TODO} discuss \lstinline!TicketNo!
+
+\section{Capacity}
+
+\todo{TODO} discuss dynamic capacity, based on twice the max block (body?) size in the protocol parameters in the ledger
+\todo{TODO} add transactions one-by-one for better concurrency and less revalidation in case of retries
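+
+To make the intended \lstinline!TxSeq! operation a bit more concrete, the
+following sketch shows how the longest prefix of transactions fitting within a
+given capacity could be selected. The name and the plain-list representation
+are hypothetical; the real \lstinline!TxSeq! is expected to support this more
+efficiently:
+\begin{lstlisting}
+-- Sketch: split a list of transactions into the longest prefix whose
+-- cumulative size fits within the given capacity, and the remainder.
+takeTxsUpToCapacity :: (tx -> Integer)  -- assumed size measure per transaction
+                    -> Integer          -- capacity, e.g. the max block body size
+                    -> [tx]
+                    -> ([tx], [tx])
+takeTxsUpToCapacity txSize capacity = go 0 []
+  where
+    go _   acc []         = (reverse acc, [])
+    go cur acc (tx : txs)
+      | cur' <= capacity  = go cur' (tx : acc) txs
+      | otherwise         = (reverse acc, tx : txs)
+      where
+        cur' = cur + txSize tx
+\end{lstlisting}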