chore(buffer): tidy up some of the module level docs for disk_v2 #17093

Merged · merged 5 commits · Apr 11, 2023
Changes from 1 commit
137 changes: 83 additions & 54 deletions lib/vector-buffers/src/variants/disk_v2/mod.rs
@@ -1,14 +1,15 @@
//! # Disk buffer v2.
//!
//! This disk buffer implementation focuses on a simple on-disk format with minimal reader/writer
//! coordination, and no exotic I/O techniques, such that the buffer is easy to write to and read
//! from, and can provide simple, but reliable, recovery mechanisms when errors or corruption are
//! encountered.
//!
//! ## Design constraints
//!
//! These constraints, or more often, invariants, are the groundwork for ensuring that the design
//! can stay simple and understandable:
//!
//! - data files do not exceed 128MB
//! - no more than 65,536 data files can exist at any given time
//! - buffer can grow to a maximum of ~8TB in total size (65k files * 128MB)
@@ -18,42 +19,93 @@
//! - endianness of the files is based on the host system (we don't support loading the buffer files
//! on a system with different endianness)
//!
//! ## High-level design
//!
//! ### Records
//!
//! A record is a length-prefixed payload, where an arbitrary number of bytes are contained,
//! alongside a monotonically increasing ID, and protected by a CRC32C checksum. Since a record
//! simply stores opaque bytes, one or more events can be stored per record.
//!
//! The writer assigns record IDs based on the number of events written to a record, such that a
//! record ID of N can be determined to contain M-N events, where M is the record ID of the next
//! record.
//!
//! #### On-disk format
//!
//! Records are represented by the following pseudo-structure:
//!
//!   record:
//!     `record_len`: uint64
//!     `checksum`: uint32 (CRC32C of `record_id` + `payload`)
//!     `record_id`: uint64
//!     `payload`: uint8[record_len]
//!
//! We say "pseudo-structure" as a helper serialization library, `rkyv`, is used to handle
//! serialization, and zero-copy deserialization, of records. This effectively adds some amount of
//! padding to record fields, due to the need to structure record field data in a way that makes it
//! transparent to access during zero-copy deserialization, when the raw buffer of a record that
//! was read can be accessed as if it were a native Rust type/value.
//!
//! While this padding/overhead is small, and fixed, we do not quantify it here as it can
//! potentially change based on the payload that a record contains. The only safe way to access the
//! records in a disk buffer is through the reader/writer interface in this module.
//!
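As a rough illustration of the framing described above (this is not the module's actual code, and it ignores the `rkyv` padding just discussed), the length prefix and the CRC32C checksum over `record_id` + `payload` could be computed along these lines, using the `crc32c` crate:

```rust
// Illustrative sketch only, not this module's real types: frames a payload
// per the pseudo-structure above, ignoring `rkyv` padding/alignment.
// `crc32c` / `crc32c_append` are from the `crc32c` crate.
struct RecordFrame {
    record_len: u64,
    checksum: u32, // CRC32C of `record_id` + `payload`
    record_id: u64,
    payload: Vec<u8>,
}

fn frame_record(record_id: u64, payload: Vec<u8>) -> RecordFrame {
    // The checksum covers the record ID and the payload; integer bytes use
    // host (native) endianness, matching the design constraints above.
    let mut checksum = crc32c::crc32c(&record_id.to_ne_bytes());
    checksum = crc32c::crc32c_append(checksum, &payload);

    RecordFrame {
        record_len: payload.len() as u64,
        checksum,
        record_id,
        payload,
    }
}
```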
//! ### Data files
//!
//! Data files contain the buffered records and nothing else. Records are written
//! sequentially/contiguously, and are not padded out to meet a minimum block/write size, except for
//! internal padding requirements of the serialization library used.
//!
//! Data files have a maximum size, configured statically within a given Vector binary, which can
//! never be exceeded: if a write would cause a data file to grow past the maximum file size, it
//! must be written to the next data file.
//!
//! At most 65,536 data files can exist at any given time, since the file ID included in each data
//! file's name is represented by a 16-bit unsigned integer.
//!
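For illustration only, here is how a 16-bit file ID bounds the data file count; the file-name scheme below is a made-up placeholder, not necessarily the module's real convention:

```rust
use std::path::{Path, PathBuf};

// Hypothetical example: the exact naming convention is an assumption here.
// The point is that a u16 file ID caps the addressable data files at 65,536.
fn data_file_path(data_dir: &Path, file_id: u16) -> PathBuf {
    data_dir.join(format!("buffer-data-{file_id}.dat"))
}

// File IDs are handed out monotonically and wrap at the maximum, rather than
// reusing the lowest free ID, as described in the ledger notes below.
fn next_file_id(current: u16) -> u16 {
    current.wrapping_add(1) // 65_535 rolls over to 0
}
```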
//! ### Ledger
//!
//! The ledger is a small file which tracks two important items for both the reader and writer:
//! which data file they're currently reading or writing to, and what record ID they left off on.
//!
//! The ledger is read during buffer initialization to determine where the reader should pick up
//! reading from, but is also used to attempt to detect where the writer left off, and whether
//! records are missing from the current writer data file, by comparing what the writer believes it
//! did (i.e. which bytes it wrote and flushed to disk) against the actual data in the current
//! writer data file.
//!
//! The ledger is a memory-mapped file that is updated atomically in terms of its fields, but is not
//! updated atomically in terms of reader/writer activity.
//!
//! #### On-disk format
//!
//! Like records, the ledger file consists of a simplified structure that is optimized for being shared
//! via a memory-mapped file interface between the reader and writer.
//!
//!   buffer.db:
//!     `writer_next_record_id`: uint64
//!     `writer_current_data_file_id`: uint16
//!     `reader_current_data_file_id`: uint16
//!     `reader_last_record_id`: uint64
//!
//! As the disk buffer structure is meant to emulate a ring buffer, most of the bookkeeping revolves
//! around the writer and reader being able to quickly figure out where they left off. Record and
//! data file IDs are simply rolled over when they reach the maximum of their data type, and are
//! incremented monotonically as new data files are created, rather than trying to always allocate
//! from the lowest available ID.
//!
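A loose sketch of the ledger state follows; the real ledger is serialized with `rkyv` into a memory-mapped file, and the atomics here merely illustrate the field-by-field (but not group-wise) atomicity described above:

```rust
use std::sync::atomic::{AtomicU16, AtomicU64, Ordering};

// Sketch only: mirrors the pseudo-structure above. In the actual buffer this
// state lives behind a memory map and is serialized via `rkyv`; plain atomics
// are used here purely to illustrate the consistency model.
#[repr(C)]
struct LedgerState {
    writer_next_record_id: AtomicU64,
    writer_current_data_file_id: AtomicU16,
    reader_current_data_file_id: AtomicU16,
    reader_last_record_id: AtomicU64,
}

impl LedgerState {
    fn record_writer_progress(&self, next_record_id: u64) {
        // Each field updates atomically on its own, but there is no
        // transaction spanning multiple fields: a reader may observe this
        // store before or after any other field's update.
        self.writer_next_record_id.store(next_record_id, Ordering::Release);
    }
}
```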
//! ### Buffer operation
//!
//! As mentioned above, records are added to a data file sequentially, and contiguously, with no
//! gaps or data alignment adjustments, excluding the padding/alignment used by `rkyv` itself to
//! allow for zero-copy deserialization. This continues until adding another would exceed the
//! configured data file size limit. When this occurs, the current data file is flushed and
//! synchronized to disk, and a new data file will be opened.
//!
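The rollover decision can be pictured as a simple size check; the constant and function names below are placeholders rather than this module's identifiers:

```rust
// Conceptual sketch of the data file rollover decision; the names are
// placeholders, with 128MB matching the documented per-file maximum.
const MAX_DATA_FILE_SIZE: u64 = 128 * 1024 * 1024;

// Records never span data files: if a write would push the current file past
// the cap, the file is flushed/synced and a new data file is opened instead.
fn fits_in_current_data_file(current_file_len: u64, record_len: u64) -> bool {
    current_file_len.saturating_add(record_len) <= MAX_DATA_FILE_SIZE
}
```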
//! If the number of data files on disk exceeds the maximum (65,536), or if the total data file size
//! limit is exceeded, the writer will wait until enough space has been freed such that the record
//! can be written. As data files are only deleted after being read entirely, this means that space
//! is recovered in increments of the target data file size, which is 128MB. Thus, the minimum size
@@ -77,10 +129,10 @@
//! When all records from a data file have been fully acknowledged, the data file is scheduled for
//! deletion. We only delete entire data files, rather than truncating them piecemeal, which reduces
//! the I/O burden of the buffer. This does mean, however, that a data file will stick around until
//! it is entirely processed and acknowledged. We compensate for this fact in the buffer
//! configuration by adjusting the logical buffer size based on when records are acknowledged, so
//! that the writer can make progress as records are acknowledged, even if the buffer is close to,
//! or at the maximum buffer size limit.
//!
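One way to picture the accounting described above, with hypothetical names: the logical size is decremented as records are acknowledged, even though the on-disk bytes linger until the whole data file is deleted:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical sketch: the logical buffer size shrinks on acknowledgement,
// which is what lets the writer keep making progress while the buffer is at,
// or near, its maximum size.
struct BufferUsage {
    logical_size_bytes: AtomicU64,
}

impl BufferUsage {
    fn on_records_acknowledged(&self, bytes_acked: u64) {
        self.logical_size_bytes.fetch_sub(bytes_acked, Ordering::AcqRel);
    }

    fn can_accept(&self, record_len: u64, max_buffer_size: u64) -> bool {
        self.logical_size_bytes.load(Ordering::Acquire) + record_len <= max_buffer_size
    }
}
```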
//! ## Record ID generation, and its relation to events
//!
@@ -111,29 +163,6 @@
//! We make sure to track enough information such that when we encounter a corrupted record, or if
//! we skip records due to missing data, we can figure out how many events we've dropped or lost,
//! and handle the necessary adjustments to the buffer accounting.
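Since record IDs advance by the number of events in each record, the event count of a record falls out of the delta between adjacent IDs; a small sketch of that arithmetic (hypothetical helper, wrapping-aware):

```rust
// Sketch of the ID arithmetic described earlier: a record with ID `n`,
// followed by a record with ID `m`, contains `m - n` events. `wrapping_sub`
// keeps the count correct even across the (astronomically rare) u64 rollover.
fn events_in_record(record_id: u64, next_record_id: u64) -> u64 {
    next_record_id.wrapping_sub(record_id)
}

// e.g. a record with ID 10 followed by one with ID 14 holds 4 events.
```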

use core::fmt;
use std::{