chore(buffer): tidy up some of the module level docs for disk_v2 (#…
tobz authored Apr 11, 2023
1 parent a791595 commit edaa612
Showing 2 changed files with 96 additions and 61 deletions.
1 change: 1 addition & 0 deletions .github/actions/spelling/expect.txt
@@ -866,6 +866,7 @@ norg
norgle
norknoog
norknork
no_run
nosync
notext
notls
156 changes: 95 additions & 61 deletions lib/vector-buffers/src/variants/disk_v2/mod.rs
@@ -1,14 +1,15 @@
//! # Disk Buffer v2: Sequential File I/O Boogaloo.
//! # Disk buffer v2.
//!
//! This disk buffer implementation is a departure from the LevelDB-based disk buffer
//! implementation, referred to internally as `disk` or `disk_v1`. It focuses on avoiding external
//! C/C++ dependencies, as well as optimizing itself for the job at hand to provide more consistent
//! performance in both throughput and latency.
//! This disk buffer implementation focuses on a simple on-disk format with minimal
//! reader/writer coordination, and no exotic I/O techniques, such that the buffer is easy to write
//! to and read from and can provide simple, but reliable, recovery mechanisms when errors or
//! corruption are encountered.
//!
//! ## Design constraints
//!
//! These constraints, or more often, invariants, are the groundwork for ensuring that the design
//! can stay simple and understandable:
//!
//! - data files do not exceed 128MB
//! - no more than 65,536 data files can exist at any given time
//! - buffer can grow to a maximum of ~8TB in total size (65k files * 128MB)
@@ -18,42 +19,96 @@
//! - endianness of the files is based on the host system (we don't support loading the buffer files
//! on a system with different endianness)
//!
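As an illustrative aside (not code from this commit), the size limits listed above can be checked with a few constants. The constant names are hypothetical, and the sketch assumes "128MB" means 128 * 2^20 bytes:

```rust
/// Maximum size of one data file (assumption: 128MB is read as 128 MiB).
pub const MAX_DATA_FILE_SIZE: u64 = 128 * 1024 * 1024;

/// Maximum number of data files on disk at once (file IDs are 16-bit).
pub const MAX_DATA_FILES: u64 = 65_536;

/// Upper bound on total buffer size: 2^16 files * 2^27 bytes = 2^43 bytes.
pub const MAX_TOTAL_BUFFER_SIZE: u64 = MAX_DATA_FILES * MAX_DATA_FILE_SIZE;
```

Under that assumption the product is exactly 2^43 bytes (8 TiB), which is where the "~8TB" figure comes from.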
//! ## On-disk layout
//! ## High-level design
//!
//! ### Records
//!
//! A record is a length-prefixed payload: an arbitrary number of bytes stored alongside a
//! monotonically increasing ID and protected by a CRC32C checksum. Since a record simply stores
//! opaque bytes, one or more events can be stored per record.
//!
//! The writer assigns record IDs based on the number of events written to a record, such that the
//! record with ID N contains M-N events, where M is the record ID of the next record.
//!
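The M-N relationship described above can be sketched as a one-line helper. This is an illustration only: the function name is hypothetical, and the wrapping subtraction assumes record IDs roll over at the maximum of `u64`, as the ledger notes in this module describe:

```rust
/// Number of events held by the record with ID `record_id`, given the ID of
/// the record that was written immediately after it.
pub fn events_in_record(record_id: u64, next_record_id: u64) -> u64 {
    // wrapping_sub keeps the count correct even if IDs rolled over u64::MAX
    // between the two records.
    next_record_id.wrapping_sub(record_id)
}
```

For example, a record with ID 10 followed by a record with ID 15 holds five events.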
//! #### On-disk format
//!
//! Records are represented by the following pseudo-structure:
//!
//! ```text
//! record:
//! record_len: uint64
//! checksum: uint32(CRC32C of record_id + payload)
//! record_id: uint64
//! payload: uint8[record_len]
//! ```
//!
//! We say "pseudo-structure" because a serialization library, [`rkyv`][rkyv], handles
//! serialization and zero-copy deserialization of records. This adds some padding to record
//! fields: the field data must be laid out so that the raw buffer of a record that was read can be
//! accessed directly, as if it were a native Rust type/value.
//!
//! While this padding/overhead is small, and fixed, we do not quantify it here as it can
//! potentially change based on the payload that a record contains. The only safe way to access the
//! records in a disk buffer should be through the reader/writer interface in this module.
//!
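To make the checksum field concrete, here is a minimal sketch (not code from this commit) of a CRC32C computed over the record ID followed by the payload. The function names are hypothetical. Feeding the ID as native-endian bytes is an assumption based on the host-endianness constraint above; the real layout is determined by `rkyv` and may differ:

```rust
/// Bitwise CRC32C (Castagnoli), using the reflected polynomial 0x82F63B78.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Checksum over the record ID bytes followed by the payload bytes.
/// Assumption: the ID is fed in native-endian byte order.
pub fn record_checksum(record_id: u64, payload: &[u8]) -> u32 {
    let mut buf = Vec::with_capacity(8 + payload.len());
    buf.extend_from_slice(&record_id.to_ne_bytes());
    buf.extend_from_slice(payload);
    crc32c(&buf)
}

/// Returns true when the stored checksum matches the record contents.
pub fn verify_record(record_id: u64, payload: &[u8], stored: u32) -> bool {
    record_checksum(record_id, payload) == stored
}
```

Because the ID is covered by the checksum, a record whose payload is intact but whose ID was corrupted still fails verification. A table-driven or hardware-accelerated CRC32C would be the practical choice; the bitwise loop is used here only to keep the sketch dependency-free.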
//! ### Data files
//!
//! Data files contain the buffered records and nothing else. Records are written
//! sequentially/contiguously, and are not padded out to meet a minimum block/write size, except for
//! internal padding requirements of the serialization library used.
//!
//! Data files have a maximum size, configured statically within a given Vector binary, which can
//! never be exceeded: if a write would cause a data file to grow past the maximum file size, it
//! must be written to the next data file.
//!
//! A maximum of 65,536 data files can exist at any given time, because the file ID included in
//! the data file name is represented by a 16-bit unsigned integer.
//!
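The two data file rules above (never exceed the maximum file size, and 16-bit file IDs) can be sketched as follows. This is an illustration with hypothetical function names, assuming file IDs wrap back to zero after 65,535:

```rust
/// Returns true when appending `record_len` bytes would push the current data
/// file past `max_file_size`, meaning the record belongs in the next file.
pub fn must_roll_to_next_file(current_file_size: u64, record_len: u64, max_file_size: u64) -> bool {
    // saturating_add avoids overflow for pathological lengths.
    current_file_size.saturating_add(record_len) > max_file_size
}

/// ID of the data file that follows `current`, wrapping at the u16 maximum.
pub fn next_data_file_id(current: u16) -> u16 {
    current.wrapping_add(1)
}
```

A write that lands exactly on the limit is allowed in this sketch, since the limit is only "exceeded" once the file would grow past it.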
//! ### Ledger
//!
//! The ledger is a small file which tracks two important items for both the reader and writer:
//! which data file they're currently reading from or writing to, and what record ID they left
//! off on.
//!
//! At a high level, records that are written end up in one of many underlying data files, while the
//! ledger file -- number of records, writer and reader positions, etc -- is stored in a separate
//! file. Data files function primarily with a "last process who touched it" ownership model: the
//! writer always creates new files, and the reader deletes files when they have been fully read.
//! The ledger is read during buffer initialization to determine where a reader should pick up
//! reading from. It is also used to attempt to detect where a writer left off, and whether
//! records are missing from the current writer data file, by comparing what the writer believes
//! it did (as in write/flush bytes to disk) with the actual data in the current writer data file.
//!
//! ### Record structure
//! The ledger is a memory-mapped file that is updated atomically in terms of its fields, but is not
//! updated atomically in terms of reader/writer activity.
//!
//! Records are packed together with a relatively simple pseudo-structure:
//! #### On-disk format
//!
//! record:
//! `record_len`: uint64
//! `checksum`: uint32(CRC32C of `record_id` + `payload`)
//! `record_id`: uint64
//! `payload`: uint8[]
//! Like records, the ledger file consists of a simplified structure that is optimized for being shared
//! via a memory-mapped file interface between the reader and writer.
//!
//! We say pseudo-structure because we serialize these records to disk using `rkyv`, a zero-copy
//! deserialization library which focuses on the speed of reading values by writing them to storage
//! in a way that allows them to be "deserialized" without any copies, which means the layout of
//! struct fields matches their in-memory representation rather than the intuitive, packed structure
//! we might expect to see if we wrote only the bytes needed for each field, without any extra
//! padding or alignment.
//! ```text
//! buffer.db:
//! writer_next_record_id: uint64
//! writer_current_data_file_id: uint16
//! reader_current_data_file_id: uint16
//! reader_last_record_id: uint64
//! ```
//!
//! This represents a small amount of extra space overhead per record, but is beneficial to us as we
//! avoid a more formal deserialization step, with scratch buffers and memory copies.
//! As the disk buffer structure is meant to emulate a ring buffer, most of the bookkeeping revolves
//! around the writer and reader being able to quickly figure out where they left off. Record and
//! data file IDs are simply rolled over when they reach the maximum of their data type, and are
//! incremented monotonically as new data files are created, rather than trying to always allocate
//! from the lowest available ID.
//!
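The ledger fields and their per-field atomicity can be illustrated with an in-memory stand-in. This is a sketch, not the module's implementation: the type and method names are hypothetical, and plain atomics replace the memory-mapped file:

```rust
use std::sync::atomic::{AtomicU16, AtomicU64, Ordering};

/// In-memory stand-in for the four ledger fields (hypothetical, no mmap).
#[derive(Default)]
pub struct LedgerSketch {
    writer_next_record_id: AtomicU64,
    writer_current_data_file_id: AtomicU16,
    reader_current_data_file_id: AtomicU16,
    reader_last_record_id: AtomicU64,
}

impl LedgerSketch {
    /// Claims IDs for a record holding `event_count` events and returns the
    /// record's own ID. `fetch_add` wraps on overflow, matching the rollover rule.
    pub fn claim_record_id(&self, event_count: u64) -> u64 {
        self.writer_next_record_id.fetch_add(event_count, Ordering::AcqRel)
    }

    /// Moves the writer to the next data file and returns the new file ID.
    pub fn advance_writer_file(&self) -> u16 {
        self.writer_current_data_file_id
            .fetch_add(1, Ordering::AcqRel)
            .wrapping_add(1)
    }

    /// Moves the reader to the next data file and returns the new file ID.
    pub fn advance_reader_file(&self) -> u16 {
        self.reader_current_data_file_id
            .fetch_add(1, Ordering::AcqRel)
            .wrapping_add(1)
    }

    /// Notes the ID of the last record the reader finished with.
    pub fn set_reader_last_record_id(&self, record_id: u64) {
        self.reader_last_record_id.store(record_id, Ordering::Release);
    }

    /// ID of the last record the reader finished with.
    pub fn reader_last_record_id(&self) -> u64 {
        self.reader_last_record_id.load(Ordering::Acquire)
    }
}
```

Each field is updated atomically on its own, but nothing ties two fields together. A crash between advancing a file ID and updating a record ID leaves the ledger internally inconsistent, which is why the ledger has to be reconciled against the data files at startup.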
//! ## Buffer operation
//!
//! ## Writing records
//! ### Writing records
//!
//! Records are added to a data file sequentially, and contiguously, with no gaps or data alignment
//! adjustments, excluding the padding/alignment used by `rkyv` itself to allow for zero-copy
//! deserialization. This continues until adding another would exceed the configured data file size
//! limit. When this occurs, the current data file is flushed and synchronized to disk, and a new
//! data file will be open.
//! As mentioned above, records are added to a data file sequentially, and contiguously, with no
//! gaps or data alignment adjustments, excluding the padding/alignment used by `rkyv` itself to
//! allow for zero-copy deserialization. This continues until adding another would exceed the
//! configured data file size limit. When this occurs, the current data file is flushed and
//! synchronized to disk, and a new data file will be opened.
//!
//! If the number of data files open exceeds the maximum (65,536), or if the total data file size
//! If the number of data files on disk exceeds the maximum (65,536), or if the total data file size
//! limit is exceeded, the writer will wait until enough space has been freed such that the record
//! can be written. As data files are only deleted after being read entirely, this means that space
//! is recovered in increments of the target data file size, which is 128MB. Thus, the minimum size
@@ -62,13 +117,13 @@
//! wrap around at 65,536 (2^16), the maximum data file size in total for a given buffer is ~8TB
//! (65k files * 128MB).
//!
//! ## Reading records
//! ### Reading records
//!
//! Due to the on-disk layout, reading records is an incredibly straightforward process: we open a
//! file, read it until there's no more data and we know the writer is done writing to the file, and
//! then we open the next one, and repeat the process.
//!
//! ## Deleting acknowledged records
//! ### Deleting acknowledged records
//!
//! As the reader emits records, we cannot yet consider them fully processed until they are
//! acknowledged. The acknowledgement process is tied into the normal acknowledgement machinery, and
@@ -77,12 +132,12 @@
//! When all records from a data file have been fully acknowledged, the data file is scheduled for
//! deletion. We only delete entire data files, rather than truncating them piecemeal, which reduces
//! the I/O burden of the buffer. This does mean, however, that a data file will stick around until
//! it's entirely processed. We compensate for this fact in the buffer configuration by adjusting
//! the logical buffer size based on when records are acknowledged, so that the writer can make
//! progress as records are acknowledged, even if the buffer is close to, or at the maximum buffer
//! size limit.
//! it is entirely processed and acknowledged. We compensate for this fact in the buffer
//! configuration by adjusting the logical buffer size based on when records are acknowledged, so
//! that the writer can make progress as records are acknowledged, even if the buffer is close to,
//! or at the maximum buffer size limit.
//!
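The whole-file deletion rule can be sketched with a small tracker. This is a simplified illustration, not the module's acknowledgement machinery: the type names are hypothetical, and it assumes a file becomes deletable once it has been fully read and every record emitted from it has been acknowledged:

```rust
use std::collections::HashMap;

/// Per-file progress: records emitted by the reader vs. records acknowledged.
#[derive(Default)]
struct FileProgress {
    emitted: u64,
    acked: u64,
    fully_read: bool,
}

/// Tracks when a data file may be deleted (hypothetical helper type).
#[derive(Default)]
pub struct AckTracker {
    files: HashMap<u16, FileProgress>,
}

impl AckTracker {
    /// Notes that the reader emitted one record from `file_id`.
    pub fn record_emitted(&mut self, file_id: u16) {
        self.files.entry(file_id).or_default().emitted += 1;
    }

    /// Notes that the reader reached the end of `file_id`.
    /// Returns true when the file can be deleted now.
    pub fn file_fully_read(&mut self, file_id: u16) -> bool {
        self.files.entry(file_id).or_default().fully_read = true;
        self.take_if_deletable(file_id)
    }

    /// Notes one acknowledgement for `file_id`.
    /// Returns true when the file can be deleted now.
    pub fn record_acked(&mut self, file_id: u16) -> bool {
        if let Some(progress) = self.files.get_mut(&file_id) {
            progress.acked += 1;
        }
        self.take_if_deletable(file_id)
    }

    /// Removes the entry and returns true once the file is fully read and acked.
    fn take_if_deletable(&mut self, file_id: u16) -> bool {
        let deletable = self
            .files
            .get(&file_id)
            .map_or(false, |p| p.fully_read && p.acked == p.emitted);
        if deletable {
            self.files.remove(&file_id);
        }
        deletable
    }
}
```

Deleting only whole files means a single unacknowledged record keeps its entire data file on disk, which is the cost the logical buffer size adjustment described above compensates for.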
//! ## Record ID generation, and its relation to events
//! ### Record ID generation, and its relation to events
//!
//! While the buffer talks a lot about writing "records", records are ostensibly a single event, or
//! collection of events. We manage the organization and grouping of events at a higher level
@@ -112,28 +167,7 @@
//! we skip records due to missing data, we can figure out how many events we've dropped or lost,
//! and handle the necessary adjustments to the buffer accounting.
//!
//! ## Ledger structure
//!
//! Likewise, the ledger file consists of a simplified structure that is optimized for being shared
//! via a memory-mapped file interface between the writer and reader. Like the record structure, the
//! below is a pseudo-structure as we use `rkyv` for the ledger, and so the on-disk layout will be
//! slightly different:
//!
//! buffer.db:
//! writer next record ID: uint64
//! writer current data file ID: uint16
//! reader current data file ID: uint16
//! reader last record ID: uint64
//!
//! As the disk buffer structure is meant to emulate a ring buffer, most of the bookkeeping revolves
//! around the writer and reader being able to quickly figure out where they left off. Record and
//! data file IDs are simply rolled over when they reach the maximum of their data type, and are
//! incremented monotonically as new data files are created, rather than trying to always allocate
//! from the lowest available ID.
//!
//! Additionally, record IDs are allocated in the same way: monotonic, sequential, and will wrap
//! when they reach the maximum value for the data type. For record IDs, however, this would mean
//! reaching 2^64, which will take a really, really, really long time.
//! [rkyv]: https://docs.rs/rkyv
use core::fmt;
use std::{