From 947f9340f0e874e7f19a0d3e853c861920dc4e96 Mon Sep 17 00:00:00 2001
From: Marcel Lauhoff
Date: Fri, 5 May 2023 17:18:08 +0200
Subject: [PATCH 1/3] adr: SFS high level design decisions

Signed-off-by: Marcel Lauhoff
---
 ...llection-of-high-level-design-decisions.md | 23 +++++++++++++++++++
 docs/dicts/s3gw.dict                          |  1 +
 2 files changed, 24 insertions(+)
 create mode 100644 docs/decisions/0009-sfs-collection-of-high-level-design-decisions.md

diff --git a/docs/decisions/0009-sfs-collection-of-high-level-design-decisions.md b/docs/decisions/0009-sfs-collection-of-high-level-design-decisions.md
new file mode 100644
index 00000000..48dff25d
--- /dev/null
+++ b/docs/decisions/0009-sfs-collection-of-high-level-design-decisions.md
@@ -0,0 +1,23 @@
+# Collection of High Level Design Decisions
+
+Use soft deletion. Mark versions deleted, let a garbage collector hard
+delete them *later*.
+
+Non-versioned objects are a special case of versioned objects. They
+generally follow the same logic.
+
+The SFS SQLite database is the source of truth. Example: If we delete
+an object, we first modify the database, then the filesystem. Example:
+Serve metadata (object size, mtime, ...) from the database rather
+than calling stat() on the file.
+
+SQLite transactions are atomic. Filesystem operations may or may not be. Both combined are not.
+Orphaned files on the filesystem are acceptable and countered by an offline fsck tool.
+
+Use Ceph bufferlist encodings of data structures in the database where
+we don't have to query individual fields. Example: object attrs are
+bufferlist encoded, deletion time is not. This lets us leverage Ceph data
+structure versioning support as much as possible.
+
+Use negative return value style error handling, not exceptions. This
+follows from the Google C++ style guide and the general RGW style.
diff --git a/docs/dicts/s3gw.dict b/docs/dicts/s3gw.dict
index b89958e6..1298e5db 100644
--- a/docs/dicts/s3gw.dict
+++ b/docs/dicts/s3gw.dict
@@ -53,3 +53,4 @@ pytest
 proxying
 COSI
 objectstorage
+bufferlist

From 40bc1bb3960dd47798ce19214bd5b165568b6d45 Mon Sep 17 00:00:00 2001
From: Marcel Lauhoff
Date: Fri, 5 May 2023 17:22:55 +0200
Subject: [PATCH 2/3] adr: SFS versioning

Notes from the design review session 2023-05-05

Signed-off-by: Marcel Lauhoff
---
 docs/decisions/0010-sfs-versioning.md | 142 ++++++++++++++++++++++++++
 docs/dicts/s3gw.dict                  |   1 +
 2 files changed, 143 insertions(+)
 create mode 100644 docs/decisions/0010-sfs-versioning.md

diff --git a/docs/decisions/0010-sfs-versioning.md b/docs/decisions/0010-sfs-versioning.md
new file mode 100644
index 00000000..d64ba20c
--- /dev/null
+++ b/docs/decisions/0010-sfs-versioning.md
@@ -0,0 +1,142 @@
+# SFS Versioning
+
+Refines ADR 0003-SFS on the object state machine and database columns.
+
+## Database
+
+In this document we only look at the tables Objects and Versioned Objects.
+
+An *object* has a name, id, and a reference to a *bucket*.
+
+An *object version* has an id, checksum, {create, commit, delete}_time,
+mtime, size, **state**, **type**, ETag, serialized attributes, etc.
+
+An *object* is a group of versions identified by bucket and name.
+
+An *object* has one or more *object versions*, regardless of the bucket versioning setting.
+
+Object **state** is an enum of Open, Committed, Deleted. See Object State Machine.
+
+Object **type** is an enum of Regular, Delete Marker.
+
+## Object State Machine
+
+```mermaid
+stateDiagram-v2
+    [*] --> Open : Create new version
+    Open --> Committed : Success
+    Open --> Open : Writing
+    Committed --> Deleted : Delete
+    Open --> [*] : Failure
+    Deleted --> [*] : Garbage Collection
+```
+
+*Open* - Initial state. Data is in flight. Data on disk is dirty. A
+version may stay in this state if, for example, a client fails during
+a PUT operation. A to-be-defined GC process cleans this up after some
+time.
+
+*Committed* - Upload finished. Data persisted. GETs and LISTs
+will return this object.
+
+*Deleted* - Soft-deleted state. In S3 terms, the version is permanently deleted.
+
+*terminal state* - The row no longer exists in the database.
+
+An object version never moves back from Deleted to Committed. Deleting a
+version is treated as permanent, even though it only becomes permanent
+once the GC has removed it.
+
+State changes also update timestamps.
+
+## Timestamps
+
+(See ADR 0011 for information on the data type we store.)
+
+Versioned objects store the following timestamps:
+
+*commit_time* - When the object changes to *Committed* state.
+
+*delete_time* - When the object changes to *Deleted* state.
+
+*create_time* - Set when the row is first created in *Open* state.
+
+*mtime* - A modification time passed to SFS. We follow the RADOS
+  logic: Given a `set_mtime` and `mtime`, we persist `set_mtime`. If
+  `is_zero(set_mtime)`, we take `real_clock::now()` instead. We return
+  the value we persisted as `mtime`.
+
+## Object Version Types
+
+An object version may be of special type *delete marker*, representing
+an S3 delete marker in versioned buckets.
+
+## Garbage Collection
+
+Input: object versions in deleted state.
+
+Either delete single versions or delete whole objects.
+
+In the case of whole objects, delete the object row if all versions are deleted.
+In a transaction, delete all versions, then the object.
+
+Should a concurrent transaction add a new version, one of the two will fail on
+the foreign key constraint between version and object. In the GC we
+continue with the next object; in the create path we retry or let the
+client retry.
+
+## Operations
+
+### Deletion
+
+Deleting a version is not to be confused with hiding objects by adding
+deletion markers. Deleting a version sets its state to DELETED.
+
+### Hiding / Deletion Marker
+
+A delete request to a versioned bucket without a version id creates a delete
+marker. This hides the object and all its versions from unversioned GETs
+and LISTs.
+
+SFS implements this with object versions having type delete marker. If
+one exists on an object, the object is excluded.
+
+Deletion markers are object versions and follow the same state machine
+and garbage collection as regular versions do. However, they have no
+size and no data on the filesystem.
+
+### Create / Update
+
+(versioned, unversioned buckets) In a transaction:
+
+1. Find or create an object row (id, name, bucket).
+2. Create an object version in state OPEN with a creation time of now.
+   If versioned, use the version id provided. Otherwise generate one.
+
+Receive and write data.
+
+On completion:
+(versioned buckets) set the previously created version to state COMMITTED. Record
+checksum, size, etc.
+
+(unversioned buckets) set the previously created version to state COMMITTED
+AND set all other versions to DELETED.
+
+### Access: Listing, GETs
+
+Many operations expect a head, null, last or "canonical" version of
+an object. We define this as the *last* committed version, having the *latest*
+*commit_time* timestamp.
+
+If the timestamp resolution is too low to distinguish versions, the highest id wins.
+
+(unversioned) Since we rely on versions to implement updates, more
+than one committed version may exist. We make this less likely by the commit rule in
+Create / Update above. Should multiple exist, the latest is the one we use.
+
+### Out of scope (for now): Versioning Suspended Buckets
+
+For now SFS won't support versioning suspended buckets. Object
+versions created while the bucket was unversioned or in versioning
+suspended state have a 'null' version id. This is not directly
+supported by this design and requires refinement in the future.
diff --git a/docs/dicts/s3gw.dict b/docs/dicts/s3gw.dict
index 1298e5db..a962bdd6 100644
--- a/docs/dicts/s3gw.dict
+++ b/docs/dicts/s3gw.dict
@@ -54,3 +54,4 @@ proxying
 COSI
 objectstorage
 bufferlist
+ETag

From 33a3d266049eb3ca59b301466f7373b523ee818d Mon Sep 17 00:00:00 2001
From: Marcel Lauhoff
Date: Mon, 15 May 2023 11:40:39 +0200
Subject: [PATCH 3/3] adr: SFS timestamps

Signed-off-by: Marcel Lauhoff
---
 docs/decisions/0011-sfs-timestamps.md | 62 +++++++++++++++++++++++++++
 docs/dicts/s3gw.dict                  |  1 +
 2 files changed, 63 insertions(+)
 create mode 100644 docs/decisions/0011-sfs-timestamps.md

diff --git a/docs/decisions/0011-sfs-timestamps.md b/docs/decisions/0011-sfs-timestamps.md
new file mode 100644
index 00000000..2a31c5dc
--- /dev/null
+++ b/docs/decisions/0011-sfs-timestamps.md
@@ -0,0 +1,62 @@
+
+# SFS Timestamps
+
+## Context and Problem Statement
+
+We need to create, handle and persist timestamps.
+
+In the RGW space we have ceph_time.h and std::chrono.
+
+SQLite represents time as ISO-8601 strings, Julian day numbers or Unix
+timestamps, stored in TEXT, REAL or INTEGER data types. [SQLite doc:
+data types](https://www.sqlite.org/datatype3.html). It has functions to
+work with these data types. [SQLite doc:
+datefunc](https://www.sqlite.org/lang_datefunc.html).
+
+This doc is about the conversion between RGW/SFS and SQLite space.
+
+Summary of the discussion in weeks 19 and 20 of 2023. [GH
+Comments](https://github.com/aquarist-labs/s3gw/pull/497)
+
+## Ceph time
+
+We use `ceph::real_time` as timestamps.
+
+`ceph::real_time` is a `uint64_t` nanosecond count since epoch.
+
+## How to store time in SQLite
+
+### Requirements
+
+At least microsecond resolution.
+
+We can leverage SQLite range queries, sorting and indices. Ideally we can
+leverage date / time functions with minor conversion.
+
+### Considered Options
+
+Options that don't fit our requirements:
+
+- *ISO8601* - Strings. Not enough resolution.
+- *UNIX time* - Not enough resolution.
+
+- *multicol* - Store seconds as *ISO8601* or *UNIX time* (SQLite
+  functions work directly). Store nanoseconds in a second column.
+  Queries are awkward.
+- *int64 ns* - Store as int64, nanosecond resolution, convert from
+  uint64. Max value up to year 2262. Queries work.
+- *int64 us* - Store as int64, microsecond resolution, convert from
+  uint64. Max value around year 294,000. Queries work.
+- *hex* - Store as 16-character hex string. Full `ceph::real_time` range.
+  Conversion cost. Queries work.
+- *uint64* - Squeeze `uint64_t` into SQLite `INTEGER` type
+  representation. Queries don't work.
+- *blob* - Store `ceph::real_time` as an SQLite blob. Can't query and
+  index as easily as *hex* or *int64* options.
+
+### Decision Outcome
+
+We choose *int64 ns* because it meets all criteria.
+
+It requires minimal conversion. We can live with the year 2262
+limitation.
diff --git a/docs/dicts/s3gw.dict b/docs/dicts/s3gw.dict
index a962bdd6..6ed21f7c 100644
--- a/docs/dicts/s3gw.dict
+++ b/docs/dicts/s3gw.dict
@@ -55,3 +55,4 @@ COSI
 objectstorage
 bufferlist
 ETag
+chrono