s3gw-tech · irq0 · May 15, 2023 · May 5, 2023 · May 5, 2023 · May 15, 2023
diff --git a/docs/decisions/0009-sfs-collection-of-high-level-design-decisions.md b/docs/decisions/0009-sfs-collection-of-high-level-design-decisions.md
@@ -0,0 +1,23 @@
+# Collection of High Level Design Decisions
+
+Use soft deletion. Mark versions deleted, let a garbage collector hard
+delete *later*.
+
+Non-versioned objects are a special case of versioned objects. They
+generally follow the same logic.
+
+The SFS SQLite database is the source of truth. Example: If we delete
+an object, we first modify the database, then the filesystem. Example:
+Serve metadata (object size, mtime, ...) from the database rather
+than calling stat() on the file.
+
+SQLite transactions are atomic. Filesystem operations may be. The two combined are not.
+Orphaned files on the filesystem are acceptable and countered by an offline fsck tool.
+
+Use Ceph bufferlist encodings of data structures in the database where
+we don't have to query individual fields. Example: object attrs are
+bufferlist encoded, deletion time is not. This lets us leverage Ceph's
+data structure versioning support as much as possible.
+
+Use negative return value style error handling, not exceptions. This
+follows from the Google C++ style guide and the general RGW style.
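A minimal sketch of the "soft delete first, database before filesystem" ordering described in ADR 0009 above. It is written in Python for brevity and is not the SFS implementation: the `versions` table, its columns, and the helper names `soft_delete` and `garbage_collect` are hypothetical.

```python
import os
import sqlite3
import tempfile

# Hypothetical, heavily simplified schema; the real SFS tables differ.
SCHEMA = """
CREATE TABLE versions (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    size INTEGER NOT NULL,
    deleted INTEGER NOT NULL DEFAULT 0
);
"""


def soft_delete(db, version_id):
    """Mark a version deleted. Only the database changes here."""
    with db:  # commits on success, rolls back on an exception
        db.execute("UPDATE versions SET deleted = 1 WHERE id = ?", (version_id,))


def garbage_collect(db):
    """Hard delete: remove the rows first, then unlink the files.

    A crash between the two steps leaves an orphaned file, never a row
    that points at a missing file.
    """
    rows = db.execute("SELECT id, path FROM versions WHERE deleted = 1").fetchall()
    with db:
        db.execute("DELETE FROM versions WHERE deleted = 1")
    removed = []
    for version_id, path in rows:
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass  # already gone; the database stays authoritative
        removed.append(version_id)
    return removed


def demo():
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "v1.data")
    with open(path, "wb") as f:
        f.write(b"hello")
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    with db:
        db.execute("INSERT INTO versions (id, path, size) VALUES (1, ?, 5)", (path,))
    soft_delete(db, 1)
    file_survives_soft_delete = os.path.exists(path)
    removed = garbage_collect(db)
    remaining_rows = db.execute("SELECT COUNT(*) FROM versions").fetchone()[0]
    return file_survives_soft_delete, removed, remaining_rows, os.path.exists(path)
```

Here `demo()` soft deletes one version, checks that its file is still on disk, and then lets the collector remove the row and the file. An offline fsck tool, as mentioned in the ADR, would handle files left behind by a crash between the two steps.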
diff --git a/docs/decisions/0010-sfs-versioning.md b/docs/decisions/0010-sfs-versioning.md
@@ -0,0 +1,142 @@
+# SFS Versioning
+
+Refines ADR 0003-SFS on the object state machine and database columns.
+
+## Database
+
+In this document we only look at tables Objects and Versioned Objects.
+
+An *object* has a name, id, and reference to a *bucket*.
+
+An *object version* has an id, checksum, {create, commit, delete}_time,
+mtime, size, **state**, **type**, ETag, serialized attributes, etc.
+
+An *object* is a group of versions identified by bucket and name.
+
+An *object* has one or more *object versions*, regardless of the bucket versioning setting.
+
+Object **state** is an enum of Open, Committed, Deleted. See Object State Machine.
+
+Object **type** is an enum of Regular, Delete Marker.
+
+## Object State Machine
+
+```mermaid
+stateDiagram-v2
+    [*] --> Open : Create new version
+    Open --> Committed : Success
+    Open --> Open : Writing
+    Committed --> Deleted : Delete
+    Open --> [*] : Failure
+    Deleted --> [*] : Garbage Collection
+```
+
+*Open* - Initial state. Data is in flight. Data on disk is dirty. A
+version may stay in this state if, for example, a client fails during
+a PUT operation. A to-be-defined GC process cleans this up after some
+time.
+
+*Committed* - Upload finished. Data persisted. GETs and LIST
+will return this object.
+
+*Deleted* - Soft deletion state. In terms of S3, the version is permanently deleted.
+
+*terminal state* - the row no longer exists in the database.
+
+An object version never moves back from Deleted to Committed. Deleting a
+version is treated as permanent, even though it only becomes permanent
+once the GC has hard deleted it.
+
+State changes also update timestamps, as described in the next section.
+
+## Timestamps
+
+(See ADR 0011 for information on the data type we store.)
+
+Versioned objects store the following timestamps:
+
+*commit_time* - When the object changes to *Committed* state.
+
+*delete_time* - When the object changes to *Deleted* state.
+
+*create_time* - Set when the row is first created in *Open* state.
+
+*mtime* - A modification time passed to SFS. We follow the RADOS
+  logic: Passed a `set_mtime` and `mtime`, we persist `set_mtime`. If
+  `is_zero(set_mtime)` we take `real_clock::now()`. Return the `mtime`
+  we persisted.
+
+## Object Version Types
+
+An object version may be of special type *delete marker*, representing
+an S3 delete marker in versioned buckets.
+
+## Garbage Collection
+
+Input: object versions in deleted state.
+
+Either delete single versions or delete whole objects.
+
+In the case of whole objects, delete the object row if all versions are deleted.
+In a transaction delete all versions, then the object.
+
+Should a concurrent transaction add a new version, one of the two will
+fail on the foreign key constraint between version and object. In the
+GC we continue with the next object; in the create path we retry or
+let the client retry.
+
+## Operations
+
+### Deletion
+
+Deleting a version is not to be confused with hiding objects by adding
+deletion markers. Deleting a version sets its state to DELETED.
+
+### Hiding / Deletion Marker
+
+A delete to a versioned bucket without a version id creates a delete
+marker. This hides the object and all its versions from unversioned GETs
+and LISTs.
+
+SFS implements this with object versions having type delete marker. If
+one exists on an object, the object is excluded from those results.
+
+Deletion markers are object versions and follow the same state machine
+and garbage collection as regular versions do. However, they have
+neither a size nor data on the filesystem.
+
+### Create / Update
+
+(versioned, unversioned buckets) In a transaction:
+
+1. Find or create an object row (id, name, bucket)
+2. Create an object version in state OPEN with creation time of now.
+    If versioned, use the version id provided. Otherwise generate one.
+
+Receive and write data.
+
+On completion:
+(versioned buckets) set the state of the previously created version to COMMITTED. Record
+checksum, size, etc.
+
+(unversioned buckets) set the state of the previously created version to COMMITTED
+AND set all other versions to DELETED.
+
+### Access: Listing, GETs
+
+Many operations expect a head, null, last or "canonical" version of
+an object. We define this as the *last* committed version, having the *latest*
+*commit_time* timestamp.
+
+If the timestamp resolution is too low to distinguish versions, the highest id wins.
+
+(unversioned) Since we rely on versions to implement updates, more
+than one committed version may exist. We make this less likely by the commit rule in
+Create / Update above. Should multiple exist, the latest is the one we use.
+
+### Out of scope (for now): Versioning Suspended Buckets
+
+For now SFS won't support versioning suspended buckets. Object
+versions created while the bucket was unversioned or in versioning
+suspended state have a 'null' version id. This is not directly
+supported by this design and requires refinement in the future.
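The Create / Update and Access rules from ADR 0010 above can be sketched against SQLite as follows. This is an illustrative sketch in Python, not the SFS code. The schema, the integer state values and all function names are hypothetical. Delete markers and version id handling are left out. One assumption is made explicit in the code: on an unversioned commit only other *committed* versions are set to DELETED, since the state diagram only shows a Committed to Deleted transition.

```python
import sqlite3

# Hypothetical integer encoding of the state enum.
OPEN, COMMITTED, DELETED = 0, 1, 2

SCHEMA = """
CREATE TABLE objects (
    id INTEGER PRIMARY KEY,
    bucket TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE (bucket, name)
);
CREATE TABLE versions (
    id INTEGER PRIMARY KEY,
    object_id INTEGER NOT NULL REFERENCES objects (id),
    state INTEGER NOT NULL,
    create_time INTEGER NOT NULL,
    commit_time INTEGER,
    delete_time INTEGER
);
"""


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")
    db.executescript(SCHEMA)
    return db


def create_version(db, bucket, name, now):
    """Find or create the object row, then add a version in state OPEN."""
    with db:  # one transaction for both steps
        db.execute(
            "INSERT OR IGNORE INTO objects (bucket, name) VALUES (?, ?)",
            (bucket, name),
        )
        (object_id,) = db.execute(
            "SELECT id FROM objects WHERE bucket = ? AND name = ?", (bucket, name)
        ).fetchone()
        cur = db.execute(
            "INSERT INTO versions (object_id, state, create_time) VALUES (?, ?, ?)",
            (object_id, OPEN, now),
        )
        return cur.lastrowid


def commit_version(db, version_id, now, versioned):
    """Move a version from OPEN to COMMITTED."""
    with db:
        db.execute(
            "UPDATE versions SET state = ?, commit_time = ? WHERE id = ? AND state = ?",
            (COMMITTED, now, version_id, OPEN),
        )
        if not versioned:
            # Assumption: only other committed versions are soft deleted.
            db.execute(
                "UPDATE versions SET state = ?, delete_time = ? "
                "WHERE object_id = (SELECT object_id FROM versions WHERE id = ?) "
                "AND id != ? AND state = ?",
                (DELETED, now, version_id, version_id, COMMITTED),
            )


def canonical_version(db, bucket, name):
    """Latest committed version; the highest id breaks commit_time ties."""
    row = db.execute(
        "SELECT v.id FROM versions v JOIN objects o ON o.id = v.object_id "
        "WHERE o.bucket = ? AND o.name = ? AND v.state = ? "
        "ORDER BY v.commit_time DESC, v.id DESC LIMIT 1",
        (bucket, name, COMMITTED),
    ).fetchone()
    return row[0] if row else None
```

The `ORDER BY v.commit_time DESC, v.id DESC` clause encodes the tie-break rule from the Access section. A version that is still OPEN is never returned by `canonical_version`.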
diff --git a/docs/decisions/0011-sfs-timestamps.md b/docs/decisions/0011-sfs-timestamps.md
@@ -0,0 +1,62 @@
+<!-- #cSpell:words multicol datefunc -->
+# SFS Timestamps
+
+## Context and Problem Statement
+
+We need to create, handle and persist timestamps.
+
+In the RGW space we have ceph_time.h and std::chrono.
+
+SQLite represents time as ISO-8601 strings, Julian day numbers or Unix timestamps,
+stored as TEXT, REAL or INTEGER data types respectively. [SQLite doc:
+data types](https://www.sqlite.org/datatype3.html). It has functions to
+work with these data types. [SQLite doc:
+datefunc](https://www.sqlite.org/lang_datefunc.html).
+
+This doc is about the conversion between RGW/sfs and SQLite space.
+
+Summary of discussion in weeks 19, 20 2023. [GH
+Comments](https://github.com/aquarist-labs/s3gw/pull/497)
+
+## Ceph time
+
+We use `ceph::real_time` as timestamps.
+
+`ceph::real_time` is a `uint64_t` nanosecond count since epoch.
+
+## How to store time in SQLite
+
+### Requirements
+
+Minimum microsecond resolution.
+
+We can leverage SQLite range queries, sorting and indices. Ideally we can
+leverage date / time functions with minor conversion.
+
+### Considered Options
+
+Options that don't fit our requirements:
+
+- *ISO8601* strings - Not enough resolution.
+- *UNIX time* - Not enough resolution.
+
+- *multicol* - Store seconds as *ISO8601* or *UNIX time* (SQLite
+  functions work directly). Store nanoseconds in a second column.
+  Queries awkward.
+- *int64 ns* - Store as int64, nanosecond resolution, convert from
+  uint64. Max value up to year 2262. Queries work.
+- *int64 us* - Store as int64, microsecond resolution, convert from
+  uint64. Covers the `ceph::real_time` range up to year 2554. Queries work.
+- *hex* - Store as 16 char hex string. Full `ceph::real_time` range.
+  Conversion cost. Queries work.
+- *uint64* - Squeeze `uint64_t` into SQLite `INTEGER` type
+  representation. No queries.
+- *blob* - Store `ceph::real_time` as an SQLite blob. Can't query and
+  index as easily as *hex* or *int64* options.
+
+### Decision Outcome
+
+We choose *int64 ns*, because it meets all criteria.
+
+It requires minimal conversion. We can live with the year 2262
+limitation.
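To illustrate the *int64 ns* decision in ADR 0011 above, here is a small Python sketch of storing nanosecond counts in an SQLite `INTEGER` column. It is not the SFS conversion code: the table name, index name and the helper `to_sqlite_ns` are hypothetical. The date functions only see whole seconds, so the sketch divides by 1,000,000,000 before calling them.

```python
import sqlite3

INT64_MAX = 2**63 - 1  # largest value an SQLite INTEGER can hold


def to_sqlite_ns(ns_since_epoch):
    """Range-check an unsigned nanosecond count before storing it as int64."""
    if not 0 <= ns_since_epoch <= INT64_MAX:
        raise OverflowError("timestamp does not fit into a signed 64-bit integer")
    return ns_since_epoch


def demo():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE versions (id INTEGER PRIMARY KEY, commit_time INTEGER NOT NULL)"
    )
    db.execute("CREATE INDEX versions_commit_time ON versions (commit_time)")
    stamps = [
        1_683_849_600_000_000_001,  # 2023-05-12 00:00:00 UTC plus 1 ns
        1_683_849_600_000_000_000,  # 2023-05-12 00:00:00 UTC
        1_683_936_000_123_456_789,  # 2023-05-13 00:00:00 UTC plus a fraction
    ]
    db.executemany(
        "INSERT INTO versions (commit_time) VALUES (?)",
        [(to_sqlite_ns(s),) for s in stamps],
    )
    # Sorting distinguishes values that differ by a single nanosecond.
    ordered = [r[0] for r in db.execute("SELECT id FROM versions ORDER BY commit_time")]
    # Range query over one day, expressed in nanoseconds.
    day_start = 1_683_849_600 * 10**9
    day_end = day_start + 86_400 * 10**9
    in_range = [
        r[0]
        for r in db.execute(
            "SELECT id FROM versions WHERE commit_time >= ? AND commit_time < ? "
            "ORDER BY id",
            (day_start, day_end),
        )
    ]
    # Date functions work after an integer division down to seconds.
    as_text = db.execute(
        "SELECT datetime(commit_time / 1000000000, 'unixepoch') "
        "FROM versions WHERE id = 3"
    ).fetchone()[0]
    # Year of the largest representable timestamp.
    last_year = db.execute(
        "SELECT strftime('%Y', ?, 'unixepoch')", (INT64_MAX // 10**9,)
    ).fetchone()[0]
    return ordered, in_range, as_text, last_year
```

`demo()` returns the sort order, the ids inside the one-day range, the third timestamp rendered by `datetime()`, and the year in which the int64 nanosecond range ends.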
diff --git a/docs/dicts/s3gw.dict b/docs/dicts/s3gw.dict
@@ -53,3 +53,6 @@ pytest
 proxying
 COSI
 objectstorage
+bufferlist
+ETag
+chrono