SFS ADRs based on Design Review Session 2023-05-05 #497

Merged 3 commits on May 15, 2023
@@ -0,0 +1,23 @@
# Collection of High Level Design Decisions

Use soft deletion. Mark versions as deleted and let a garbage collector
hard-delete them *later*.

Non-versioned objects are a special case of versioned objects. They
generally follow the same logic.

The SFS SQLite database is the source of truth. Example: If we delete
an object, we first modify the database, then the filesystem. Example:
Serve metadata (object size, mtime, ...) from the database rather
than calling stat() on the file.

SQLite transactions are atomic. Filesystem operations may or may not be. The two combined are not.
Orphaned files on the filesystem are acceptable and are handled by an offline fsck tool.
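
A minimal sketch of the ordering described above, written in Python with the standard `sqlite3` module rather than the SFS C++ code. The `files` table, `delete_object` and `find_orphans` are hypothetical names chosen for illustration; the real schema and fsck tool are not shown in this document.

```python
import os
import sqlite3
import tempfile


def delete_object(db, data_dir, name):
    """Remove the database row first, then the file on disk."""
    with db:  # commits on success, rolls back on an exception
        db.execute("DELETE FROM files WHERE name = ?", (name,))
    # A crash at this point leaves an orphaned file, never a dangling row.
    os.remove(os.path.join(data_dir, name))


def find_orphans(db, data_dir):
    """Return names of files on disk that have no row in the database."""
    known = {row[0] for row in db.execute("SELECT name FROM files")}
    return sorted(set(os.listdir(data_dir)) - known)
```

Because the database is modified first, a failure between the two steps can only produce a file without a row, which is the case an offline check can find by comparing the directory listing against the table.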

Use Ceph bufferlist encodings of data structures in the database where
we don't have to query individual fields. Example: object attrs are
bufferlist encoded, deletion time is not. This lets us leverage Ceph's
data structure versioning support as much as possible.

Use negative return value style error handling, not exceptions. This
follows from the Google C++ style guide and the general RGW style.
142 changes: 142 additions & 0 deletions docs/decisions/0010-sfs-versioning.md

Review comment: Maybe move this to docs/design? Makes more sense, I think.

@@ -0,0 +1,142 @@
# SFS Versioning

Refines ADR 0003-SFS on the object state machine and database columns.

## Database

In this document we only look at the tables Objects and Versioned Objects.

An *object* has a name, id, and reference to a *bucket*.

An *object version* has an id, checksum, {create, commit, delete}_time,
mtime, size, **state**, **type**, ETag, serialized attributes, etc.

An *object* is a group of versions identified by bucket and name.

An *object* has one or more *object versions*, regardless of the bucket versioning setting.

Object **state** is an enum of Open, Committed, Deleted. See Object State Machine.

Object **type** is an enum of Regular, Delete Marker.

## Object State Machine

```mermaid
stateDiagram-v2
[*] --> Open : Create new version
Open --> Committed : Success
Open --> Open : Writing
Committed --> Deleted : Delete
Open --> [*] : Failure
Deleted --> [*] : Garbage Collection
```

*Open* - Initial state. Data is in flight. Data on disk is dirty. A
version may stay in this state if, for example, a client fails during
a PUT operation. A to-be-defined GC process cleans this up after some
time.

*Committed* - Upload finished. Data persisted. GETs and LIST
will return this object.

*Deleted* - Soft deletion state in SFS, but a permanent deletion in terms of S3.

*terminal state* - the row no longer exists in the database.

An object will never move back from Deleted to Committed. Deleting a
version is treated as permanent, even though it only becomes physically
permanent once the GC has hard-deleted it.
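
The state diagram can be read as a transition table. The following Python sketch encodes it that way; `GONE` is a hypothetical name for the terminal state (the row no longer exists), and `can_transition` is an illustrative helper, not an SFS function.

```python
from enum import Enum


class State(Enum):
    OPEN = "open"
    COMMITTED = "committed"
    DELETED = "deleted"
    GONE = "gone"  # hypothetical name for the terminal state: row removed


# Allowed transitions, taken from the state diagram above.
ALLOWED = {
    State.OPEN: {State.OPEN, State.COMMITTED, State.GONE},
    State.COMMITTED: {State.DELETED},
    State.DELETED: {State.GONE},
    State.GONE: set(),
}


def can_transition(src, dst):
    """Return True if the diagram has an edge from src to dst."""
    return dst in ALLOWED[src]
```

Note that the table has no edge from `DELETED` back to `COMMITTED`, which is the rule stated above.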

State changes also update timestamps, as described in the next section.

## Timestamps

(See ADR 11 for information on the data type we store)

Versioned objects store the following timestamps:

*commit_time* - When the object changes to *Committed* state.

*delete_time* - When the object changes to *Deleted* state.

*create_time* - Set when the row is first created in *Open* state.

*mtime* - A modification time passed to SFS. We follow the RADOS
logic: Given a `set_mtime` and `mtime`, we persist `set_mtime`. If
`is_zero(set_mtime)`, we persist `real_clock::now()` instead. We return
the value we persisted as `mtime`.
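
The *mtime* rule, sketched in Python with integer nanosecond timestamps. `resolve_mtime` and its `now_ns` parameter are hypothetical names; the actual code works on `ceph::real_time` values in C++.

```python
import time


def resolve_mtime(set_mtime_ns, now_ns=time.time_ns):
    """Return the mtime to persist: set_mtime unless it is zero, else now."""
    if set_mtime_ns != 0:
        return set_mtime_ns
    return now_ns()
```

The clock is passed in as a parameter only so the fallback branch can be exercised with a fixed value.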

## Object Version Types

An object version may be of special type *delete marker*, representing
an S3 delete marker in versioned buckets.

## Garbage Collection

Input: object versions in deleted state.

Either delete single versions or delete whole objects.

When deleting whole objects, delete the object row only if all of its
versions are deleted. In one transaction, delete all versions, then the object.

Should a concurrent transaction add a new version, one of the two
transactions will fail on the foreign key constraint between version and
object. In the GC we continue with the next object; in the create path
we retry or let the client retry.
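
A sketch of the whole-object path, assuming a simplified two-table schema (`objects`, `versions`) that is not the real SFS schema. It uses Python's standard `sqlite3` module; `open_db` and `gc_object` are hypothetical helpers. SQLite enforces foreign keys only when `PRAGMA foreign_keys = ON` is set on the connection.

```python
import sqlite3

SCHEMA = """
CREATE TABLE objects (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE versions (
    id INTEGER PRIMARY KEY,
    object_id INTEGER NOT NULL REFERENCES objects(id),
    state TEXT NOT NULL
);
"""


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    db.executescript(SCHEMA)
    return db


def gc_object(db, object_id):
    """Hard-delete an object whose versions are all deleted.

    Returns True if the object was removed, False if a live version
    still references it (the transaction is rolled back in that case).
    """
    try:
        with db:  # one transaction: versions first, then the object row
            db.execute(
                "DELETE FROM versions WHERE object_id = ? AND state = 'deleted'",
                (object_id,),
            )
            db.execute("DELETE FROM objects WHERE id = ?", (object_id,))
        return True
    except sqlite3.IntegrityError:
        return False  # foreign key violation: skip, continue with the next object
```

Here the foreign key does the "are all versions deleted" check: if any non-deleted version remains, deleting the object row fails and nothing is removed.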

## Operations

### Deletion

Deleting a version is not to be confused with hiding objects by adding
deletion markers. Deleting a version sets its state to DELETED.

### Hiding / Deletion Marker

A delete to a versioned bucket without a version creates a delete
marker. This hides the object and all its versions from unversioned GETs
and LISTs.

SFS implements this with object versions having type delete marker. If
one exists on an object, the object is excluded from those results.

Deletion markers are object versions and follow the same state machine
and garbage collection as regular versions do. However, they have no
size and no data on the filesystem.

### Create / Update

(versioned, unversioned buckets) In a transaction:

1. Find or create an object row (id, name, bucket).
2. Create an object version in state OPEN with a creation time of now.
   If versioned, use the version id provided. Otherwise generate one.

Receive and write data.

On completion:

- (versioned buckets) Set the previously created version to state
  COMMITTED. Record checksum, size, etc.
- (unversioned buckets) Set the previously created version to state
  COMMITTED AND set all other versions to DELETED.
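
The flow above can be sketched as two transactions around the data write. This is Python with the standard `sqlite3` module; the schema, column names and the helpers `create_version` and `commit_version` are assumptions made for illustration. One further assumption: in the unversioned case only *committed* versions are moved to DELETED, so that the state machine (which has no Open to Deleted edge) still holds.

```python
import sqlite3
import time
import uuid

SCHEMA = """
CREATE TABLE objects (
    id INTEGER PRIMARY KEY,
    bucket TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE (bucket, name)
);
CREATE TABLE versions (
    id INTEGER PRIMARY KEY,
    object_id INTEGER NOT NULL REFERENCES objects(id),
    version_id TEXT NOT NULL,
    state TEXT NOT NULL,
    create_time INTEGER,
    commit_time INTEGER,
    delete_time INTEGER,
    size INTEGER
);
"""


def create_version(db, bucket, name, version_id=None):
    """Steps 1 and 2: find or create the object row, add an OPEN version."""
    with db:
        db.execute(
            "INSERT OR IGNORE INTO objects (bucket, name) VALUES (?, ?)",
            (bucket, name),
        )
        (object_id,) = db.execute(
            "SELECT id FROM objects WHERE bucket = ? AND name = ?",
            (bucket, name),
        ).fetchone()
        cur = db.execute(
            "INSERT INTO versions (object_id, version_id, state, create_time)"
            " VALUES (?, ?, 'open', ?)",
            (object_id, version_id or uuid.uuid4().hex, time.time_ns()),
        )
    return object_id, cur.lastrowid


def commit_version(db, object_id, row_id, size, versioned):
    """On completion: commit the new version; if unversioned, retire the rest."""
    now = time.time_ns()
    with db:
        db.execute(
            "UPDATE versions SET state = 'committed', commit_time = ?, size = ?"
            " WHERE id = ?",
            (now, size, row_id),
        )
        if not versioned:
            # Assumption: only committed versions are soft-deleted here.
            db.execute(
                "UPDATE versions SET state = 'deleted', delete_time = ?"
                " WHERE object_id = ? AND id != ? AND state = 'committed'",
                (now, object_id, row_id),
            )
```

Committing the new version and retiring the old ones in the same transaction is what keeps the window with two committed versions small in unversioned buckets.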

### Access: Listing, GETs

Many operations expect a head, null, last or "canonical" version of
an object. We define this as the *last* committed version, that is, the
one with the *latest* *commit_time* timestamp.

If the timestamp resolution is too low to distinguish versions, the highest id wins.

(unversioned) Since we rely on versions to implement updates, more
than one committed version may exist. We make this less likely by the commit rule in
Create / Update above. Should multiple exist, the latest is the one we use.
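
The selection rule for the canonical version maps directly onto an `ORDER BY` with a tie-breaker. A sketch using Python's standard `sqlite3` module, with an assumed minimal `versions` table and a hypothetical `canonical_version` helper:

```python
import sqlite3

# Latest commit_time wins; on a tie, the highest id wins.
CANONICAL_SQL = """
SELECT id FROM versions
WHERE object_id = ? AND state = 'committed'
ORDER BY commit_time DESC, id DESC
LIMIT 1
"""


def canonical_version(db, object_id):
    """Return the row id of the canonical version, or None if there is none."""
    row = db.execute(CANONICAL_SQL, (object_id,)).fetchone()
    return row[0] if row else None
```

The same query covers the unversioned case where more than one committed version exists: it still returns exactly one row.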

### Out of scope (for now): Versioning Suspended Buckets

For now SFS won't support versioning suspended buckets. Object
versions created while the bucket was unversioned or in versioning
suspended state have a 'null' version id. This is not directly
supported by this design and requires refinement in the future.
62 changes: 62 additions & 0 deletions docs/decisions/0011-sfs-timestamps.md
@@ -0,0 +1,62 @@
<!-- #cSpell:words multicol datefunc -->
# SFS Timestamps

## Context and Problem Statement

We need to create, handle and persist timestamps.

In the RGW space we have ceph_time.h and std::chrono.

SQLite represents time as ISO-8601 strings, Julian day numbers or Unix
timestamps, stored in TEXT, REAL or INTEGER data types. [SQLite doc:
data types](https://www.sqlite.org/datatype3.html). It has functions to
work with these data types. [SQLite doc:
datefunc](https://www.sqlite.org/lang_datefunc.html).

This doc is about the conversion between RGW/sfs and SQLite space.

Summary of discussion in weeks 19, 20 2023. [GH
Comments](https://github.com/aquarist-labs/s3gw/pull/497)

## Ceph time

We use `ceph::real_time` as timestamps.

`ceph::real_time` is a `uint64_t` nanosecond count since epoch.

## How to store time in SQLite

### Requirements

Minimum microsecond resolution.

We can leverage SQLite range queries, sorting, indices. Ideally we can
leverage date / time functions with minor conversion.

### Considered Options

Options that don't fit our requirements:

- *ISO8601* strings - Not enough resolution.
- *UNIX time* - Not enough resolution.

Other options:

- *multicol* - Store seconds as *ISO8601* or *UNIX time* (SQLite
functions work directly). Store nanoseconds in a second column.
Queries awkward.
- *int64 ns* - Store as int64, nanosecond resolution, convert from
uint64. Max value up to year 2262. Queries work.
- *int64 us* - Store as int64, microsecond resolution, convert from
uint64. Max value up to year 2554, the limit of the `uint64_t`
nanosecond source. Queries work.
- *hex* - Store as 16 char hex string. Full `ceph::real_time` range.
Conversion cost. Queries work.
- *uint64* - Squeeze `uint64_t` into SQLite `INTEGER` type
representation. No queries.
- *blob* - Store `ceph::real_time` as an SQLite blob. Can't query and
index as easily as *hex* or *int64* options.

### Decision Outcome

We choose *int64 ns*, because it meets all criteria.

It requires minimal conversion. We can live with the year 2262
limitation.
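
The year limits quoted above follow from the integer ranges. A short Python check of the arithmetic; `last_year_ns` and `fits_int64` are hypothetical helper names, not part of SFS:

```python
from datetime import datetime, timedelta, timezone

INT64_MAX = 2**63 - 1   # largest value of a signed 64-bit SQLite INTEGER
UINT64_MAX = 2**64 - 1  # largest uint64_t nanosecond count
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_year_ns(max_ns):
    """Year of the last instant representable as max_ns nanoseconds since epoch."""
    return (EPOCH + timedelta(microseconds=max_ns // 1000)).year


def fits_int64(ns):
    """True if a uint64 nanosecond count can be stored as a signed int64."""
    return 0 <= ns <= INT64_MAX


print(last_year_ns(INT64_MAX))   # 2262
print(last_year_ns(UINT64_MAX))  # 2554
```

Any `uint64_t` timestamp beyond the int64 range (after year 2262) would fail `fits_int64`, which is the limitation accepted by this decision.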
3 changes: 3 additions & 0 deletions docs/dicts/s3gw.dict
@@ -53,3 +53,6 @@ pytest
proxying
COSI
objectstorage
bufferlist
ETag
chrono