SFS ADRs based on Design Review Session 2023-05-05 #497
Merged
23 changes: 23 additions & 0 deletions
docs/decisions/0009-sfs-collection-of-high-level-design-decisions.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# Collection of High Level Design Decisions | ||
|
||
Use soft deletion. Mark versions deleted, let a garbage collector hard | ||
delete *later*. | ||
|
||
Non-versioned objects are a special case of versioned objects. They | ||
generally follow the same logic. | ||
|
||
The SFS SQLite database is the source of truth. Example: If we delete | ||
an object, we first modify the database then the filesystem. Example: | ||
Serve metadata (object size, mtime, ..) from the database rather | ||
than stat() the file. | ||
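The ordering rule above can be sketched as follows. This is a minimal illustration, not the SFS implementation: the `versions` table, the one-file-per-version layout and both function names are assumptions made for the example.

```python
import os
import sqlite3
import tempfile

def hard_delete_version(db, data_dir, version_id):
    """Remove the database row first, then the file (hypothetical layout)."""
    with db:  # one SQLite transaction: commit on success, rollback on error
        db.execute("DELETE FROM versions WHERE id = ?", (version_id,))
    # The database no longer knows the version. A crash before the next
    # lines leaves an orphaned file, which the design accepts.
    path = os.path.join(data_dir, str(version_id))
    if os.path.exists(path):
        os.remove(path)

def version_size(db, version_id):
    """Serve the size from the database rather than stat() on the file."""
    row = db.execute(
        "SELECT size FROM versions WHERE id = ?", (version_id,)
    ).fetchone()
    return None if row is None else row[0]

# Small demo with an in-memory database and a temporary data directory.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE versions (id INTEGER PRIMARY KEY, size INTEGER)")
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "1"), "wb") as f:
    f.write(b"hello")
with db:
    db.execute("INSERT INTO versions (id, size) VALUES (1, 5)")

size_before = version_size(db, 1)
hard_delete_version(db, data_dir, 1)
size_after = version_size(db, 1)
```

Because the row disappears before the file does, a reader never sees metadata for a file that is already gone; the worst case is a leftover file.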
|
||
SQLite transactions are atomic. Filesystem operations may or may not be. The two combined are not. | ||
Orphaned files on the filesystem are acceptable and are handled by an offline fsck tool. | ||
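A hedged sketch of what the orphan check in such an fsck tool could look like. The table name `versions`, the convention that a file is named after its version id, and the function name `find_orphans` are assumptions for illustration only.

```python
import os
import sqlite3
import tempfile

def find_orphans(db, data_dir):
    """Return file names in data_dir that have no matching database row."""
    known = {str(row[0]) for row in db.execute("SELECT id FROM versions")}
    return sorted(name for name in os.listdir(data_dir) if name not in known)

# Demo: version 1 is known to the database, file "2" is an orphan.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE versions (id INTEGER PRIMARY KEY)")
with db:
    db.execute("INSERT INTO versions (id) VALUES (1)")
data_dir = tempfile.mkdtemp()
for name in ("1", "2"):
    open(os.path.join(data_dir, name), "wb").close()

orphans = find_orphans(db, data_dir)
```

The check treats the database as the source of truth: anything on disk that the database does not reference is a candidate for removal.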
|
||
Use Ceph bufferlist encodings of data structures in the database where | ||
we don't have to query individual fields. Example: object attrs is | ||
bufferlist encoded, deletion time is not. With this leverage Ceph data | ||
structure versioning support as much as possible. | ||
|
||
Use negative return value style error handling, not exceptions. This | ||
follows from the Google C++ style guide and the general RGW style. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,142 @@ | ||
# SFS Versioning | ||
|
||
Refines ADR 0003-SFS on the object state machine and database columns. | ||
|
||
## Database | ||
|
||
In this document we only look at tables Objects and Versioned Objects. | ||
|
||
An *object* has a name, id, and reference to a *bucket*. | ||
|
||
An *object version* has an id, checksum, {create, commit, delete}_time, | ||
mtime, size, **state**, **type**, ETag, serialized attributes, etc. | ||
|
||
An *object* is a group of versions identified by bucket and name. | ||
|
||
An *object* has one or more *object versions*, regardless of the bucket versioning setting. | ||
|
||
Object **state** is an enum of Open, Committed, Deleted. See Object State Machine. | ||
|
||
Object **type** is an enum of Regular, Delete Marker. | ||
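To make the two tables concrete, here is one possible schema sketch. Table names, column names, column types and the integer values of the enums are assumptions for illustration; the ADR only fixes which attributes exist.

```python
import enum
import sqlite3

class State(enum.IntEnum):
    OPEN = 0
    COMMITTED = 1
    DELETED = 2

class VersionType(enum.IntEnum):
    REGULAR = 0
    DELETE_MARKER = 1

SCHEMA = """
CREATE TABLE objects (
    id     INTEGER PRIMARY KEY,
    bucket TEXT NOT NULL,
    name   TEXT NOT NULL,
    UNIQUE (bucket, name)          -- an object is identified by bucket and name
);
CREATE TABLE versioned_objects (
    id          INTEGER PRIMARY KEY,
    object_id   INTEGER NOT NULL REFERENCES objects(id),
    version_id  TEXT,
    checksum    TEXT,
    create_time INTEGER,
    commit_time INTEGER,
    delete_time INTEGER,
    mtime       INTEGER,
    size        INTEGER,
    state       INTEGER NOT NULL,  -- State enum
    type        INTEGER NOT NULL,  -- VersionType enum
    etag        TEXT,
    attrs       BLOB               -- serialized attributes, not queried by field
);
"""

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce them by default
db.executescript(SCHEMA)
with db:
    db.execute("INSERT INTO objects (id, bucket, name) VALUES (1, 'b', 'key')")
    db.execute(
        "INSERT INTO versioned_objects (object_id, state, type) VALUES (?, ?, ?)",
        (1, State.OPEN.value, VersionType.REGULAR.value),
    )

row = db.execute("SELECT state, type FROM versioned_objects").fetchone()
```

Every version row points at exactly one object row, so a non-versioned object is simply an object whose versions are managed by the commit rule rather than by the client.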
|
||
## Object State Machine | ||
|
||
```mermaid | ||
stateDiagram-v2 | ||
[*] --> Open : Create new version | ||
Open --> Committed : Success | ||
Open --> Open : Writing | ||
Committed --> Deleted : Delete | ||
Open --> [*] : Failure | ||
Deleted --> [*] : Garbage Collection | ||
``` | ||
|
||
*Open* - Initial state. Data is in flight. Data on disk is dirty. A | ||
version may stay in this state if, for example, a client fails during | ||
a PUT operation. A to-be-defined GC process cleans this up after a | ||
time. | ||
|
||
*Committed* - Upload finished. Data persisted. GETs and LIST | ||
will return this object. | ||
|
||
*Deleted* - Soft deletion state. In S3 terms the version is permanently deleted. | ||
|
||
*terminal state* - the row no longer exists in the database. | ||
|
||
An object will never move back from Deleted to Committed. Deleting a | ||
version is treated as permanent, even though it is only permanent after | ||
the GC made it so. | ||
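The state diagram and the rules above can be written down as a transition table. This is a sketch only: the `GONE` member is a hypothetical stand-in for the terminal state (row removed), and `can_transition` is an illustrative helper, not an SFS function.

```python
import enum

class State(enum.Enum):
    OPEN = "open"
    COMMITTED = "committed"
    DELETED = "deleted"
    GONE = "gone"  # hypothetical marker for the terminal state: row removed

# Allowed transitions, taken directly from the diagram.
ALLOWED = {
    State.OPEN: {State.OPEN, State.COMMITTED, State.GONE},  # writing, success, failure
    State.COMMITTED: {State.DELETED},                       # delete
    State.DELETED: {State.GONE},                            # garbage collection
    State.GONE: set(),
}

def can_transition(src, dst):
    """Return True if the state machine allows moving from src to dst."""
    return dst in ALLOWED[src]
```

For example, `can_transition(State.DELETED, State.COMMITTED)` is false, which encodes the rule that a deleted version never becomes visible again.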
|
||
State changes also update timestamps, as described in the next section. | ||
|
||
## Timestamps | ||
|
||
(See ADR 11 for information on the data type we store) | ||
|
||
Versioned objects store the following timestamps: | ||
|
||
*commit_time* - When the object changes to *Committed* state. | ||
|
||
*delete_time* - When the object changes to *Deleted* state. | ||
|
||
*create_time* - Set when the row is first created in *Open* state. | ||
|
||
*mtime* - A modification time passed to SFS. We follow the RADOS | ||
logic: Passed a `set_mtime` and `mtime`, we persist `set_mtime`. If | ||
`is_zero(set_mtime)` we take `real_clock::now()`. Return the `mtime` | ||
we persisted. | ||
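The *mtime* rule can be sketched as a small function. Timestamps are modelled here as integer nanoseconds since the epoch, and `resolve_mtime` with its injectable clock is a hypothetical helper; in RGW the values are `ceph::real_time` and the persisted value is returned through the `mtime` parameter.

```python
import time

def resolve_mtime(set_mtime_ns, now_ns=time.time_ns):
    """Return the mtime to persist: the caller's value, or now if it is zero."""
    if set_mtime_ns == 0:  # corresponds to is_zero(set_mtime)
        return now_ns()
    return set_mtime_ns
```

Calling `resolve_mtime(0)` yields the current time, while any non-zero value is persisted unchanged and handed back to the caller.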
|
||
## Object Version Types | ||
|
||
An object version may be of special type *delete marker*, representing | ||
an S3 delete marker in versioned buckets. | ||
|
||
## Garbage Collection | ||
|
||
Input: object versions in deleted state. | ||
|
||
Either delete single versions or delete whole objects. | ||
|
||
In the case of whole objects, delete the object row only if all of its versions are deleted. | ||
In a single transaction, delete all versions, then the object. | ||
|
||
Should a concurrent transaction add a new version, one of the two | ||
transactions will fail on the foreign key constraint between version and | ||
object. In the GC we continue with the next object; in the create path we | ||
retry or let the client retry. | ||
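A minimal sketch of the whole-object case, assuming hypothetical `objects` and `versions` tables and an integer encoding of the state enum. It uses a single connection, so the foreign key failure is triggered by a version that is still live rather than by a truly concurrent writer, but the control flow is the same: the transaction rolls back and the GC moves on.

```python
import sqlite3

COMMITTED, DELETED = 1, 2  # hypothetical integer encoding of the state enum

def gc_whole_object(db, object_id):
    """Hard-delete an object and its deleted versions in one transaction.

    Returns True if the object was removed, False if it had to be skipped.
    """
    try:
        with db:  # rolls back both statements if either one fails
            db.execute(
                "DELETE FROM versions WHERE object_id = ? AND state = ?",
                (object_id, DELETED),
            )
            db.execute("DELETE FROM objects WHERE id = ?", (object_id,))
        return True
    except sqlite3.IntegrityError:
        # A version that is not deleted still references the object.
        return False

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
CREATE TABLE objects (id INTEGER PRIMARY KEY);
CREATE TABLE versions (
    id        INTEGER PRIMARY KEY,
    object_id INTEGER NOT NULL REFERENCES objects(id),
    state     INTEGER NOT NULL
);
INSERT INTO objects (id) VALUES (1), (2);
INSERT INTO versions (object_id, state) VALUES (1, 2), (1, 2), (2, 2), (2, 1);
""")

collected = [gc_whole_object(db, 1), gc_whole_object(db, 2)]
remaining = db.execute(
    "SELECT object_id, state FROM versions ORDER BY id"
).fetchall()
```

Object 1 has only deleted versions and is removed. Object 2 still has a committed version, so the object delete violates the foreign key, the transaction rolls back, and its rows stay untouched for a later GC run.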
|
||
## Operations | ||
|
||
### Deletion | ||
|
||
Deleting a version is not to be confused with hiding objects by adding | ||
deletion markers. Deleting a version sets its state to DELETED. | ||
|
||
### Hiding / Deletion Marker | ||
|
||
A delete request to a versioned bucket without a version id creates a delete | ||
marker. This hides the object and all its versions from unversioned GETs | ||
and LISTs. | ||
|
||
SFS implements this with object versions having type delete marker. If | ||
one exists on an object, that object is excluded from those results. | ||
|
||
Deletion markers are object versions and follow the same state machine | ||
and garbage collection as regular versions do. They however don't have | ||
a size and data on the filesystem. | ||
|
||
### Create / Update | ||
|
||
(versioned, unversioned buckets) In a transaction: | ||
|
||
1. Find or create an object row (id, name, bucket) | ||
2. Create an object version in state OPEN with creation time of now. | ||
If versioned, use the version id provided. Otherwise generate one. | ||
|
||
Receive and write data. | ||
|
||
On completion: | ||
(versioned buckets) set the state of the previously created version to COMMITTED. Record | ||
checksum, size, etc. | ||
|
||
(unversioned buckets) set the state of the previously created version to COMMITTED | ||
AND set all other versions to DELETED. | ||
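The create and commit steps can be sketched like this. Table and column names, the integer state values, the use of `uuid4` for generated version ids and both function names are assumptions for illustration. The sketch skips versions that are already DELETED when marking the others, so that an existing `delete_time` is not overwritten; the ADR does not spell this detail out.

```python
import sqlite3
import time
import uuid

OPEN, COMMITTED, DELETED = 0, 1, 2  # hypothetical integer encoding

def open_version(db, bucket, name, version_id=None):
    """Find or create the object row, then add a version in state OPEN."""
    with db:
        db.execute(
            "INSERT OR IGNORE INTO objects (bucket, name) VALUES (?, ?)",
            (bucket, name),
        )
        (object_id,) = db.execute(
            "SELECT id FROM objects WHERE bucket = ? AND name = ?", (bucket, name)
        ).fetchone()
        cur = db.execute(
            "INSERT INTO versions (object_id, version_id, state, create_time) "
            "VALUES (?, ?, ?, ?)",
            (object_id, version_id or uuid.uuid4().hex, OPEN, time.time_ns()),
        )
    return object_id, cur.lastrowid

def commit_version(db, object_id, row_id, size, versioned):
    """Mark the version COMMITTED; in unversioned buckets delete the others."""
    now = time.time_ns()
    with db:
        db.execute(
            "UPDATE versions SET state = ?, commit_time = ?, size = ? WHERE id = ?",
            (COMMITTED, now, size, row_id),
        )
        if not versioned:
            db.execute(
                "UPDATE versions SET state = ?, delete_time = ? "
                "WHERE object_id = ? AND id != ? AND state != ?",
                (DELETED, now, object_id, row_id, DELETED),
            )

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE objects (
    id INTEGER PRIMARY KEY, bucket TEXT NOT NULL, name TEXT NOT NULL,
    UNIQUE (bucket, name)
);
CREATE TABLE versions (
    id INTEGER PRIMARY KEY,
    object_id INTEGER NOT NULL REFERENCES objects(id),
    version_id TEXT, state INTEGER NOT NULL, size INTEGER,
    create_time INTEGER, commit_time INTEGER, delete_time INTEGER
);
""")

# Two PUTs to the same key in an unversioned bucket.
for payload in (b"first", b"second"):
    object_id, row_id = open_version(db, "bucket", "key")
    commit_version(db, object_id, row_id, len(payload), versioned=False)

states = [s for (s,) in db.execute("SELECT state FROM versions ORDER BY id")]
```

After the second PUT the first version is DELETED and the second is COMMITTED, which is how an update to an unversioned object is expressed in terms of versions.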
|
||
### Access: Listing, GETs | ||
|
||
Many operations expect a head, null, last or "canonical" version of | ||
an object. We define this as the *last* committed version, having the *latest* | ||
*commit_time* timestamp. | ||
|
||
If the timestamp resolution is too low to distinguish versions, the highest id wins. | ||
|
||
(unversioned) Since we rely on versions to implement updates, more | ||
than one committed version may exist. We make this less likely by the commit rule in | ||
Create / Update above. Should multiple exist, the latest is the one we use. | ||
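The selection rule for the canonical version maps onto a single query. The table layout, the integer state values and the function name `canonical_version` are assumptions for this sketch.

```python
import sqlite3

COMMITTED = 1  # hypothetical integer encoding of the state enum

def canonical_version(db, object_id):
    """Return the id of the last committed version, or None if there is none."""
    row = db.execute(
        "SELECT id FROM versions "
        "WHERE object_id = ? AND state = ? "
        "ORDER BY commit_time DESC, id DESC LIMIT 1",
        (object_id, COMMITTED),
    ).fetchone()
    return None if row is None else row[0]

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE versions (
    id INTEGER PRIMARY KEY, object_id INTEGER, state INTEGER, commit_time INTEGER
);
INSERT INTO versions (id, object_id, state, commit_time) VALUES
    (1, 1, 1, 100),
    (2, 1, 1, 200),
    (3, 1, 1, 200),   -- same commit_time as id 2: the highest id wins
    (4, 1, 0, NULL);  -- still open, never a candidate
""")

head = canonical_version(db, 1)
missing = canonical_version(db, 2)
```

Sorting by `commit_time` first and `id` second implements the tie-break, and the same query covers unversioned buckets where more than one committed version happens to exist.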
|
||
### Out of scope (for now): Versioning Suspended Buckets | ||
|
||
For now SFS won't support versioning suspended buckets. Object | ||
versions created while the bucket was unversioned or in versioning | ||
suspended state have a 'null' version id. This is not directly | ||
supported by this design and requires refinement in the future. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
<!-- #cSpell:words multicol datefunc --> | ||
# SFS Timestamps | ||
|
||
## Context and Problem Statement | ||
|
||
We need to create, handle and persist timestamps. | ||
|
||
In the RGW space we have ceph_time.h and std::chrono. | ||
|
||
SQLite represents time as ISO-8601 strings, Julian day numbers or Unix | ||
timestamps, stored in TEXT, REAL or INTEGER data types respectively. See [SQLite doc: | ||
data types](https://www.sqlite.org/datatype3.html). It has functions to | ||
work with these representations. See [SQLite doc: | ||
datefunc](https://www.sqlite.org/lang_datefunc.html). | ||
|
||
This doc is about the conversion between RGW/sfs and SQLite space. | ||
|
||
Summary of discussion in weeks 19, 20 2023. [GH | ||
Comments](https://github.com/aquarist-labs/s3gw/pull/497) | ||
|
||
## Ceph time | ||
|
||
We use `ceph::real_time` as timestamps. | ||
|
||
`ceph::real_time` is a `uint64_t` nanosecond count since epoch. | ||
|
||
## How to store time in SQLite | ||
|
||
### Requirements | ||
|
||
Minimum microsecond resolution. | ||
|
||
We can leverage SQLite range queries, sorting, indices. Ideally we can | ||
leverage date / time functions with minor conversion. | ||
|
||
### Considered Options | ||
|
||
Options that don't fit our requirements: | ||
|
||
- *ISO8601* strings. Not enough resolution. | ||
- *UNIX time* - Not enough resolution | ||
|
||
- *multicol* - Store seconds as *ISO8601* or *UNIX time* (SQLite | ||
functions work directly). Store nanoseconds in a second column. | ||
Queries awkward. | ||
- *int64 ns* - Store as int64, nanosecond resolution, convert from | ||
uint64. Max value up to year 2262. Queries work. | ||
- *int64 us* - Store as int64, microsecond resolution, convert from | ||
uint64. Max value far beyond year 2262 (roughly year 294,000). Queries work. | ||
- *hex* - Store as 16 char hex string. Full `ceph::real_time` range. | ||
Conversion cost. Queries work. | ||
- *uint64* - Squeeze `uint64_t` into SQLite `INTEGER` type | ||
representation. No queries. | ||
- *blob* - Store `ceph::real_time` as an SQLite blob. Can't query and | ||
index as easily as *hex* or *int64* options. | ||
|
||
### Decision Outcome | ||
|
||
We choose *int64 ns*, because it meets all criteria. | ||
|
||
It requires minimal conversion. We can live with the year 2262 | ||
limitation. |
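A short sketch of the chosen *int64 ns* representation, showing the range check, where the year 2262 limit comes from, and the minor conversion needed for SQLite date functions. Timestamps are modelled as plain integers; `to_sqlite_ns` and the table `t` are hypothetical names for this example.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

INT64_MAX = 2**63 - 1
NS_PER_S = 10**9

def to_sqlite_ns(ns_since_epoch):
    """Check that a uint64 nanosecond count fits SQLite's signed INTEGER."""
    if not 0 <= ns_since_epoch <= INT64_MAX:
        raise OverflowError("timestamp does not fit into a signed 64-bit INTEGER")
    return ns_since_epoch

# Largest representable point in time, truncated to whole seconds.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
last_representable = epoch + timedelta(seconds=INT64_MAX // NS_PER_S)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (commit_time INTEGER)")
stamp = to_sqlite_ns(1683244800 * NS_PER_S + 123)  # 2023-05-05 00:00:00 UTC + 123 ns
with db:
    db.execute("INSERT INTO t VALUES (?)", (stamp,))

# Integer division by 1e9 turns the value into a Unix timestamp for datetime().
stored, text = db.execute(
    "SELECT commit_time, datetime(commit_time / 1000000000, 'unixepoch') FROM t"
).fetchone()
```

The stored value round-trips unchanged, compares and sorts as a normal integer, and the only conversion needed for date functions is the division by one billion.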
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -53,3 +53,6 @@ pytest | |
proxying | ||
COSI | ||
objectstorage | ||
bufferlist | ||
ETag | ||
chrono |
maybe move this to `docs/design`? Makes more sense I think.