SFS ADRs based on Design Review Session 2023-05-05 #497

irq0 · 2023-05-05T15:25:51Z

Signed-off-by: Marcel Lauhoff marcel.lauhoff@suse.com

giubacc · 2023-05-08T07:22:53Z

I've found what the null version is: https://docs.aws.amazon.com/AmazonS3/latest/userguide/AddingObjectstoVersionSuspendedBuckets.html

Basically the null version occurs whenever a versioned bucket becomes unversioned because versioning has been suspended.
In this case, if you PUT an object, its version becomes the null version, that is, a fake id denoting the null version (that can be overwritten by subsequent PUTs).
If you delete an object on a versioned bucket a it is needed to check if the null version exists and in case it will be deleted.
So the null version is the transposition of the "regular" object of an unversioned bucket in the world of the versioned buckets.
I think it could be worth to add this information to the ADR.

jecluis · 2023-05-08T09:00:30Z

docs/decisions/0010-sfs-versioning.md

maybe move this to docs/design ? Makes more sense I think.

jecluis

LGTM, but I'd like the doc to be moved to docs/design/ instead. That, or we'll need to bring the other doc there here. I do think there's an advantage in keeping design docs all in the same place though.

irq0 · 2023-05-08T13:15:27Z

LGTM, but I'd like the doc to be moved to docs/design/ instead. That, or we'll need to bring the other doc there here. I do think there's an advantage in keeping design docs all in the same place though.

I don't have a strong opinion on either one directory or two.
Only thing I would love to have is a README that tells us what goes where. I can't intuitively distinguish between a "architecture decision record" and a design document.

So, what goes where?

doc/decisions

how we work (0000-use-markdown-any-decision-records.md, 0001-reorganize-documentation.md, 0004-release-captain.md, 0005-project-labels.md, 0006-git-branches-tags.md, 0007-release-process.md, 0008-pull-requests-code-review-guide.md)

doc/design

how we want the software to work (0002-s3gw-storage-backend.md, 0003-sfs.md, 0009-sfs-collection-of-high-level-design-decisions.md, 0010-sfs-versioning.md)

I think doc/decisions is too confusing. Suggest the following:

doc/decisions/design - how we want the software to work
doc/decisions/how-we-work - how we work docs from above

jecluis · 2023-05-08T13:20:29Z

* how we work (0000-use-markdown-any-decision-records.md, 0001-reorganize-documentation.md, 0004-release-captain.md, 0005-project-labels.md, 0006-git-branches-tags.md, 0007-release-process.md, 0008-pull-requests-code-review-guide.md)

doc/design

* how we want the software to work (0002-s3gw-storage-backend.md, 0003-sfs.md, 0009-sfs-collection-of-high-level-design-decisions.md, 0010-sfs-versioning.md)

I think doc/decisions is too confusing. Suggest the following:

* doc/decisions/design - how we want the software to work

* doc/decisions/how-we-work - how we work docs from above

That division sounds good to me. We might want to renumber them though, per subdir? @jhmarina thoughts?

jecluis · 2023-05-08T13:21:26Z

@irq0 in the meantime, given that's a tangential topic, lets just go ahead and merge this. That can be handled on a different PR.

jecluis · 2023-05-08T13:22:08Z

@irq0 in the meantime, given that's a tangential topic, lets just go ahead and merge this. That can be handled on a different PR.

And I mean "merge once everybody has reviewed and is happy (or not too sad)" :)

0xavi0 · 2023-05-11T10:00:51Z

docs/decisions/0010-sfs-versioning.md

+Many operations expect a head, null, last or "canonical" version of
+an object. We define this as the *last* version, having the *latest*
+creation time.


The current implementation uses de id of the version. The greater id is the latest version. (id is the primary key of the versioned object table)
If we want to implement this based on time... Wouldn't it be better to use modification time instead to creation time?. In case that 2 threads are putting a version in parallel, using the creation time means the last trying to write wins, while using the modification time means the last that finished writing wins. I think it makes more sense that the one winning is the last writing?

Also, if we want to query what's the latest version based on creation/modification time we might need to change the schema, because right now those time columns are stored as ceph::real_time which is stored as a BLOB in the database.

Problem is that maybe the SQLite data structures have not enough precision?

From the SQLite documentation

SQLite does not have a storage class set aside for storing dates and/or times. Instead, the built-in [Date And Time Functions](https://www.sqlite.org/lang_datefunc.html) of SQLite are capable of storing dates and times as TEXT, REAL, or INTEGER values: TEXT as ISO8601 strings ("YYYY-MM-DD HH:MM:SS.SSS"). REAL as Julian day numbers, the number of days since noon in Greenwich on November 24, 4714 B.C. according to the proleptic Gregorian calendar. INTEGER as Unix Time, the number of seconds since 1970-01-01 00:00:00 UTC.

Also, if we want to query what's the latest version based on creation/modification time we might need to change the schema, because right now those time columns are stored as ceph::real_time which is stored as a BLOB in the database.

On the multipart work I'm using timestamps to store a ceph::real_time instead, which I think is stored as a double. The conversion is trivial to do, and we get to use those fields. I didn't think about the necessary precision though, I should double check that.

Looks like we need to standardize on a time format and add that to the high level doc.

For reference:
ceph::real_time is std::chrono::time_point<real_clock>
real_clock is Ceph's high resolution real time clock. Nanosecond resolution.

I'd love to keep as much resolution as we can. Minimum microseconds. At the same time leveraging sqlite indices, range queries, sorting is important too.

Options:

Store as ISO8601 strings. We loose resolution. Values are monotonically increasing. order by and range queries work nicely. We can use sqlite datetime functions.

Store as integer. Unix time is second resolution = meh. So store micos? nanos? SQLite has max 64bit integers. I don't thing struct timespec fits good into this.

Store as multiple columns. ISO8601 string + nanoseconds part or two integers. First for search, second for resolution.

Store as blob. I read somewhere[42] that sqlite does comparisons based on memcmp on this. If we encode the time data right, we can have it comparable and monotonically increasing. No idea if we can do BETWEEN queries and how it indexes. Probably the least desirable format.

What else?

[42] https://www.sqlitetutorial.net/sqlite-data-types/

IMO, we should try to keep the same precision that rgw has... (nanos).
I would prefer not to have things in more than 1 column.

Isn't 2 ^ 64 enough? .... it should cover up to year 2554, right?
I think SQLite stores signed INTEGERS, though....

I'm using ceph::real_clock::to_double(mp.mtime) to obtain a double, which I assumed would capture a high resolution clock value, but now I'm not so sure. It seems to translate to

std::chrono::duration<double>(t.time_since_epoch()).count();

which seems to be using the default for std::chrono::duration, that being "seconds" as far as I can tell.

And finally... something I wanted to investigate long ago when all this started and it's custom type binding in sqlite_orm

Very cool! Wanted to try the binding support for a while as well :). Looks good. It could replace the conversion layer we have if we add bindings for the ceph types, right?

This would move the conversion to another place, but the conversion code will be still needed.
We won't see any performance gain. (and the code to be maintained will be highly templated code, which not everybody loves)

This is something I wanted to investigate since the very beginning but given that we didn't had the time I could not go deeper.

I think it makes user code cleaner, because we work with our own types.

Anyway, I would maybe do it for the timestamps in this iteration and will try to change the rest when the versioning stuff is done. I don't think this is as urgent as versioning work.

Test code for storing ceph::real_clock::time_point as int64 and checking that values stored or read are not negative (throws a std::runtime_error in that case)

I can order_by and use max and min from the sqlite storage class same as with the 16 chars hex string representation.

Trying to pass a time_point like ceph::real_clock::time_point(std::chrono::nanoseconds(std::numeric_limits<uint64_t>::max())) won't be accepted.

Error converting ceph::real_clock::time_point to int64. Nanoseconds value: 18446744073709551615 is out of range

namespace ceph { static int64_t to_int64(const ceph::real_clock::time_point& t) { uint64_t nanos = std::chrono::duration_cast<std::chrono::nanoseconds>(t.time_since_epoch()).count(); // we check that the value is not greater than int64 max if (nanos > std::numeric_limits<int64_t>::max()) { std::stringstream oss; oss << "Error converting ceph::real_clock::time_point to int64. Nanoseconds value: " << nanos << " is out of range"; throw std::runtime_error(oss.str()); } return static_cast<int64_t>(nanos); } static std::optional<ceph::real_clock::time_point> from_int64(int64_t int_time) { std::optional<ceph::real_clock::time_point> ret; if (int_time < 0) { // to ensure that we stick to the int64 positive range. std::stringstream oss; oss << "Error converting int64 nanoseconds value to ceph::real_cock::time_point. Value: " << int_time << " is out of range"; throw std::runtime_error(oss.str()); } uint64_t uint64_nanos = static_cast<uint64_t>(int_time); ceph::real_clock::time_point ret_time = ceph::real_clock::time_point(std::chrono::nanoseconds(uint64_nanos)); ret = ret_time; return ret; } } // namespace ceph namespace sqlite_orm { template<> struct type_printer<ceph::real_clock::time_point> : public integer_printer {}; template<> struct statement_binder<ceph::real_clock::time_point> { int bind(sqlite3_stmt* stmt, int index, const ceph::real_clock::time_point& value) const { return statement_binder<uint64_t>().bind(stmt, index, to_int64(value)); } }; template<> struct field_printer<ceph::real_clock::time_point> { std::string operator()(const ceph::real_clock::time_point& t) const { auto int_value = ceph::to_int64(t); return std::to_string(int_value); } }; template<> struct row_extractor<ceph::real_clock::time_point> { ceph::real_clock::time_point extract(int64_t row_value) const { if(row_value < 0) { throw std::runtime_error("incorrect encoded timestamp int64. (" + std::to_string(row_value) + ") is out of range"); } auto sd = ceph::from_int64(row_value); if(sd) { return sd.value(); } else { throw std::runtime_error("incorrect encoded timestamp int64 (" + std::to_string(row_value) + ")"); } } ceph::real_clock::time_point extract(sqlite3_stmt* stmt, int columnIndex) const { std::cout << "Extracting value" << std::endl; auto int_value = sqlite3_column_int64(stmt, columnIndex); return this->extract(int_value); } }; }

As I know there is a lot of noise I'm making a summary:

Options I see so far.

Store timestamps as int64 (which will limit our timestamps range up to year 2262)

Store timestamps as 16 chars hex string (we can store the whole uint64 range and, if in the future upstream adds more precision we can increase the number of chars... This comes with a conversion code that adds latency)

Store as uint64 (but we cannot query by timestamps because sqlite takes sign into account for order_by and max, main)

Keep storing it as a blob if we don't need to query by timestamp.

And... for deciding how to identify what's the last version of an object.

Max creation time

Max modification time (or another field that we create so the last thread writing wins)

Max version id (I think we can go for this in favor of max creation time, because they are equivalents and we don't need to change the timestamps code. When we set creation time we also create a new version id, so the greater one will still be the latest one)

I'd like to add option int64, but with microsecond resolution (max 2554-07-21 23:34:33.709551).

Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>

irq0 · 2023-05-15T11:59:38Z

Last push adds:
Timestamp discussion as adr 11
ADR 10: Timestamps to record, refinement of the last version rule.

Notes from the design review session 2023-05-05 Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>

Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>

irq0 · 2023-05-15T13:47:16Z

Now has a bit about the set_mtime / mtime

0xavi0

lgtm

Changes the db schema following [this ADR](https://github.com/aquarist-labs/s3gw/pull/497). It also deletes repeated fields that we had in objects vs versioned_objects tables. Logic of `sfs` was intentionally not updated in this PR. (For example new field `version_type` is added to the schema, but still not used). Updates in logic, specially in versioning logic, will come in a future PR. New Features: **Sqliteorm type bindings for the following types.** * `uuid_d` * `ceph::real_time` * `rgw::sal::sfs::ObjetState` * `rgw::sal::sfs::VersionType` In general, any enum type that has a `LAST_VALUE` item assigned to the last valid value and that also starts with 0 is eligible to be binded to sqliteorm, as the type binding for enums is generic. For example: you can create an enum like: ```c++ enum class TestEnum { WHATEVER = 0, SOMETHING_ELSE, AND_THIS, LAST_VALUE = AND_THIS }; ``` and use it right away in sqliteorm. **Object has no explicit conversion code** As all the types in the `object` are compatible with `sqliteorm` we don't need extra conversion layer. This is a preview of what we can do with the rest of database objects once `BLOBS` are also type binded in the future. That was not done in this PR to avoid too many changes. Note: The new type `VersionType` was added as a new file `version_type.h` because we are in a re-design process and I wasn't sure if it should be located in any other existing file that might be eligible to be deleted in the near future. Fixes: https://github.com/aquarist-labs/s3gw/issues/480 Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>

Fixes: https://github.com/aquarist-labs/s3gw/issues/672 References: - https://github.com/aquarist-labs/s3gw/issues/534 - https://github.com/aquarist-labs/s3gw/pull/497 Signed-off-by: Volker Theile <vtheile@suse.com>

…abled This is to prevent the user from disabling the versioning because right now you can go from: - unversioned to versioned any time. - versioned -> unversioned is not supported in S3. - versioned -> suspended versioned is not supported by us. Fixes: https://github.com/aquarist-labs/s3gw/issues/672 References: - https://github.com/aquarist-labs/s3gw/issues/534 - https://github.com/aquarist-labs/s3gw/pull/497 Signed-off-by: Volker Theile <vtheile@suse.com>

Changes the db schema following [this ADR](https://github.com/aquarist-labs/s3gw/pull/497). It also deletes repeated fields that we had in objects vs versioned_objects tables. Logic of `sfs` was intentionally not updated in this PR. (For example new field `version_type` is added to the schema, but still not used). Updates in logic, specially in versioning logic, will come in a future PR. New Features: **Sqliteorm type bindings for the following types.** * `uuid_d` * `ceph::real_time` * `rgw::sal::sfs::ObjetState` * `rgw::sal::sfs::VersionType` In general, any enum type that has a `LAST_VALUE` item assigned to the last valid value and that also starts with 0 is eligible to be binded to sqliteorm, as the type binding for enums is generic. For example: you can create an enum like: ```c++ enum class TestEnum { WHATEVER = 0, SOMETHING_ELSE, AND_THIS, LAST_VALUE = AND_THIS }; ``` and use it right away in sqliteorm. **Object has no explicit conversion code** As all the types in the `object` are compatible with `sqliteorm` we don't need extra conversion layer. This is a preview of what we can do with the rest of database objects once `BLOBS` are also type binded in the future. That was not done in this PR to avoid too many changes. Note: The new type `VersionType` was added as a new file `version_type.h` because we are in a re-design process and I wasn't sure if it should be located in any other existing file that might be eligible to be deleted in the near future. Fixes: https://github.com/aquarist-labs/s3gw/issues/480 Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>

irq0 self-assigned this May 5, 2023

irq0 added kind/design Design work area/rgw-sfs RGW & SFS related labels May 5, 2023

irq0 requested review from jecluis, 0xavi0 and giubacc May 5, 2023 15:31

jecluis reviewed May 8, 2023

View reviewed changes

docs/decisions/0010-sfs-versioning.md

Copy link

Contributor

jecluis May 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe move this to docs/design ? Makes more sense I think.

jecluis requested changes May 8, 2023

View reviewed changes

jhmarina added this to the v0.16.0 milestone May 8, 2023

jecluis previously approved these changes May 8, 2023

View reviewed changes

This was referenced May 8, 2023

Formalize Object State Machine and Recovery Logic #364

Closed

Reconcile possible duplicated columns in object and versioned object tables #480

Closed

0xavi0 reviewed May 11, 2023

View reviewed changes

irq0 mentioned this pull request May 12, 2023

PUT operation on unversioned objects not always atomic #462

Closed

adr: SFS high level design decisions

947f934

Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>

irq0 dismissed jecluis’s stale review via 345ca87 May 15, 2023 11:57

irq0 force-pushed the pr/sfs-adrs-2023-05-05 branch 2 times, most recently from 345ca87 to 65cf391 Compare May 15, 2023 11:58

jhmarina modified the milestones: v0.16.0, v0.17.0 May 15, 2023

0xavi0 mentioned this pull request May 15, 2023

Versioning improvements (Epic) #514

Closed

Marcel Lauhoff added 2 commits May 15, 2023 15:45

adr: SFS versioning

40bc1bb

Notes from the design review session 2023-05-05 Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>

adr: SFS timestamps

33a3d26

Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>

irq0 force-pushed the pr/sfs-adrs-2023-05-05 branch from 65cf391 to 33a3d26 Compare May 15, 2023 13:45

0xavi0 approved these changes May 15, 2023

View reviewed changes

giubacc approved these changes May 15, 2023

View reviewed changes

jecluis approved these changes May 15, 2023

View reviewed changes

irq0 merged commit 6ca08db into s3gw-tech:main May 15, 2023

0xavi0 mentioned this pull request May 16, 2023

rgw/sfs: change db schema as described in ADR. aquarist-labs/ceph#150

Merged

11 tasks

irq0 mentioned this pull request May 17, 2023

Disable bucket transition to versioning suspended #534

Closed

0xavi0 mentioned this pull request May 18, 2023

Check and improve versioning s3tests #472

Closed

votdev mentioned this pull request Aug 16, 2023

Disable versioning checkbox after bucket creation s3gw-tech/s3gw-ui#252

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SFS ADRs based on Design Review Session 2023-05-05 #497

SFS ADRs based on Design Review Session 2023-05-05 #497

irq0 commented May 5, 2023

giubacc commented May 8, 2023

jecluis May 8, 2023

jecluis left a comment

irq0 commented May 8, 2023

jecluis commented May 8, 2023

jecluis commented May 8, 2023

jecluis commented May 8, 2023

0xavi0 May 11, 2023 •

edited

Loading

jecluis May 11, 2023

irq0 May 11, 2023

0xavi0 May 11, 2023 •

edited

Loading

jecluis May 11, 2023

irq0 May 12, 2023

0xavi0 May 15, 2023

0xavi0 May 15, 2023 •

edited

Loading

0xavi0 May 15, 2023 •

edited

Loading

irq0 May 15, 2023

irq0 commented May 15, 2023

irq0 commented May 15, 2023

0xavi0 left a comment

SFS ADRs based on Design Review Session 2023-05-05 #497

SFS ADRs based on Design Review Session 2023-05-05 #497

Conversation

irq0 commented May 5, 2023

giubacc commented May 8, 2023

jecluis May 8, 2023

Choose a reason for hiding this comment

jecluis left a comment

Choose a reason for hiding this comment

irq0 commented May 8, 2023

jecluis commented May 8, 2023

jecluis commented May 8, 2023

jecluis commented May 8, 2023

0xavi0 May 11, 2023 • edited Loading

Choose a reason for hiding this comment

jecluis May 11, 2023

Choose a reason for hiding this comment

irq0 May 11, 2023

Choose a reason for hiding this comment

0xavi0 May 11, 2023 • edited Loading

Choose a reason for hiding this comment

jecluis May 11, 2023

Choose a reason for hiding this comment

irq0 May 12, 2023

Choose a reason for hiding this comment

0xavi0 May 15, 2023

Choose a reason for hiding this comment

0xavi0 May 15, 2023 • edited Loading

Choose a reason for hiding this comment

0xavi0 May 15, 2023 • edited Loading

Choose a reason for hiding this comment

irq0 May 15, 2023

Choose a reason for hiding this comment

irq0 commented May 15, 2023

irq0 commented May 15, 2023

0xavi0 left a comment

Choose a reason for hiding this comment

0xavi0 May 11, 2023 •

edited

Loading

0xavi0 May 11, 2023 •

edited

Loading

0xavi0 May 15, 2023 •

edited

Loading

0xavi0 May 15, 2023 •

edited

Loading