-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SFS ADRs based on Design Review Session 2023-05-05 #497
Conversation
I've found what the Basically the null version occurs whenever a versioned bucket becomes unversioned because versioning has been suspended. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe move this to docs/design
? Makes more sense I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I'd like the doc to be moved to docs/design/
instead. That, or we'll need to bring the other doc there here. I do think there's an advantage in keeping design docs all in the same place though.
I don't have a strong opinion on either one directory or two. So, what goes where? doc/decisions
doc/design
I think doc/decisions is too confusing. Suggest the following:
|
That division sounds good to me. We might want to renumber them though, per subdir? @jhmarina thoughts? |
@irq0 in the meantime, given that's a tangential topic, lets just go ahead and merge this. That can be handled on a different PR. |
And I mean "merge once everybody has reviewed and is happy (or not too sad)" :) |
Many operations expect a head, null, last or "canonical" version of | ||
an object. We define this as the *last* version, having the *latest* | ||
creation time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation uses de id
of the version. The greater id
is the latest version. (id
is the primary key of the versioned object table)
If we want to implement this based on time... Wouldn't it be better to use modification time instead to creation time?. In case that 2 threads are putting a version in parallel, using the creation time means the last trying to write wins, while using the modification time means the last that finished writing wins. I think it makes more sense that the one winning is the last writing?
Also, if we want to query what's the latest version based on creation/modification time we might need to change the schema, because right now those time columns are stored as ceph::real_time
which is stored as a BLOB
in the database.
Problem is that maybe the SQLite
data structures have not enough precision?
From the SQLite documentation
SQLite does not have a storage class set aside for storing dates and/or times. Instead, the built-in [Date And Time Functions](https://www.sqlite.org/lang_datefunc.html) of SQLite are capable of storing dates and times as TEXT, REAL, or INTEGER values:
TEXT as ISO8601 strings ("YYYY-MM-DD HH:MM:SS.SSS").
REAL as Julian day numbers, the number of days since noon in Greenwich on November 24, 4714 B.C. according to the proleptic Gregorian calendar.
INTEGER as Unix Time, the number of seconds since 1970-01-01 00:00:00 UTC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if we want to query what's the latest version based on creation/modification time we might need to change the schema, because right now those time columns are stored as
ceph::real_time
which is stored as aBLOB
in the database.
On the multipart work I'm using timestamps to store a ceph::real_time
instead, which I think is stored as a double. The conversion is trivial to do, and we get to use those fields. I didn't think about the necessary precision though, I should double check that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we need to standardize on a time format and add that to the high level doc.
For reference:
ceph::real_time
is std::chrono::time_point<real_clock>
real_clock
is Ceph's high resolution real time clock. Nanosecond resolution.
I'd love to keep as much resolution as we can. Minimum microseconds. At the same time leveraging sqlite indices, range queries, sorting is important too.
Options:
- Store as ISO8601 strings. We loose resolution. Values are monotonically increasing. order by and range queries work nicely. We can use sqlite datetime functions.
- Store as integer. Unix time is second resolution = meh. So store micos? nanos? SQLite has max 64bit integers. I don't thing struct timespec fits good into this.
- Store as multiple columns. ISO8601 string + nanoseconds part or two integers. First for search, second for resolution.
- Store as blob. I read somewhere[42] that sqlite does comparisons based on memcmp on this. If we encode the time data right, we can have it comparable and monotonically increasing. No idea if we can do BETWEEN queries and how it indexes. Probably the least desirable format.
What else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we should try to keep the same precision that rgw
has... (nanos).
I would prefer not to have things in more than 1 column.
Isn't 2 ^ 64 enough? .... it should cover up to year 2554, right?
I think SQLite stores signed INTEGERS, though....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using ceph::real_clock::to_double(mp.mtime)
to obtain a double, which I assumed would capture a high resolution clock value, but now I'm not so sure. It seems to translate to
std::chrono::duration<double>(t.time_since_epoch()).count();
which seems to be using the default for std::chrono::duration
, that being "seconds" as far as I can tell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And finally... something I wanted to investigate long ago when all this started and it's custom type binding in
sqlite_orm
Very cool! Wanted to try the binding support for a while as well :). Looks good. It could replace the conversion layer we have if we add bindings for the ceph types, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would move the conversion to another place, but the conversion code will be still needed.
We won't see any performance gain. (and the code to be maintained will be highly templated code, which not everybody loves)
This is something I wanted to investigate since the very beginning but given that we didn't had the time I could not go deeper.
I think it makes user code cleaner, because we work with our own types.
Anyway, I would maybe do it for the timestamps in this iteration and will try to change the rest when the versioning stuff is done. I don't think this is as urgent as versioning work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test code for storing ceph::real_clock::time_point
as int64
and checking that values stored or read are not negative (throws a std::runtime_error
in that case)
I can order_by
and use max
and min
from the sqlite
storage class same as with the 16 chars hex string representation.
Trying to pass a time_point
like ceph::real_clock::time_point(std::chrono::nanoseconds(std::numeric_limits<uint64_t>::max()))
won't be accepted.
Error converting ceph::real_clock::time_point to int64. Nanoseconds value: 18446744073709551615 is out of range
namespace ceph {
static int64_t to_int64(const ceph::real_clock::time_point& t) {
uint64_t nanos = std::chrono::duration_cast<std::chrono::nanoseconds>(t.time_since_epoch()).count();
// we check that the value is not greater than int64 max
if (nanos > std::numeric_limits<int64_t>::max()) {
std::stringstream oss;
oss << "Error converting ceph::real_clock::time_point to int64. Nanoseconds value: " << nanos << " is out of range";
throw std::runtime_error(oss.str());
}
return static_cast<int64_t>(nanos);
}
static std::optional<ceph::real_clock::time_point> from_int64(int64_t int_time) {
std::optional<ceph::real_clock::time_point> ret;
if (int_time < 0) {
// to ensure that we stick to the int64 positive range.
std::stringstream oss;
oss << "Error converting int64 nanoseconds value to ceph::real_cock::time_point. Value: " << int_time << " is out of range";
throw std::runtime_error(oss.str());
}
uint64_t uint64_nanos = static_cast<uint64_t>(int_time);
ceph::real_clock::time_point ret_time = ceph::real_clock::time_point(std::chrono::nanoseconds(uint64_nanos));
ret = ret_time;
return ret;
}
} // namespace ceph
namespace sqlite_orm
{
template<>
struct type_printer<ceph::real_clock::time_point> : public integer_printer {};
template<>
struct statement_binder<ceph::real_clock::time_point> {
int bind(sqlite3_stmt* stmt, int index, const ceph::real_clock::time_point& value) const {
return statement_binder<uint64_t>().bind(stmt, index, to_int64(value));
}
};
template<>
struct field_printer<ceph::real_clock::time_point> {
std::string operator()(const ceph::real_clock::time_point& t) const {
auto int_value = ceph::to_int64(t);
return std::to_string(int_value);
}
};
template<>
struct row_extractor<ceph::real_clock::time_point> {
ceph::real_clock::time_point extract(int64_t row_value) const {
if(row_value < 0) {
throw std::runtime_error("incorrect encoded timestamp int64. (" + std::to_string(row_value) + ") is out of range");
}
auto sd = ceph::from_int64(row_value);
if(sd) {
return sd.value();
} else {
throw std::runtime_error("incorrect encoded timestamp int64 (" + std::to_string(row_value) + ")");
}
}
ceph::real_clock::time_point extract(sqlite3_stmt* stmt, int columnIndex) const {
std::cout << "Extracting value" << std::endl;
auto int_value = sqlite3_column_int64(stmt, columnIndex);
return this->extract(int_value);
}
};
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I know there is a lot of noise I'm making a summary:
Options I see so far.
- Store timestamps as
int64
(which will limit our timestamps range up to year 2262) - Store timestamps as 16 chars hex string (we can store the whole uint64 range and, if in the future upstream adds more precision we can increase the number of chars... This comes with a conversion code that adds latency)
- Store as
uint64
(but we cannot query by timestamps because sqlite takes sign into account for order_by and max, main) - Keep storing it as a
blob
if we don't need to query by timestamp.
And... for deciding how to identify what's the last version of an object.
- Max creation time
- Max modification time (or another field that we create so the last thread writing wins)
- Max version id (I think we can go for this in favor of max creation time, because they are equivalents and we don't need to change the timestamps code. When we set creation time we also create a new version id, so the greater one will still be the latest one)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to add option int64, but with microsecond resolution (max 2554-07-21 23:34:33.709551).
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
345ca87
to
65cf391
Compare
Last push adds: |
Notes from the design review session 2023-05-05 Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
Signed-off-by: Marcel Lauhoff <marcel.lauhoff@suse.com>
65cf391
to
33a3d26
Compare
Now has a bit about the set_mtime / mtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Changes the db schema following [this ADR](https://github.com/aquarist-labs/s3gw/pull/497). It also deletes repeated fields that we had in objects vs versioned_objects tables. Logic of `sfs` was intentionally not updated in this PR. (For example new field `version_type` is added to the schema, but still not used). Updates in logic, specially in versioning logic, will come in a future PR. New Features: **Sqliteorm type bindings for the following types.** * `uuid_d` * `ceph::real_time` * `rgw::sal::sfs::ObjetState` * `rgw::sal::sfs::VersionType` In general, any enum type that has a `LAST_VALUE` item assigned to the last valid value and that also starts with 0 is eligible to be binded to sqliteorm, as the type binding for enums is generic. For example: you can create an enum like: ```c++ enum class TestEnum { WHATEVER = 0, SOMETHING_ELSE, AND_THIS, LAST_VALUE = AND_THIS }; ``` and use it right away in sqliteorm. **Object has no explicit conversion code** As all the types in the `object` are compatible with `sqliteorm` we don't need extra conversion layer. This is a preview of what we can do with the rest of database objects once `BLOBS` are also type binded in the future. That was not done in this PR to avoid too many changes. Note: The new type `VersionType` was added as a new file `version_type.h` because we are in a re-design process and I wasn't sure if it should be located in any other existing file that might be eligible to be deleted in the near future. Fixes: https://github.com/aquarist-labs/s3gw/issues/480 Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
Changes the db schema following [this ADR](https://github.com/aquarist-labs/s3gw/pull/497). It also deletes repeated fields that we had in objects vs versioned_objects tables. Logic of `sfs` was intentionally not updated in this PR. (For example new field `version_type` is added to the schema, but still not used). Updates in logic, specially in versioning logic, will come in a future PR. New Features: **Sqliteorm type bindings for the following types.** * `uuid_d` * `ceph::real_time` * `rgw::sal::sfs::ObjetState` * `rgw::sal::sfs::VersionType` In general, any enum type that has a `LAST_VALUE` item assigned to the last valid value and that also starts with 0 is eligible to be binded to sqliteorm, as the type binding for enums is generic. For example: you can create an enum like: ```c++ enum class TestEnum { WHATEVER = 0, SOMETHING_ELSE, AND_THIS, LAST_VALUE = AND_THIS }; ``` and use it right away in sqliteorm. **Object has no explicit conversion code** As all the types in the `object` are compatible with `sqliteorm` we don't need extra conversion layer. This is a preview of what we can do with the rest of database objects once `BLOBS` are also type binded in the future. That was not done in this PR to avoid too many changes. Note: The new type `VersionType` was added as a new file `version_type.h` because we are in a re-design process and I wasn't sure if it should be located in any other existing file that might be eligible to be deleted in the near future. Fixes: https://github.com/aquarist-labs/s3gw/issues/480 Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
Changes the db schema following [this ADR](https://github.com/aquarist-labs/s3gw/pull/497). It also deletes repeated fields that we had in objects vs versioned_objects tables. Logic of `sfs` was intentionally not updated in this PR. (For example new field `version_type` is added to the schema, but still not used). Updates in logic, specially in versioning logic, will come in a future PR. New Features: **Sqliteorm type bindings for the following types.** * `uuid_d` * `ceph::real_time` * `rgw::sal::sfs::ObjetState` * `rgw::sal::sfs::VersionType` In general, any enum type that has a `LAST_VALUE` item assigned to the last valid value and that also starts with 0 is eligible to be binded to sqliteorm, as the type binding for enums is generic. For example: you can create an enum like: ```c++ enum class TestEnum { WHATEVER = 0, SOMETHING_ELSE, AND_THIS, LAST_VALUE = AND_THIS }; ``` and use it right away in sqliteorm. **Object has no explicit conversion code** As all the types in the `object` are compatible with `sqliteorm` we don't need extra conversion layer. This is a preview of what we can do with the rest of database objects once `BLOBS` are also type binded in the future. That was not done in this PR to avoid too many changes. Note: The new type `VersionType` was added as a new file `version_type.h` because we are in a re-design process and I wasn't sure if it should be located in any other existing file that might be eligible to be deleted in the near future. Fixes: https://github.com/aquarist-labs/s3gw/issues/480 Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
Fixes: https://github.com/aquarist-labs/s3gw/issues/672 References: - https://github.com/aquarist-labs/s3gw/issues/534 - https://github.com/aquarist-labs/s3gw/pull/497 Signed-off-by: Volker Theile <vtheile@suse.com>
…abled This is to prevent the user from disabling the versioning because right now you can go from: - unversioned to versioned any time. - versioned -> unversioned is not supported in S3. - versioned -> suspended versioned is not supported by us. Fixes: https://github.com/aquarist-labs/s3gw/issues/672 References: - https://github.com/aquarist-labs/s3gw/issues/534 - https://github.com/aquarist-labs/s3gw/pull/497 Signed-off-by: Volker Theile <vtheile@suse.com>
…abled This is to prevent the user from disabling the versioning because right now you can go from: - unversioned to versioned any time. - versioned -> unversioned is not supported in S3. - versioned -> suspended versioned is not supported by us. Fixes: https://github.com/aquarist-labs/s3gw/issues/672 References: - https://github.com/aquarist-labs/s3gw/issues/534 - https://github.com/aquarist-labs/s3gw/pull/497 Signed-off-by: Volker Theile <vtheile@suse.com>
Changes the db schema following [this ADR](https://github.com/aquarist-labs/s3gw/pull/497). It also deletes repeated fields that we had in objects vs versioned_objects tables. Logic of `sfs` was intentionally not updated in this PR. (For example new field `version_type` is added to the schema, but still not used). Updates in logic, specially in versioning logic, will come in a future PR. New Features: **Sqliteorm type bindings for the following types.** * `uuid_d` * `ceph::real_time` * `rgw::sal::sfs::ObjetState` * `rgw::sal::sfs::VersionType` In general, any enum type that has a `LAST_VALUE` item assigned to the last valid value and that also starts with 0 is eligible to be binded to sqliteorm, as the type binding for enums is generic. For example: you can create an enum like: ```c++ enum class TestEnum { WHATEVER = 0, SOMETHING_ELSE, AND_THIS, LAST_VALUE = AND_THIS }; ``` and use it right away in sqliteorm. **Object has no explicit conversion code** As all the types in the `object` are compatible with `sqliteorm` we don't need extra conversion layer. This is a preview of what we can do with the rest of database objects once `BLOBS` are also type binded in the future. That was not done in this PR to avoid too many changes. Note: The new type `VersionType` was added as a new file `version_type.h` because we are in a re-design process and I wasn't sure if it should be located in any other existing file that might be eligible to be deleted in the near future. Fixes: https://github.com/aquarist-labs/s3gw/issues/480 Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
Changes the db schema following [this ADR](https://github.com/aquarist-labs/s3gw/pull/497). It also deletes repeated fields that we had in objects vs versioned_objects tables. Logic of `sfs` was intentionally not updated in this PR. (For example new field `version_type` is added to the schema, but still not used). Updates in logic, specially in versioning logic, will come in a future PR. New Features: **Sqliteorm type bindings for the following types.** * `uuid_d` * `ceph::real_time` * `rgw::sal::sfs::ObjetState` * `rgw::sal::sfs::VersionType` In general, any enum type that has a `LAST_VALUE` item assigned to the last valid value and that also starts with 0 is eligible to be binded to sqliteorm, as the type binding for enums is generic. For example: you can create an enum like: ```c++ enum class TestEnum { WHATEVER = 0, SOMETHING_ELSE, AND_THIS, LAST_VALUE = AND_THIS }; ``` and use it right away in sqliteorm. **Object has no explicit conversion code** As all the types in the `object` are compatible with `sqliteorm` we don't need extra conversion layer. This is a preview of what we can do with the rest of database objects once `BLOBS` are also type binded in the future. That was not done in this PR to avoid too many changes. Note: The new type `VersionType` was added as a new file `version_type.h` because we are in a re-design process and I wasn't sure if it should be located in any other existing file that might be eligible to be deleted in the near future. Fixes: https://github.com/aquarist-labs/s3gw/issues/480 Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
Changes the db schema following [this ADR](https://github.com/aquarist-labs/s3gw/pull/497). It also deletes repeated fields that we had in objects vs versioned_objects tables. Logic of `sfs` was intentionally not updated in this PR. (For example new field `version_type` is added to the schema, but still not used). Updates in logic, specially in versioning logic, will come in a future PR. New Features: **Sqliteorm type bindings for the following types.** * `uuid_d` * `ceph::real_time` * `rgw::sal::sfs::ObjetState` * `rgw::sal::sfs::VersionType` In general, any enum type that has a `LAST_VALUE` item assigned to the last valid value and that also starts with 0 is eligible to be binded to sqliteorm, as the type binding for enums is generic. For example: you can create an enum like: ```c++ enum class TestEnum { WHATEVER = 0, SOMETHING_ELSE, AND_THIS, LAST_VALUE = AND_THIS }; ``` and use it right away in sqliteorm. **Object has no explicit conversion code** As all the types in the `object` are compatible with `sqliteorm` we don't need extra conversion layer. This is a preview of what we can do with the rest of database objects once `BLOBS` are also type binded in the future. That was not done in this PR to avoid too many changes. Note: The new type `VersionType` was added as a new file `version_type.h` because we are in a re-design process and I wasn't sure if it should be located in any other existing file that might be eligible to be deleted in the near future. Fixes: https://github.com/aquarist-labs/s3gw/issues/480 Signed-off-by: Xavi Garcia <xavi.garcia@suse.com>
Signed-off-by: Marcel Lauhoff marcel.lauhoff@suse.com