-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce CqlTimeuuid
type
#894
Conversation
let timeuuid = match CqlTimeuuid::from_slice(buf) { | ||
Ok(timeuuid) => timeuuid, | ||
Err(e) => { | ||
return Err(ParseError::BadIncomingData(format!( | ||
"Timeuuid parsing failed: {e}" | ||
))) | ||
} | ||
}; | ||
CqlValue::Timeuuid(timeuuid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please verify whether Scylla (and Cassandra) validates timeuuids like this, i.e. whether it allows inserting uuids that are not necessarily timeuuids into timeuuid column. If they are then perhaps we can do this validation here as well, otherwise I would abstain from it so that users don't run into a situation where they cannot read data they inserted earlier or via another driver.
Perhaps it is safer not to validate here anyway - even if the DBs properly validate right now, just in case if validation breaks in Scylla/Cassandra at some future release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please verify whether Scylla (and Cassandra) validates timeuuids like this, i.e. whether it allows inserting uuids that are not necessarily timeuuids into timeuuid column.
A simple test case shows that Scylla validates whether uuid inserted to timeuuid column is a valid timeuuid.
#[tokio::test]
async fn test_invalid_timeuuid() {
let session: Session = create_new_session_builder().build().await.unwrap();
let ks = unique_keyspace_name();
session
.query(
format!(
"CREATE KEYSPACE IF NOT EXISTS {} WITH REPLICATION = \
{{'class' : 'NetworkTopologyStrategy', 'replication_factor' : 1}}",
ks
),
&[],
)
.await
.unwrap();
session.use_keyspace(ks, false).await.unwrap();
session
.query(
"CREATE TABLE tab (p int, t timeuuid, PRIMARY KEY (p, t))",
(),
)
.await
.unwrap();
// Invalid timeuuid (version 2 uuid)
let invalid_timeuuid = uuid::uuid!("00000000-0000-2000-8080-808080808080");
session
.query("INSERT INTO tab (p, t) VALUES (0, ?)", (invalid_timeuuid,))
.await
.unwrap();
}
The result:
thread 'transport::cql_types_test::test_invalid_timeuuid' panicked at scylla/src/transport/cql_types_test.rs:1731:10:
called `Result::unwrap()` on an `Err` value: DbError(Invalid, "Exception while binding column t: marshaling error: Unsupported UUID version (2)")
Perhaps it is safer not to validate here anyway - even if the DBs properly validate right now, just in case if validation breaks in Scylla/Cassandra at some future release.
But I agree, I'll get rid of the validation part in CqlTimeuuid
constructors i.e. replace
impl TryFrom<Uuid> for CqlTimeuuid {
type Error = InvalidTimeuuidVersion;
fn try_from(value: Uuid) -> Result<Self, Self::Error> {
let version = value.get_version_num();
if version != 1 {
return Err(InvalidTimeuuidVersion(version));
}
Ok(Self(value))
}
}
with
impl From<Uuid> for CqlTimeuuid { ... }
b930435
to
5af99f6
Compare
v2:
|
let sorted_timeuuid_vals: Vec<CqlTimeuuid> = vec![ | ||
CqlTimeuuid::from_str("00000000-0000-1000-8080-808080808080").unwrap(), | ||
CqlTimeuuid::from_str("00000000-0000-1000-ffff-ffffffffffff").unwrap(), | ||
CqlTimeuuid::from_str("00000000-0000-1000-0000-000000000000").unwrap(), | ||
CqlTimeuuid::from_str("fed35080-0efb-11ee-a1ca-00006490e9a4").unwrap(), | ||
CqlTimeuuid::from_str("00000257-0efc-11ee-9547-00006490e9a6").unwrap(), | ||
CqlTimeuuid::from_str("ffffffff-ffff-1fff-ffff-ffffffffffef").unwrap(), | ||
CqlTimeuuid::from_str("ffffffff-ffff-1fff-ffff-ffffffffffff").unwrap(), | ||
CqlTimeuuid::from_str("ffffffff-ffff-1fff-0000-000000000000").unwrap(), | ||
CqlTimeuuid::from_str("ffffffff-ffff-1fff-7f7f-7f7f7f7f7f7f").unwrap(), | ||
]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, if we test sorting, it would be good to have unsorted data. You can use https://docs.rs/rand/latest/rand/seq/trait.SliceRandom.html#tymethod.shuffle to avoid having 2 literals in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes the test non-deterministic... If one of the 362880 possible orderings results in an incorrectly sorted sequence then you won't know which one failed. I think we should either use a fixed seed or use a random one but print it during the test so that we can reproduce the results.
I also think that we should test multiple orderings in one run with sort()
- let's say several thousand times, so that this part of the test doesn't run longer than a couple milliseconds. Of course, it's unpractical to insert this data to ScyllaDB for each of the orderings, but I don't think it matters because we can assume that ScyllaDB's ordering works correctly - inserting this data in any order only once and then verifying that it matches our expected order should be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also use a random generator with known algorithm - std::thread_rng
uses an unspecified algorithm which can probably change in the future. We already have tests with fixed pseudorandomness - for example, see partitioners_output_same_result_no_matter_how_input_is_partitioned
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, 362880
is not a huge number. Perhaps checking all the permutations would not take much time (let's aim under 100ms)? We can iterate over permutations using Itertools::permutations()
.
scylla-cql/src/frame/value.rs
Outdated
// Compare two values of timeuuid type. | ||
// Cassandra legacy requires: | ||
// - using signed compare for least significant bits. | ||
// - masking off UUID version during compare, to | ||
// treat possible non-version-1 UUID the same way as UUID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be great to have this information in the docstring for CqlTimeuuid
. The custom behaviors of CqlTimeuuid
make it something more than a mere wrapper over Uuid
, so I think it needs to be properly explained in the documentation.
let sorted_timeuuid_vals: Vec<CqlTimeuuid> = vec![ | ||
CqlTimeuuid::from_str("00000000-0000-1000-8080-808080808080").unwrap(), | ||
CqlTimeuuid::from_str("00000000-0000-1000-ffff-ffffffffffff").unwrap(), | ||
CqlTimeuuid::from_str("00000000-0000-1000-0000-000000000000").unwrap(), | ||
CqlTimeuuid::from_str("fed35080-0efb-11ee-a1ca-00006490e9a4").unwrap(), | ||
CqlTimeuuid::from_str("00000257-0efc-11ee-9547-00006490e9a6").unwrap(), | ||
CqlTimeuuid::from_str("ffffffff-ffff-1fff-ffff-ffffffffffef").unwrap(), | ||
CqlTimeuuid::from_str("ffffffff-ffff-1fff-ffff-ffffffffffff").unwrap(), | ||
CqlTimeuuid::from_str("ffffffff-ffff-1fff-0000-000000000000").unwrap(), | ||
CqlTimeuuid::from_str("ffffffff-ffff-1fff-7f7f-7f7f7f7f7f7f").unwrap(), | ||
]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes the test non-deterministic... If one of the 362880 possible orderings results in an incorrectly sorted sequence then you won't know which one failed. I think we should either use a fixed seed or use a random one but print it during the test so that we can reproduce the results.
I also think that we should test multiple orderings in one run with sort()
- let's say several thousand times, so that this part of the test doesn't run longer than a couple milliseconds. Of course, it's unpractical to insert this data to ScyllaDB for each of the orderings, but I don't think it matters because we can assume that ScyllaDB's ordering works correctly - inserting this data in any order only once and then verifying that it matches our expected order should be enough.
Introduced a `CqlTimeuuid` type which is a wrapper for v1 uuids.
By default, uuids are compared in lexicographical order. However, scylla (and cassandra) use different comparison semantics for timeuuids (v1 uuids). Timeuuids contain an encoded timestamp within 8 most-significant bytes. Scylla compares timeuuids in a following manner: - compare the timestamps retrieved from 8 msb - if timestamps are equal, perform a signed comparison of 8 lsb
Before this commit, users were able to map native timeuuid type to rust `uuid::Uuid` type and vice-versa. After this change, users are only allowed to use `value::CqlTimeuuid` type to work with scylla's native timeuuid type. Changes: - timeuuid test needed to be adjusted, so it maps columns to `CqlTimeuuid` instead of `uuid::Uuid` - TracingEvent struct needs to be adjusted as `event_id` column of `system_traces.events` table is of `timeuuid` type
v3:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some nitpicks, but they aren't blocking the merge - can be fixed later. LGTM.
self.lsb_signed().hash(state); | ||
self.msb().hash(state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: the lsb_signed()
and msb()
do some unnecessary work - the former performs a XOR, and the latter shuffles bytes around. It would be sufficient to take the bytes as they are, AND one of the bytes (like msb()
does now) and compare. Same goes for the PartialEq
implementation.
These are micro-optimizations with dubious performance impact, so it's fine to implement them in a followup (or never).
]; | ||
|
||
// Generate all permutations. | ||
let perms = Itertools::permutations(sorted_timeuuid_vals.iter(), sorted_timeuuid_vals.len()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Itertools
is an extention trait for iterators. While what you wrote works, you are supposed to use it on the iterators directly:
let perms = sorted_timeuuid_vals
.iter()
.permutations(sorted_timeuuid_vals.len())
.collect::<Vec<_>>();
// Generate all permutations. | ||
let perms = Itertools::permutations(sorted_timeuuid_vals.iter(), sorted_timeuuid_vals.len()) | ||
.collect::<Vec<_>>(); | ||
// Ensure that all of the permutations were generated. | ||
assert_eq!(362880, perms.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: the perms
are going to take a few tens of megabytes of memory: an allocation of 362880 * 3 * 8 = ~8.3MB
of contiguous memory for the direct storage of perms
alone (but may be up to twice as much due to vec exponential reallocation) and 362880 * 9 * 16 = ~50MB
for the timeuuids alone (spread over 362880
individual allocations).
While this is not very much for today's standards, you need to take into account that many test cases can run in parallel, so this memory usage adds up - and it's a good practice in general to be mindful of those allocations.
In this case, it could have been easily and completely avoided by not collecting the permutations
iterator but rather consuming it in a loop.
Again, not a huge problem and not blocking the merge in my opinion. Can be fixed in a followup.
Fixes: #749
Changes
CqlTimeuuid
type as a wrapper for version 1 uuidsuuid::Uuid
delegate methods forCqlTimeuuid
CqlTimeuuid
type so it has the same semantics as Scylla/CassandraCqlValue::Timeuuid
now wrapsCqlTimeuuid
CqlTimeuuid
timeuuid
<->uuid::Uuid
mappingPre-review checklist
./docs/source/
.Fixes:
annotations to PR description.