Introduce `CqlTimeuuid` type #894

muzarski · 2023-12-22T07:13:25Z

Fixes: #749

Changes

Introduced CqlTimeuuid type as a wrapper for version 1 uuids
Implemented uuid::Uuid delegate methods for CqlTimeuuid
Implemented strong ordering on CqlTimeuuid type so it has the same semantics as Scylla/Cassandra
CqlValue::Timeuuid now wraps CqlTimeuuid
Implemented serialization and deserialization logic for CqlTimeuuid
Got rid of timeuuid <-> uuid::Uuid mapping

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

scylla/src/transport/cql_types_test.rs

docs/source/data-types/data-types.md

piodul · 2024-01-08T09:47:25Z

scylla-cql/src/frame/response/result.rs

+            let timeuuid = match CqlTimeuuid::from_slice(buf) {
+                Ok(timeuuid) => timeuuid,
+                Err(e) => {
+                    return Err(ParseError::BadIncomingData(format!(
+                        "Timeuuid parsing failed: {e}"
+                    )))
+                }
+            };
+            CqlValue::Timeuuid(timeuuid)


Please verify whether Scylla (and Cassandra) validates timeuuids like this, i.e. whether it allows inserting uuids that are not necessarily timeuuids into timeuuid column. If they are then perhaps we can do this validation here as well, otherwise I would abstain from it so that users don't run into a situation where they cannot read data they inserted earlier or via another driver.

Perhaps it is safer not to validate here anyway - even if the DBs properly validate right now, just in case if validation breaks in Scylla/Cassandra at some future release.

Please verify whether Scylla (and Cassandra) validates timeuuids like this, i.e. whether it allows inserting uuids that are not necessarily timeuuids into timeuuid column.

A simple test case shows that Scylla validates whether uuid inserted to timeuuid column is a valid timeuuid.

#[tokio::test] async fn test_invalid_timeuuid() { let session: Session = create_new_session_builder().build().await.unwrap(); let ks = unique_keyspace_name(); session .query( format!( "CREATE KEYSPACE IF NOT EXISTS {} WITH REPLICATION = \ {{'class' : 'NetworkTopologyStrategy', 'replication_factor' : 1}}", ks ), &[], ) .await .unwrap(); session.use_keyspace(ks, false).await.unwrap(); session .query( "CREATE TABLE tab (p int, t timeuuid, PRIMARY KEY (p, t))", (), ) .await .unwrap(); // Invalid timeuuid (version 2 uuid) let invalid_timeuuid = uuid::uuid!("00000000-0000-2000-8080-808080808080"); session .query("INSERT INTO tab (p, t) VALUES (0, ?)", (invalid_timeuuid,)) .await .unwrap(); }

The result:

thread 'transport::cql_types_test::test_invalid_timeuuid' panicked at scylla/src/transport/cql_types_test.rs:1731:10: called `Result::unwrap()` on an `Err` value: DbError(Invalid, "Exception while binding column t: marshaling error: Unsupported UUID version (2)")

Perhaps it is safer not to validate here anyway - even if the DBs properly validate right now, just in case if validation breaks in Scylla/Cassandra at some future release.

But I agree, I'll get rid of the validation part in CqlTimeuuid constructors i.e. replace

impl TryFrom<Uuid> for CqlTimeuuid { type Error = InvalidTimeuuidVersion; fn try_from(value: Uuid) -> Result<Self, Self::Error> { let version = value.get_version_num(); if version != 1 { return Err(InvalidTimeuuidVersion(version)); } Ok(Self(value)) } }

with

impl From<Uuid> for CqlTimeuuid { ... }

muzarski · 2024-01-11T12:58:57Z

v2:

rebased on top of master
fixed review comments

scylla-cql/src/frame/response/result.rs

scylla-cql/src/frame/value_tests.rs

Lorak-mmk · 2024-01-16T15:08:46Z

scylla/src/transport/cql_types_test.rs

+    let sorted_timeuuid_vals: Vec<CqlTimeuuid> = vec![
+        CqlTimeuuid::from_str("00000000-0000-1000-8080-808080808080").unwrap(),
+        CqlTimeuuid::from_str("00000000-0000-1000-ffff-ffffffffffff").unwrap(),
+        CqlTimeuuid::from_str("00000000-0000-1000-0000-000000000000").unwrap(),
+        CqlTimeuuid::from_str("fed35080-0efb-11ee-a1ca-00006490e9a4").unwrap(),
+        CqlTimeuuid::from_str("00000257-0efc-11ee-9547-00006490e9a6").unwrap(),
+        CqlTimeuuid::from_str("ffffffff-ffff-1fff-ffff-ffffffffffef").unwrap(),
+        CqlTimeuuid::from_str("ffffffff-ffff-1fff-ffff-ffffffffffff").unwrap(),
+        CqlTimeuuid::from_str("ffffffff-ffff-1fff-0000-000000000000").unwrap(),
+        CqlTimeuuid::from_str("ffffffff-ffff-1fff-7f7f-7f7f7f7f7f7f").unwrap(),
+    ];


Again, if we test sorting, it would be good to have unsorted data. You can use https://docs.rs/rand/latest/rand/seq/trait.SliceRandom.html#tymethod.shuffle to avoid having 2 literals in the code.

This makes the test non-deterministic... If one of the 362880 possible orderings results in an incorrectly sorted sequence then you won't know which one failed. I think we should either use a fixed seed or use a random one but print it during the test so that we can reproduce the results.

I also think that we should test multiple orderings in one run with sort() - let's say several thousand times, so that this part of the test doesn't run longer than a couple milliseconds. Of course, it's unpractical to insert this data to ScyllaDB for each of the orderings, but I don't think it matters because we can assume that ScyllaDB's ordering works correctly - inserting this data in any order only once and then verifying that it matches our expected order should be enough.

We should also use a random generator with known algorithm - std::thread_rng uses an unspecified algorithm which can probably change in the future. We already have tests with fixed pseudorandomness - for example, see partitioners_output_same_result_no_matter_how_input_is_partitioned.

Also, 362880 is not a huge number. Perhaps checking all the permutations would not take much time (let's aim under 100ms)? We can iterate over permutations using Itertools::permutations().

scylla-cql/src/frame/value.rs

piodul · 2024-01-24T21:31:52Z

scylla-cql/src/frame/value.rs

+        // Compare two values of timeuuid type.
+        // Cassandra legacy requires:
+        // - using signed compare for least significant bits.
+        // - masking off UUID version during compare, to
+        //   treat possible non-version-1 UUID the same way as UUID.


I think it would be great to have this information in the docstring for CqlTimeuuid. The custom behaviors of CqlTimeuuid make it something more than a mere wrapper over Uuid, so I think it needs to be properly explained in the documentation.

piodul · 2024-01-24T22:01:48Z

scylla/src/transport/cql_types_test.rs

+    let sorted_timeuuid_vals: Vec<CqlTimeuuid> = vec![
+        CqlTimeuuid::from_str("00000000-0000-1000-8080-808080808080").unwrap(),
+        CqlTimeuuid::from_str("00000000-0000-1000-ffff-ffffffffffff").unwrap(),
+        CqlTimeuuid::from_str("00000000-0000-1000-0000-000000000000").unwrap(),
+        CqlTimeuuid::from_str("fed35080-0efb-11ee-a1ca-00006490e9a4").unwrap(),
+        CqlTimeuuid::from_str("00000257-0efc-11ee-9547-00006490e9a6").unwrap(),
+        CqlTimeuuid::from_str("ffffffff-ffff-1fff-ffff-ffffffffffef").unwrap(),
+        CqlTimeuuid::from_str("ffffffff-ffff-1fff-ffff-ffffffffffff").unwrap(),
+        CqlTimeuuid::from_str("ffffffff-ffff-1fff-0000-000000000000").unwrap(),
+        CqlTimeuuid::from_str("ffffffff-ffff-1fff-7f7f-7f7f7f7f7f7f").unwrap(),
+    ];


This makes the test non-deterministic... If one of the 362880 possible orderings results in an incorrectly sorted sequence then you won't know which one failed. I think we should either use a fixed seed or use a random one but print it during the test so that we can reproduce the results.

I also think that we should test multiple orderings in one run with sort() - let's say several thousand times, so that this part of the test doesn't run longer than a couple milliseconds. Of course, it's unpractical to insert this data to ScyllaDB for each of the orderings, but I don't think it matters because we can assume that ScyllaDB's ordering works correctly - inserting this data in any order only once and then verifying that it matches our expected order should be enough.

Introduced a `CqlTimeuuid` type which is a wrapper for v1 uuids.

By default, uuids are compared in lexicographical order. However, scylla (and cassandra) use different comparison semantics for timeuuids (v1 uuids). Timeuuids contain an encoded timestamp within 8 most-significant bytes. Scylla compares timeuuids in a following manner: - compare the timestamps retrieved from 8 msb - if timestamps are equal, perform a signed comparison of 8 lsb

Before this commit, users were able to map native timeuuid type to rust `uuid::Uuid` type and vice-versa. After this change, users are only allowed to use `value::CqlTimeuuid` type to work with scylla's native timeuuid type. Changes: - timeuuid test needed to be adjusted, so it maps columns to `CqlTimeuuid` instead of `uuid::Uuid` - TracingEvent struct needs to be adjusted as `event_id` column of `system_traces.events` table is of `timeuuid` type

muzarski · 2024-01-25T14:23:05Z

v3:

rebased on top of master
addressed review comments

piodul

Left some nitpicks, but they aren't blocking the merge - can be fixed later. LGTM.

piodul · 2024-01-25T16:50:09Z

scylla-cql/src/frame/value.rs

+        self.lsb_signed().hash(state);
+        self.msb().hash(state);


Nit: the lsb_signed() and msb() do some unnecessary work - the former performs a XOR, and the latter shuffles bytes around. It would be sufficient to take the bytes as they are, AND one of the bytes (like msb() does now) and compare. Same goes for the PartialEq implementation.

These are micro-optimizations with dubious performance impact, so it's fine to implement them in a followup (or never).

piodul · 2024-01-26T07:33:22Z

scylla/src/transport/cql_types_test.rs

+    ];
+
+    // Generate all permutations.
+    let perms = Itertools::permutations(sorted_timeuuid_vals.iter(), sorted_timeuuid_vals.len())


Nit: Itertools is an extention trait for iterators. While what you wrote works, you are supposed to use it on the iterators directly:

let perms = sorted_timeuuid_vals .iter() .permutations(sorted_timeuuid_vals.len()) .collect::<Vec<_>>();

piodul · 2024-01-26T07:43:27Z

scylla/src/transport/cql_types_test.rs

+    // Generate all permutations.
+    let perms = Itertools::permutations(sorted_timeuuid_vals.iter(), sorted_timeuuid_vals.len())
+        .collect::<Vec<_>>();
+    // Ensure that all of the permutations were generated.
+    assert_eq!(362880, perms.len());


Nit: the perms are going to take a few tens of megabytes of memory: an allocation of 362880 * 3 * 8 = ~8.3MB of contiguous memory for the direct storage of perms alone (but may be up to twice as much due to vec exponential reallocation) and 362880 * 9 * 16 = ~50MB for the timeuuids alone (spread over 362880 individual allocations).

While this is not very much for today's standards, you need to take into account that many test cases can run in parallel, so this memory usage adds up - and it's a good practice in general to be mindful of those allocations.

In this case, it could have been easily and completely avoided by not collecting the permutations iterator but rather consuming it in a loop.

Again, not a huge problem and not blocking the merge in my opinion. Can be fixed in a followup.

piodul requested changes Jan 8, 2024

View reviewed changes

muzarski force-pushed the timeuuid branch 3 times, most recently from b930435 to 5af99f6 Compare January 10, 2024 15:39

muzarski requested a review from piodul January 10, 2024 15:49

avelanarius assigned muzarski Jan 15, 2024

avelanarius added this to the 0.12.0 milestone Jan 15, 2024

Lorak-mmk requested changes Jan 16, 2024

View reviewed changes

muzarski force-pushed the timeuuid branch from 5af99f6 to 4f79f5e Compare January 19, 2024 05:58

muzarski requested a review from Lorak-mmk January 19, 2024 05:59

Lorak-mmk added the API-breaking This might introduce incompatible API changes label Jan 22, 2024

Lorak-mmk approved these changes Jan 22, 2024

View reviewed changes

piodul requested changes Jan 24, 2024

View reviewed changes

muzarski added 10 commits January 25, 2024 14:32

scylla-cql: introduce CqlTimeuuid type

e0c35c1

Introduced a `CqlTimeuuid` type which is a wrapper for v1 uuids.

timeuuid: implement Hash for CqlTimeuuid

8d81a6b

tests: add test for timeuuid ordering properties

abc8d28

timeuuid: add missing delegate methods

9e2ac1f

timeuuid: implement fmt::Display for CqlTimeuuid

6b870a3

cql: wrap CqlTimeuuid in CqlValue::Timeuuid

d11e8c3

tests: add timeuuid runtime ordering test

0db6583

docs: update uuid/timeuuid info

1958216

muzarski force-pushed the timeuuid branch from 4f79f5e to 1958216 Compare January 25, 2024 14:21

muzarski requested a review from piodul January 25, 2024 14:26

piodul approved these changes Jan 26, 2024

View reviewed changes

piodul merged commit d01b473 into scylladb:main Jan 26, 2024
9 checks passed

piodul mentioned this pull request Jan 26, 2024

Hide public usages of num_bigint::BigInt behind a feature flag #902

Merged

8 tasks

Lorak-mmk mentioned this pull request Feb 1, 2024

Release 0.12 #924

Closed

muzarski deleted the timeuuid branch October 29, 2024 14:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce `CqlTimeuuid` type #894

Introduce `CqlTimeuuid` type #894

muzarski commented Dec 22, 2023

piodul Jan 8, 2024

muzarski Jan 10, 2024

muzarski commented Jan 11, 2024

Lorak-mmk Jan 16, 2024

piodul Jan 24, 2024

piodul Jan 24, 2024 •

edited

Loading

piodul Jan 24, 2024

piodul Jan 24, 2024

piodul Jan 24, 2024

muzarski commented Jan 25, 2024

piodul left a comment

piodul Jan 25, 2024

piodul Jan 26, 2024

piodul Jan 26, 2024

Introduce CqlTimeuuid type #894

Introduce CqlTimeuuid type #894

Conversation

muzarski commented Dec 22, 2023

Changes

Pre-review checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

muzarski commented Jan 11, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

piodul Jan 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

muzarski commented Jan 25, 2024

piodul left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Introduce `CqlTimeuuid` type #894

Introduce `CqlTimeuuid` type #894

piodul Jan 24, 2024 •

edited

Loading