
re_datastore: component chunks & streamlining batches #584

Merged
merged 28 commits on Dec 19, 2022

Conversation


@teh-cmc teh-cmc commented Dec 16, 2022

Requires #590.

This PR introduces a further layer of subdivision in ComponentBuckets: chunks.

Chunks hold raw component data exactly as it was inserted, allowing for zero-copy/zero-alloc insertion.
This makes the write path actually viable once again; see the benchmarks below.

While the different chunks live in non-contiguous memory, this actually matters very little in terms of performance for the read path: access across many rows is more often than not random anyway.
What matters is that memory within a single row is always contiguous, and that doesn't change.
More context in this slack thread.

Still, there's an obvious opportunity for us to compact chunks and defragment memory: when buckets get retired, i.e. when they get full and we "archive them" in read-only mode.
This PR does exactly this, which means that only active buckets can have more than a single chunk.

Finally, this PR takes the opportunity to streamline the "batch insertions" situation: currently, inserting MsgBundles that contain more than one row per component is effectively undefined behaviour, as you'd get a single RowIndex for multiple rows.
This PR now inserts each row of the batch one at a time, always creating chunks of length 1.

This effectively prepares the field for real batched insertions (which will first require MsgBundle to carry more than one timepoint) and for the upcoming clustering keys.
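The chunk life-cycle described above can be sketched in plain Rust. All names here are illustrative stand-ins, not the actual re_arrow_store types: rows are appended as independent length-1 chunks (no concatenation on the write path), and retiring a bucket compacts its chunks into one contiguous chunk and makes it read-only.

```rust
/// Hypothetical miniature of a component bucket, for illustration only.
struct ComponentBucket {
    /// Each inner `Vec` is one chunk; a chunk holds one or more rows of
    /// raw component data, exactly as it was inserted.
    chunks: Vec<Vec<String>>,
    /// Retired buckets are read-only and fully compacted.
    retired: bool,
}

impl ComponentBucket {
    fn new() -> Self {
        Self { chunks: Vec::new(), retired: false }
    }

    /// Insert a row as its own length-1 chunk: zero-copy as far as the
    /// existing chunks are concerned, since nothing is concatenated.
    fn insert_row(&mut self, row: String) {
        assert!(!self.retired, "retired buckets are read-only");
        self.chunks.push(vec![row]);
    }

    /// Retire the bucket: concatenate all chunks into a single contiguous
    /// chunk and flip it to read-only.
    fn retire(&mut self) {
        let compacted: Vec<String> = self.chunks.drain(..).flatten().collect();
        self.chunks = vec![compacted];
        self.retired = true;
    }

    fn num_chunks(&self) -> usize {
        self.chunks.len()
    }
}
```

With this shape, only the active bucket can ever hold more than one chunk, which is exactly the invariant the PR establishes.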

Bench

datastore/batch/rects/insert
                        time:   [151.51 µs 151.69 µs 151.95 µs]
                        thrpt:  [65.813 Melem/s 65.923 Melem/s 66.002 Melem/s]
                 change:
                        time:   [-83.891% -83.791% -83.689%] (p = 0.00 < 0.05)
                        thrpt:  [+513.07% +516.93% +520.78%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe

datastore/batch/rects/query
                        time:   [439.35 ns 440.51 ns 441.87 ns]
                        thrpt:  [226.31 Melem/s 227.01 Melem/s 227.61 Melem/s]
                 change:
                        time:   [+2.3095% +2.6594% +2.9437%] (p = 0.00 < 0.05)
                        thrpt:  [-2.8595% -2.5905% -2.2573%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

@teh-cmc teh-cmc changed the title re_datastore: no concatenations on the write path re_datastore: introducing component bucket chunking Dec 17, 2022
@teh-cmc teh-cmc marked this pull request as ready for review December 17, 2022 23:17
teh-cmc added a commit that referenced this pull request Dec 18, 2022
@teh-cmc teh-cmc marked this pull request as draft December 18, 2022 12:21
@teh-cmc teh-cmc changed the title re_datastore: introducing component bucket chunking re_datastore: component chunks & streamlining batches Dec 18, 2022
@teh-cmc teh-cmc marked this pull request as ready for review December 18, 2022 13:55
@teh-cmc teh-cmc changed the base branch from main to cmc/datastore/get_a_single_row December 18, 2022 15:17
@teh-cmc teh-cmc force-pushed the cmc/datastore/get_rid_of_copies branch from c8637a2 to 02170b9 on December 18, 2022 15:36

teh-cmc commented Dec 18, 2022

Just doing some insane git acrobatics so that github generates the right diffs.
I'm sure it's gonna be fine, how could it go wrong?

teh-cmc added a commit that referenced this pull request Dec 18, 2022

@emilk emilk left a comment


I think I need a walk-through to be able to properly review this

/// Has this bucket been retired yet?
///
/// At any given moment, all buckets except the currently active one have to be retired.
pub(crate) retired: bool,
Member

This needs a description of what retired actually means.

From the PR description it sounds like it means that it is full, or that it is full and has also been compacted.

If so, perhaps full, compacted or !active is a better name? "Retired" makes me think it is no longer in use. Or is this common DB vernacular?

Member Author

"Retired" in this case means that it's been archived and is now read-only, so it'll only be used for the read path from now on.

Being retired implies a bunch of things, currently:

  • the bucket is now read-only,
  • the bucket has been compacted (its chunks have been concatenated),
  • the bucket is inactive as far as the write path is concerned,
  • and probably more in the future.

I definitely need to improve the doc there. As for the name... maybe archived then?

Member Author

Added something along those lines in the doc-comment 🤞

pub(crate) data: Box<dyn Array>,
/// All the data for this bucket: many rows of a single column.
///
/// Each chunk is a list of list of components, i.e. `ListArray<ListArray<StructArray>>`:
Member

Suggested change
/// Each chunk is a list of list of components, i.e. `ListArray<ListArray<StructArray>>`:
/// Each chunk is a list of list of components, i.e. `Vec<ListArray<StructArray>>`:

would match the actual type a bit closer, confusing me slightly less :)

Member Author

The type of Self::chunks is Vec<ListArray<ListArray<StructArray>>>, but the type of each chunk is indeed exactly ListArray<ListArray<StructArray>>.


@jleibs jleibs Dec 18, 2022


I agree with Emil here: a chunk is (and should be) still just a single ListArray of components. I don't think we should ever have a ListArray<ListArray>. A ListArray already has the double-list property you're describing:

| Component  |
+------------+
| [ a, b, c] |
| [ d, e]    |

The array index corresponds to the different rows.
The list index within a row corresponds to the different instances within a single row.
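That single level of nesting can be mimicked with plain Vecs. This is a hypothetical stand-in for `ListArray<StructArray>`, not the actual Arrow types: the outer index selects the row, and the inner index selects the instance within that row.

```rust
/// Hypothetical stand-in for a chunk: `Vec<Vec<_>>` playing the role of
/// `ListArray<StructArray>`. One level of nesting is enough.
fn nth_instance(chunk: &[Vec<char>], row: usize, instance: usize) -> Option<char> {
    // Outer index = row; inner index = instance within that row.
    chunk.get(row)?.get(instance).copied()
}
```

Using the table above, `chunk[0]` is `[a, b, c]` and `chunk[1]` is `[d, e]`, so `nth_instance(&chunk, 1, 1)` picks `e`: second instance of the second row.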


@teh-cmc teh-cmc Dec 18, 2022


Oh yeah no re-reading this now I can see you're definitely both right; got my head all messed up earlier.

Will fix all these docs first thing tomorrow thanks 👍

crates/re_arrow_store/src/store_read.rs (outdated, resolved)
Comment on lines +90 to +91
// So use the fact that `rows` is always of unit-length for now.
let rows_single = rows;
Member

again, it's good to debug_assert that assumption

Member Author

it's asserted when entering this method:

ensure!(
    nb_rows == 1,
    "we currently don't support more than one row per batch, as a `MsgBundle` can only \
        carry a single timepoint for now!"
);

ensure!(
    components
        .iter()
        .all(|bundle| bundle.value.len() == nb_rows),
    "all components across the bundle must share the same number of rows",
);

Base automatically changed from cmc/datastore/get_a_single_row to main December 18, 2022 18:03
Comment on lines 97 to 102
let row = rows_single
.as_any()
.downcast_ref::<ListArray<i32>>()
.unwrap()
.value(row_nr);
ComponentTable::new((*name).clone(), row.data_type())
Member

I think you can do this using get_child_type along the lines of:

Suggested change
let row = rows_single
.as_any()
.downcast_ref::<ListArray<i32>>()
.unwrap()
.value(row_nr);
ComponentTable::new((*name).clone(), row.data_type())
let child_type = ListArray::<i32>::get_child_type(rows_single.data_type());
ComponentTable::new((*name).clone(), child_type);
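The reason this works without downcasting: a list's element type is a property of the datatype, not of any particular row, so it can be read straight off the schema. A minimal sketch with a hypothetical DataType enum (not the actual arrow2 one):

```rust
// Hypothetical miniature of an Arrow datatype tree, for illustration only.
#[derive(Debug, PartialEq)]
enum DataType {
    Int64,
    List(Box<DataType>),
}

/// Mimics the idea behind `ListArray::<i32>::get_child_type`: peel one
/// level of `List` from the datatype, never touching the data itself.
fn get_child_type(dt: &DataType) -> &DataType {
    match dt {
        DataType::List(inner) => inner.as_ref(),
        other => other,
    }
}
```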

Member Author

ha! nice


@jleibs jleibs left a comment


Aside from a bit of doc comment cleanup this looks great to me!

@teh-cmc teh-cmc requested a review from emilk December 19, 2022 08:06
@teh-cmc teh-cmc linked an issue Dec 19, 2022 that may be closed by this pull request
@teh-cmc teh-cmc merged commit cea5b6b into main Dec 19, 2022
@teh-cmc teh-cmc deleted the cmc/datastore/get_rid_of_copies branch December 19, 2022 10:20
teh-cmc added a commit that referenced this pull request Dec 22, 2022
* get is supposed to return a row, not a [row]

* unwrap note

* the bench too

* self review

* doc test also

* and re_query ofc!

* slicing is _very_ slow, don't do it if you don't have to

* no more col_arrays in re_query

* there's actually no need for concatenating at all

* incrementally compute and cache bucket sizes

* cleaning up and documenting existing limitations

* introducing bucket retirement

* issue ref

* some more doc stuff

* self-review

* polars/fmt should always be there for tests

* streamlining batch support

* take list header into account

* it's fine

* self-review

* just something i want to keep around for later

* (un)wrapping lists is a bit slow... and slicing them is _extremely_ slow!

* merge cmc/datastore/get_a_single_row (#590)

* no more col_arrays in re_query

* introducing the notion of clustering key, thankfully breaking all tests by design

* making good use of that shiny new Instance component

* merge cmc/datastore/get_rid_of_copies (#584)

* missed one

* introducing arrow_util with is_dense_array()

* finding the clustering comp of the row... or creating it!

* rebasin'

* post rebase clean up

* addressing PR comments, I hope

* ensure that clustering components are properly sorted, failing the existing test suite

* build_instances now generate sorted ids, thus greenlighting the test suite

* missed a couple

* addressed PR comments

* going for the ArrayExt route

* completing the quadrifecta of checks

* the unavoidable typed error revolution is on its way, and it's early

* where we're going we don't need polars

* update everything for the new APIs

* error for unsupported clustering key types

* clean up and actually testing our error paths

* move those nasty internal tests into their own dirty corner

* finally some high-level tests in here

* i happen to like where this is going

* shuffling things

* demonstrating that implicit instances are somehow broken

* fully working implicit clustering keys, but demonstrating a sorting issue somewhere

* there is still something weird going on tho

* latest_at behaving as one would expect

* automatically cache generated cluster instances

* time to clean up en masse

* still want to put some stress on the bucketing

* make ArrayExt::is_dense a little more friendly, just in case...

* re_query: use polars/fmt in tests

* re_query: remove implicit instances

* fixing the u32 vs u64 instance drama

* cluster-aware polars helpers :>

* cleanin up tests

* continuing cleanup and doc

* updating visuals for this brave new world

* docs

* self-review

* bruh

* bruh...

* ...

* outdated comment

* no reason to search for it multiple times

* polars_helpers => polars_util for consistency's sake

* addressing PR comments and a couple other things

* xxx

* post-merge fixes

* more fixes
teh-cmc added a commit that referenced this pull request Jan 2, 2023
teh-cmc added a commit that referenced this pull request Jan 2, 2023
teh-cmc added a commit that referenced this pull request Jan 2, 2023
teh-cmc added a commit that referenced this pull request Jan 2, 2023
teh-cmc added a commit that referenced this pull request Jan 3, 2023

* updating visuals for this brave new world

* docs

* self-review

* bruh

* bruh...

* ...

* outdated comment

* no reason to search for it multiple times

* polars_helpers => polars_util for consistency's sake

* addressing PR comments and a couple other things

* xxx

* post-merge fixes

* TimeInt should be nohash

* high-level polar range tools + making first half of range impl pass

* implement the streaming half

* finally defeated all demons

* still passes?

* it looks like we've made it out alive

* polars util: join however you wish

* fixed formatting

* point2d's PoVs working as expected

* passing full ranges

* docs and such part 1, the semantics are hell

* Demonstrating how insanely slow the obvious solution is

datastore/insert/batch/rects/insert
            time:   [387.54 µs 387.98 µs 388.52 µs]
            thrpt:  [25.739 Melem/s 25.775 Melem/s 25.804 Melem/s]
     change:
            time:   [+227.27% +227.92% +228.56%] (p = 0.00 < 0.05)
            thrpt:  [-69.564% -69.505% -69.444%]
            Performance has regressed.

* it'd be a tiny bit better with some kind of splats...

datastore/insert/batch/rects/insert
            time:   [284.35 µs 284.55 µs 284.86 µs]
            thrpt:  [35.105 Melem/s 35.144 Melem/s 35.167 Melem/s]
     change:
            time:   [+137.45% +138.08% +138.52%] (p = 0.00 < 0.05)
            thrpt:  [-58.075% -57.997% -57.885%]
            Performance has regressed.

* and now with MsgId being a full fledged component

datastore/insert/batch/rects/insert
            time:   [180.84 µs 184.42 µs 188.96 µs]
            thrpt:  [52.920 Melem/s 54.225 Melem/s 55.296 Melem/s]
     change:
            time:   [+1.0072% +2.1236% +3.3206%] (p = 0.00 < 0.05)
            thrpt:  [-3.2139% -2.0795% -0.9972%]

* stuff

* bruh

* implement PoV-less, always-yield lower-level API + adapt higher-level one

* addressing PR comments

* ported to new low-level APIs

* xxx

* addressed PR comments

* self and not-so-self reviews

* the future is quite literally here

* some comments at the very least

* msgid standing on its own

* bruh
teh-cmc added a commit that referenced this pull request Jan 3, 2023
* implement timepoints-as-components and insert them automatically

* derive proper ObjectType TextEntry component

* introducing the TextEntry component

* re_viewer now supports both legacy and arrow data sources

* added pure rust example for text entries

* plugging everything on the python side

* self-review

* bruh

* it's decent, but we won't ever be able to deduplicate that way :(

* Revert "it's decent, but we won't ever be able to deduplicate that way :("

This reverts commit f0f4e00.

* storing msg metadata

* Revert "implement timepoints-as-components and insert them automatically"

This reverts commit e6a6fd5.

* I guess it's decent

* add MsgBundle helpers

* remove hardwired ref to Instance

* post-merge fixes

* cleanup

* guess that isnt very useful anymore

* using new msgbundle helpers

* updating my non-sensical msgbundle helpers

* explaining the example

* woops

* addressed PR comments
Successfully merging this pull request may close these issues.

re_datastore: get rid of immutable arrays in component buckets
4 participants