re_datastore: component chunks & streamlining batches #584
Conversation
Force-pushed from c8637a2 to 02170b9:

Just doing some insane git acrobatics so that github generates the right diffs.
I think I need a walk-through to be able to properly review this
crates/re_arrow_store/src/store.rs (Outdated)

```rust
/// Has this bucket been retired yet?
///
/// At any given moment, all buckets except the currently active one have to be retired.
pub(crate) retired: bool,
```
This needs a description of what `retired` actually means.

From the PR description it sounds like it means that it is full, or that it is full and has also been compacted. If so, perhaps `full`, `compacted` or `!active` is a better name? "Retired" makes me think it is no longer in use. Or is this common DB vernacular?
"Retired" in this case means that it's been archived and is now read-only, so it'll only be used for the read path from now on.
Being retired implies a bunch of things, currently:
- the bucket is now read-only,
- the bucket has been compacted (its chunks have been concatenated),
- the bucket is inactive as far as the write path is concerned,
- and probably more in the future.
I definitely need to improve the doc there. As for the name... maybe `archived` then?
Added something along those lines to the doc-comment 🤞
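For reference, a minimal sketch of what the improved doc-comment could look like (the wording here is illustrative, not the actual committed code):

```rust
/// Whether this bucket has been retired.
///
/// A retired bucket is archived: its chunks have been compacted
/// (concatenated into a single one) and it is now read-only, i.e. it only
/// ever serves the read path from this point on. At any given moment, all
/// buckets except the currently active one are retired.
pub(crate) retired: bool,
```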
crates/re_arrow_store/src/store.rs (Outdated)

```rust
pub(crate) data: Box<dyn Array>,
/// All the data for this bucket: many rows of a single column.
///
/// Each chunk is a list of list of components, i.e. `ListArray<ListArray<StructArray>>`:
```
```diff
- /// Each chunk is a list of list of components, i.e. `ListArray<ListArray<StructArray>>`:
+ /// Each chunk is a list of list of components, i.e. `Vec<ListArray<StructArray>>`:
```
would match the actual type a bit closer, confusing me slightly less :)
The type of `Self::chunks` is `Vec<ListArray<ListArray<StructArray>>>`, but the type of each chunk is indeed exactly `ListArray<ListArray<StructArray>>`.
I agree with Emil here: a chunk is (and should be) still just a single `ListArray` of components. I don't think we should ever have a `ListArray<ListArray>`. A `ListArray` already has the double-list property you're describing:

```
| Component  |
+------------+
| [a, b, c]  |
| [d, e]     |
```
The array index corresponds to the different rows.
The list index within a row corresponds to the different instances within a single row.
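As a concrete illustration, here is a minimal, self-contained arrow2 sketch of that double-list property, with toy `i32` values standing in for actual component structs:

```rust
use arrow2::array::{Array, ListArray, MutableListArray, MutablePrimitiveArray, TryExtend};

fn main() {
    // Two rows of components, [a, b, c] and [d, e], as a single ListArray.
    let mut list = MutableListArray::<i32, MutablePrimitiveArray<i32>>::new();
    list.try_extend(vec![
        Some(vec![Some(1), Some(2), Some(3)]), // row 0: three instances
        Some(vec![Some(4), Some(5)]),          // row 1: two instances
    ])
    .unwrap();
    let list: ListArray<i32> = list.into();

    assert_eq!(list.len(), 2); // array index = rows
    assert_eq!(list.value(0).len(), 3); // list index = instances within a row
    assert_eq!(list.value(1).len(), 2);
}
```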
Oh yeah, re-reading this now I can see you're definitely both right; got my head all messed up earlier.

Will fix all these docs first thing tomorrow, thanks 👍
```rust
// So use the fact that `rows` is always of unit-length for now.
let rows_single = rows;
```
again, it's good to `debug_assert` that assumption
it's asserted when entering this method:

```rust
ensure!(
    nb_rows == 1,
    "we currently don't support more than one row per batch, as a `MsgBundle` can only \
    carry a single timepoint for now!"
);
ensure!(
    components
        .iter()
        .all(|bundle| bundle.value.len() == nb_rows),
    "all components across the bundle must share the same number of rows",
);
```
```rust
let row = rows_single
    .as_any()
    .downcast_ref::<ListArray<i32>>()
    .unwrap()
    .value(row_nr);
ComponentTable::new((*name).clone(), row.data_type())
```
I think you can do this using `get_child_type`, along the lines of:
```diff
- let row = rows_single
-     .as_any()
-     .downcast_ref::<ListArray<i32>>()
-     .unwrap()
-     .value(row_nr);
- ComponentTable::new((*name).clone(), row.data_type())
+ let child_type = ListArray::<i32>::get_child_type(rows_single.data_type());
+ ComponentTable::new((*name).clone(), child_type);
```
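For context, `get_child_type` only inspects the datatype, so no row ever needs to be materialized. A minimal sketch, where the `List<Int32>` datatype stands in for the real component type:

```rust
use arrow2::array::ListArray;
use arrow2::datatypes::{DataType, Field};

fn main() {
    // A `List<Int32>` datatype, standing in for `rows_single.data_type()`.
    let dt = DataType::List(Box::new(Field::new("item", DataType::Int32, true)));

    // Peels one `List<_>` layer off the datatype: no downcast, no row slicing.
    let child = ListArray::<i32>::get_child_type(&dt);
    assert_eq!(child, &DataType::Int32);
}
```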
ha! nice
Aside from a bit of doc comment cleanup this looks great to me!
Requires #590.
This PR introduces a further layer of subdivision in `ComponentBucket`s: chunks. Chunks hold raw component data exactly as it was inserted, allowing for zero-copy/zero-alloc insertion.
This makes the write-path actually viable once again, see benchmarks below.
While the different chunks live in non-contiguous memory, this actually matters very little in terms of performance for the read path: access across many rows is more often than not random anyway.
What matters is that memory within a single row is always contiguous, and that doesn't change.
More context in this Slack thread.
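To make the contiguity claim concrete, here is a hedged sketch (toy `i32` data standing in for real components): pulling a single row out of a chunk is a zero-copy slice into the chunk's contiguous values buffer.

```rust
use arrow2::array::{Array, Int32Array, ListArray, MutableListArray, MutablePrimitiveArray, TryExtend};

fn main() {
    // One chunk with a single row holding three instances.
    let mut list = MutableListArray::<i32, MutablePrimitiveArray<i32>>::new();
    list.try_extend(vec![Some(vec![Some(1), Some(2), Some(3)])])
        .unwrap();
    let list: ListArray<i32> = list.into();

    // `value` slices the contiguous child buffer; no data is copied.
    let row = list.value(0);
    let row = row.as_any().downcast_ref::<Int32Array>().unwrap();
    assert_eq!(row.values().as_slice(), &[1, 2, 3]);
}
```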
Still, there's an obvious opportunity for us to compact chunks and defragment memory: when buckets get retired, i.e. when they get full and we "archive them" in read-only mode.
This PR does exactly this, which means that only active buckets can have more than a single chunk.
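A minimal sketch of what that compaction-on-retirement could look like, using arrow2's `concatenate` (the `retire` helper here is illustrative, not the actual implementation):

```rust
use arrow2::array::Array;
use arrow2::compute::concatenate::concatenate;
use arrow2::error::Result;

/// Illustrative only: concatenate all chunks into a single contiguous one,
/// then flag the bucket as read-only.
fn retire(chunks: &mut Vec<Box<dyn Array>>, retired: &mut bool) -> Result<()> {
    if chunks.len() > 1 {
        let refs: Vec<&dyn Array> = chunks.iter().map(|chunk| chunk.as_ref()).collect();
        *chunks = vec![concatenate(&refs)?];
    }
    *retired = true; // from now on, only the read path touches this bucket
    Ok(())
}
```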
Finally, this PR uses the opportunity to streamline the "batch insertions" situation: currently, inserting `MsgBundle`s that contain more than one row per component is effectively undefined behaviour, as you'd get a single `RowIndex` for multiple rows. This PR now inserts each row of the batch one at a time, always creating chunks of length 1.

This effectively prepares the field for real batched insertions (which will first require `MsgBundle` to carry more than one timepoint) and for the upcoming clustering keys.

Bench