
re_datastore: component chunks & streamlining batches #584

Merged
merged 28 commits on Dec 19, 2022

Conversation


@teh-cmc teh-cmc commented Dec 16, 2022

Requires #590.

This PR introduces a further layer of subdivision in ComponentBuckets: chunks.

Chunks hold raw component data exactly as it was inserted, allowing for zero-copy/zero-alloc insertion.
This makes the write path actually viable once again; see the benchmarks below.

While the different chunks live in non-contiguous memory, this actually matters very little in terms of performance for the read path: access across many rows is more often than not random anyway.
What matters is that memory within a single row is always contiguous, and that doesn't change.
More context in this slack thread.

Still, there's an obvious opportunity for us to compact chunks and defragment memory: when buckets get retired, i.e. when they get full and we "archive them" in read-only mode.
This PR does exactly this, which means that only active buckets can have more than a single chunk.

Finally, this PR takes the opportunity to streamline the "batch insertions" situation: currently, inserting MsgBundles that contain more than one row per component is effectively undefined behaviour, as you'd get a single RowIndex for multiple rows.
This PR now inserts each row of the batch one at a time, always creating chunks of length 1.

This effectively prepares the field for real batched insertions (which will first require MsgBundle to carry more than one timepoint) and for the upcoming clustering keys.
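The chunk life-cycle described above can be sketched in plain Rust. All names here are illustrative stand-ins, not the actual re_arrow_store types: rows are appended as independent length-1 chunks (no concatenation on the write path), and retiring a bucket compacts its chunks into one contiguous chunk and makes it read-only.

```rust
/// Hypothetical miniature of a component bucket, for illustration only.
struct ComponentBucket {
    /// Each inner `Vec` is one chunk; a chunk holds one or more rows of
    /// raw component data, exactly as it was inserted.
    chunks: Vec<Vec<String>>,
    /// Retired buckets are read-only and fully compacted.
    retired: bool,
}

impl ComponentBucket {
    fn new() -> Self {
        Self { chunks: Vec::new(), retired: false }
    }

    /// Insert a row as its own length-1 chunk: zero-copy as far as the
    /// existing chunks are concerned, since nothing is concatenated.
    fn insert_row(&mut self, row: String) {
        assert!(!self.retired, "retired buckets are read-only");
        self.chunks.push(vec![row]);
    }

    /// Retire the bucket: concatenate all chunks into a single contiguous
    /// chunk and flip it to read-only.
    fn retire(&mut self) {
        let compacted: Vec<String> = self.chunks.drain(..).flatten().collect();
        self.chunks = vec![compacted];
        self.retired = true;
    }

    fn num_chunks(&self) -> usize {
        self.chunks.len()
    }
}
```

With this shape, only the active bucket can ever hold more than one chunk, which is exactly the invariant the PR establishes.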

Bench

datastore/batch/rects/insert
                        time:   [151.51 µs 151.69 µs 151.95 µs]
                        thrpt:  [65.813 Melem/s 65.923 Melem/s 66.002 Melem/s]
                 change:
                        time:   [-83.891% -83.791% -83.689%] (p = 0.00 < 0.05)
                        thrpt:  [+513.07% +516.93% +520.78%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe

datastore/batch/rects/query
                        time:   [439.35 ns 440.51 ns 441.87 ns]
                        thrpt:  [226.31 Melem/s 227.01 Melem/s 227.61 Melem/s]
                 change:
                        time:   [+2.3095% +2.6594% +2.9437%] (p = 0.00 < 0.05)
                        thrpt:  [-2.8595% -2.5905% -2.2573%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

@teh-cmc teh-cmc changed the title re_datastore: no concatenations on the write path re_datastore: introducing component bucket chunking Dec 17, 2022
@teh-cmc teh-cmc marked this pull request as ready for review December 17, 2022 23:17
teh-cmc added a commit that referenced this pull request Dec 18, 2022
@teh-cmc teh-cmc marked this pull request as draft December 18, 2022 12:21
@teh-cmc teh-cmc changed the title re_datastore: introducing component bucket chunking re_datastore: component chunks & streamlining batches Dec 18, 2022
@teh-cmc teh-cmc marked this pull request as ready for review December 18, 2022 13:55
@teh-cmc teh-cmc changed the base branch from main to cmc/datastore/get_a_single_row December 18, 2022 15:17
@teh-cmc teh-cmc force-pushed the cmc/datastore/get_rid_of_copies branch from c8637a2 to 02170b9 on December 18, 2022 15:36

teh-cmc commented Dec 18, 2022

Just doing some insane git acrobatics so that github generates the right diffs.
I'm sure it's gonna be fine, how could it go wrong?

teh-cmc added a commit that referenced this pull request Dec 18, 2022

@emilk emilk left a comment


I think I need a walk-through to be able to properly review this

/// Has this bucket been retired yet?
///
/// At any given moment, all buckets except the currently active one have to be retired.
pub(crate) retired: bool,
Member

This needs a description of what retired actually means.

From the PR description it sounds like it means that it is full, or that it is full and has also been compacted.

If so, perhaps full, compacted or !active is a better name? "Retired" makes me think it is no longer in use. Or is this common DB vernacular?

Member Author

"Retired" in this case means that it's been archived and is now read-only, so it'll only be used for the read path from now on.

Being retired implies a bunch of things, currently:

  • the bucket is now read-only,
  • the bucket has been compacted (its chunks have been concatenated),
  • the bucket is inactive as far as the write path is concerned,
  • and probably more in the future.

I definitely need to improve the doc there. As for the name... maybe archived then?

Member Author

Added something along those lines in the doc-comment 🤞

pub(crate) data: Box<dyn Array>,
/// All the data for this bucket: many rows of a single column.
///
/// Each chunk is a list of list of components, i.e. `ListArray<ListArray<StructArray>>`:
Member

Suggested change
/// Each chunk is a list of list of components, i.e. `ListArray<ListArray<StructArray>>`:
/// Each chunk is a list of list of components, i.e. `Vec<ListArray<StructArray>>`:

would match the actual type a bit closer, confusing me slightly less :)

Member Author

The type of Self::chunks is Vec<ListArray<ListArray<StructArray>>>, but the type of each chunk is indeed exactly ListArray<ListArray<StructArray>>.


@jleibs jleibs Dec 18, 2022


I agree with Emil here: a chunk is (and should be) still just a single ListArray of components. I don't think we should ever have a ListArray<ListArray>. A ListArray already has the double-list property you're describing:

| Component  |
+------------+
| [ a, b, c] |
| [ d, e]    |

The array index corresponds to the different rows.
The list index within a row corresponds to the different instances within a single row.
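That single level of nesting can be mimicked with plain Vecs. This is a hypothetical stand-in for `ListArray<StructArray>`, not the actual Arrow types: the outer index selects the row, and the inner index selects the instance within that row.

```rust
/// Hypothetical stand-in for a chunk: `Vec<Vec<_>>` playing the role of
/// `ListArray<StructArray>`. One level of nesting is enough.
fn nth_instance(chunk: &[Vec<char>], row: usize, instance: usize) -> Option<char> {
    // Outer index = row; inner index = instance within that row.
    chunk.get(row)?.get(instance).copied()
}
```

Using the table above, `chunk[0]` is `[a, b, c]` and `chunk[1]` is `[d, e]`, so `nth_instance(&chunk, 1, 1)` picks `e`: second instance of the second row.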


@teh-cmc teh-cmc Dec 18, 2022


Oh yeah no re-reading this now I can see you're definitely both right; got my head all messed up earlier.

Will fix all these docs first thing tomorrow thanks 👍

crates/re_arrow_store/src/store_read.rs (outdated, resolved)
Comment on lines +90 to +91
// So use the fact that `rows` is always of unit-length for now.
let rows_single = rows;
Member

again, it's good to debug_assert that assumption

Member Author

it's asserted when entering this method:

ensure!(
    nb_rows == 1,
    "we currently don't support more than one row per batch, as a `MsgBundle` can only \
        carry a single timepoint for now!"
);

ensure!(
    components
        .iter()
        .all(|bundle| bundle.value.len() == nb_rows),
    "all components across the bundle must share the same number of rows",
);

Base automatically changed from cmc/datastore/get_a_single_row to main December 18, 2022 18:03
Comment on lines 97 to 102
let row = rows_single
.as_any()
.downcast_ref::<ListArray<i32>>()
.unwrap()
.value(row_nr);
ComponentTable::new((*name).clone(), row.data_type())
Member

I think you can do this using get_child_type along the lines of:

Suggested change
let row = rows_single
.as_any()
.downcast_ref::<ListArray<i32>>()
.unwrap()
.value(row_nr);
ComponentTable::new((*name).clone(), row.data_type())
let child_type = ListArray::<i32>::get_child_type(rows_single.data_type());
ComponentTable::new((*name).clone(), child_type);
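The reason this works without downcasting: a list's element type is a property of the datatype, not of any particular row, so it can be read straight off the schema. A minimal sketch with a hypothetical DataType enum (not the actual arrow2 one):

```rust
// Hypothetical miniature of an Arrow datatype tree, for illustration only.
#[derive(Debug, PartialEq)]
enum DataType {
    Int64,
    List(Box<DataType>),
}

/// Mimics the idea behind `ListArray::<i32>::get_child_type`: peel one
/// level of `List` from the datatype, never touching the data itself.
fn get_child_type(dt: &DataType) -> &DataType {
    match dt {
        DataType::List(inner) => inner.as_ref(),
        other => other,
    }
}
```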

Member Author

ha! nice


@jleibs jleibs left a comment


Aside from a bit of doc comment cleanup this looks great to me!

@teh-cmc teh-cmc requested a review from emilk December 19, 2022 08:06
@teh-cmc teh-cmc linked an issue Dec 19, 2022 that may be closed by this pull request
@teh-cmc teh-cmc merged commit cea5b6b into main Dec 19, 2022
@teh-cmc teh-cmc deleted the cmc/datastore/get_rid_of_copies branch December 19, 2022 10:20
teh-cmc added a commit that referenced this pull request Dec 22, 2022
* get is supposed to return a row, not a [row]

* unwrap note

* the bench too

* self review

* doc test also

* and re_query ofc!

* slicing is _very_ slow, don't do it if you don't have to

* no more col_arrays in re_query

* there's actually no need for concatenating at all

* incrementally compute and cache bucket sizes

* cleaning up and documenting existing limitations

* introducing bucket retirement

* issue ref

* some more doc stuff

* self-review

* polars/fmt should always be there for tests

* streamlining batch support

* take list header into account

* it's fine

* self-review

* just something i want to keep around for later

* (un)wrapping lists is a bit slow... and slicing them is _extremely_ slow!

* merge cmc/datastore/get_a_single_row (#590)

* no more col_arrays in re_query

* introducing the notion of clustering key, thankfully breaking all tests by design

* making good use of that shiny new Instance component

* merge cmc/datastore/get_rid_of_copies (#584)

* missed one

* introducing arrow_util with is_dense_array()

* finding the clustering comp of the row... or creating it!

* rebasin'

* post rebase clean up

* addressing PR comments, I hope

* ensure that clustering components are properly sorted, failing the existing test suite

* build_instances now generate sorted ids, thus greenlighting the test suite

* missed a couple

* addressed PR comments

* going for the ArrayExt route

* completing the quadrifecta of checks

* the unavoidable typed error revolution is on its way, and it's early

* where we're going we don't need polars

* update everything for the new APIs

* error for unsupported clustering key types

* clean up and actually testing our error paths

* move those nasty internal tests into their own dirty corner

* finally some high-level tests in here

* i happen to like where this is going

* shuffling things

* demonstrating that implicit instances are somehow broken

* fully working implicit clustering keys, but demonstrating a sorting issue somewhere

* there is still something weird going on tho

* latest_at behaving as one would expect

* automatically cache generated cluster instances

* time to clean up en masse

* still want to put some stress on the bucketing

* make ArrayExt::is_dense a little more friendly, just in case...

* re_query: use polars/fmt in tests

* re_query: remove implicit instances

* fixing the u32 vs u64 instance drama

* cluster-aware polars helpers :>

* cleanin up tests

* continuing cleanup and doc

* updating visuals for this brave new world

* docs

* self-review

* bruh

* bruh...

* ...

* outdated comment

* no reason to search for it multiple times

* polars_helpers => polars_util for consistency's sake

* addressing PR comments and a couple other things

* xxx

* post-merge fixes

* more fixes
teh-cmc added a commit that referenced this pull request Jan 2, 2023
teh-cmc added a commit that referenced this pull request Jan 2, 2023
teh-cmc added a commit that referenced this pull request Jan 2, 2023
teh-cmc added a commit that referenced this pull request Jan 2, 2023
teh-cmc added a commit that referenced this pull request Jan 3, 2023

* updating visuals for this brave new world

* docs

* self-review

* bruh

* bruh...

* ...

* outdated comment

* no reason to search for it multiple times

* polars_helpers => polars_util for consistency's sake

* addressing PR comments and a couple other things

* xxx

* post-merge fixes

* TimeInt should be nohash

* high-level polar range tools + making first half of range impl pass

* implement the streaming half

* finally defeated all demons

* still passes?

* it looks like we've made it out alive

* polars util: join however you wish

* fixed formatting

* point2d's PoVs working as expected

* passing full ranges

* docs and such part 1, the semantics are hell

* Demonstrating how insanely slow the obvious solution is

datastore/insert/batch/rects/insert
            time:   [387.54 µs 387.98 µs 388.52 µs]
            thrpt:  [25.739 Melem/s 25.775 Melem/s 25.804 Melem/s]
     change:
            time:   [+227.27% +227.92% +228.56%] (p = 0.00 < 0.05)
            thrpt:  [-69.564% -69.505% -69.444%]
            Performance has regressed.

* it'd be a tiny bit better with some kind of splats...

datastore/insert/batch/rects/insert
            time:   [284.35 µs 284.55 µs 284.86 µs]
            thrpt:  [35.105 Melem/s 35.144 Melem/s 35.167 Melem/s]
     change:
            time:   [+137.45% +138.08% +138.52%] (p = 0.00 < 0.05)
            thrpt:  [-58.075% -57.997% -57.885%]
            Performance has regressed.

* and now with MsgId being a full fledged component

datastore/insert/batch/rects/insert
            time:   [180.84 µs 184.42 µs 188.96 µs]
            thrpt:  [52.920 Melem/s 54.225 Melem/s 55.296 Melem/s]
     change:
            time:   [+1.0072% +2.1236% +3.3206%] (p = 0.00 < 0.05)
            thrpt:  [-3.2139% -2.0795% -0.9972%]

* stuff

* bruh

* implement PoV-less, always-yield lower-level API + adapt higher-level one

* addressing PR comments

* ported to new low-level APIs

* xxx

* addressed PR comments

* self and not-so-self reviews

* the future is quite literally here

* some comments at the very least

* msgid standing on its own

* bruh
teh-cmc added a commit that referenced this pull request Jan 3, 2023
* implement timepoints-as-components and insert them automatically

* derive proper ObjectType TextEntry component

* introducing the TextEntry component

* re_viewer now supports both legacy and arrow data sources

* added pure rust example for text entries

* plugging everything on the python side

* self-review

* bruh

* it's decent, but we won't ever be able to deduplicate that way :(

* Revert "it's decent, but we won't ever be able to deduplicate that way :("

This reverts commit f0f4e00.

* storing msg metadata

* Revert "implement timepoints-as-components and insert them automatically"

This reverts commit e6a6fd5.

* I guess it's decent

* add MsgBundle helpers

* remove hardwired ref to Instance

* post-merge fixes

* cleanup

* guess that isnt very useful anymore

* using new msgbundle helpers

* updating my non-sensical msgbundle helpers

* explaining the example

* woops

* addressed PR comments
Successfully merging this pull request may close these issues.

re_datastore: get rid of immutable arrays in component buckets
4 participants