Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re_datastore: component bucketing & statistics #468

Merged
merged 17 commits into from
Dec 7, 2022

Conversation

teh-cmc
Copy link
Member

@teh-cmc teh-cmc commented Dec 5, 2022

This PR lays out the foundations for bucketing in general, and implements bucketing for components specifically.

Importantly, it specifies, in code, what we precisely mean when we ask "what is the size of this Arrow array?" through a series of tests: look for test_arrow_estimated_size_bytes.

It also computes and exposes the first statistics out of the datastore, as we need those for splitting the buckets to start with.

Integration benchmarks:

datastore/batch/rects/insert
                        time:   [959.38 µs 962.37 µs 965.82 µs]
                        thrpt:  [10.354 Melem/s 10.391 Melem/s 10.423 Melem/s]
                 change:
                        time:   [+3.4968% +4.5792% +5.7706%] (p = 0.00 < 0.05)
                        thrpt:  [-5.4558% -4.3787% -3.3786%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

datastore/batch/rects/query
                        time:   [11.437 µs 11.593 µs 11.790 µs]
                        thrpt:  [8.4815 Melem/s 8.6256 Melem/s 8.7439 Melem/s]
                 change:
                        time:   [-15.342% -13.558% -11.736%] (p = 0.00 < 0.05)
                        thrpt:  [+13.297% +15.684% +18.122%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

(Note: benchmarks are pretty much irrelevant at the moment, but I'd like us to get into the habit of always publishing them when modifying store internals, so here ya go)

Closes #437
Partially unblocks #470

Copy link
Member Author

@teh-cmc teh-cmc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO:

  • add before/after benchmarks to this PR: while this is irrelevant for now, it's a good habit to start asap

crates/re_arrow_store/src/store.rs Outdated Show resolved Hide resolved
crates/re_arrow_store/src/store.rs Outdated Show resolved Hide resolved
crates/re_arrow_store/src/store_read.rs Outdated Show resolved Hide resolved
crates/re_arrow_store/src/store_write.rs Outdated Show resolved Hide resolved
crates/re_arrow_store/src/store_write.rs Outdated Show resolved Hide resolved
crates/re_arrow_store/src/store_write.rs Outdated Show resolved Hide resolved
crates/re_arrow_store/src/store_write.rs Outdated Show resolved Hide resolved
crates/re_arrow_store/src/store_write.rs Outdated Show resolved Hide resolved
crates/re_arrow_store/src/store_write.rs Outdated Show resolved Hide resolved
crates/re_arrow_store/src/store_write.rs Outdated Show resolved Hide resolved
@teh-cmc teh-cmc changed the title re_datastore: component bucketing re_datastore: component bucketing & statistics Dec 6, 2022
@teh-cmc teh-cmc marked this pull request as ready for review December 6, 2022 15:43
@emilk emilk self-requested a review December 6, 2022 19:23
/// The maximum size of a component bucket before triggering a split.
///
/// ⚠ When configuring this threshold, do keep in mind that component tables are shared
/// across all timelines and all entities!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can change this easily enough later, but do we get a significant benefit to sharing component tables across entities vs doing something like:

components: HashMap<(ComponentName, EntityPathHash), ComponentTable>,

My thought is that long-term if we're dealing with disk- or remote- backed storage we may only want to load the data relevant to a subset of the object-tree and so preserving that partitioning could be beneficial.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can change this easily enough later, but do we get a significant benefit to sharing component tables across entities

Right now we definitely don't, since deduplication is done synchronously at write-time, and you can only write to multiple timelines at once at the moment, not multiple entites.
I.e. today you get deduplication across timelines (only within the same insert, though!) but never across entities.

Now I guess the question is: do we want to offer the possibility of writing to multiple entities at once at some point, like we do for timelines?
From an SDK standpoint, it could either mean changing the API so that users can pass in a sequence of entity paths, or automagically coalesce things behind the scenes (whether in the SDK, or on the server-side... or both?!).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being able to batch writes of multiple entities from the SDK would be great, and would involve sending multiple schema's across. This is easy using flight, but currently we already have a lot of overhead with the message serialization.

Copy link
Member

@emilk emilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Niiiice

crates/re_arrow_store/src/store.rs Outdated Show resolved Hide resolved
crates/re_arrow_store/src/store.rs Outdated Show resolved Hide resolved
crates/re_arrow_store/src/store.rs Outdated Show resolved Hide resolved
crates/re_arrow_store/src/store.rs Outdated Show resolved Hide resolved
impl DataStoreConfig {
pub const fn const_default() -> Self {
Self {
component_bucket_size_bytes: 32 * 1024 * 1024, // 32MiB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract into something like const COMPONENT_BUCKET_SIZE_BYTES_DEFAULT: u64

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree - that just adds one extra level of indirection imho

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @emilk here, the source of truth is the default impl


/// Returns the size of the data stored across this bucket, in bytes.
pub fn total_size_bytes(&self) -> u64 {
arrow2::compute::aggregate::estimated_bytes_size(&*self.data) as u64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General typing question, should we use usize instead of u64 for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been asking this myself during development and never actually addressed it...

  • usize pros: no wasted space and performance on 32-bit systems
  • usize cons: all code paths have to properly deal with the very real possibility of overflow, and it can get excruciatingly complex
  • u64 pros: wasted space and performance on 32-bit systems
  • u64 cons: just panic on overflow, everywhere, all the time

I can't really see any reason to run database software on 32-bit these days, and even if one absolutely had to_ for some obscure reason, they'd still be minority enough that it's not worth holding back every other platform for their use case.
u64 everywhere it is!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, unless you were crazy enough to build something that was running inside of wasm32… ahem.

Copy link
Member Author

@teh-cmc teh-cmc Dec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not gonna lie: completely forgot about that one lol.

But then again the expected use case is for the viewer to run in the browser, not the database (at least in 99% of cases)... right? 😬

Copy link
Contributor

@jondo2010 jondo2010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@teh-cmc teh-cmc merged commit 6873a63 into main Dec 7, 2022
@teh-cmc teh-cmc deleted the cmc/datastore/component_bucketing branch December 7, 2022 10:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

re_datastore: bucketing support for components
4 participants