re_datastore: component bucketing & statistics #468
Conversation
TODO:
- add before/after benchmarks to this PR: while this is irrelevant for now, it's a good habit to start asap
/// The maximum size of a component bucket before triggering a split.
///
/// ⚠ When configuring this threshold, do keep in mind that component tables are shared
/// across all timelines and all entities!
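For illustration, here is a minimal sketch of the split check this doc comment implies. Only the `component_bucket_size_bytes` field name comes from the diff; the other names and the check itself are assumptions, not the actual implementation.

```rust
struct DataStoreConfig {
    component_bucket_size_bytes: u64,
}

struct ComponentBucket {
    total_size_bytes: u64, // in the real store this is computed from the arrow data
}

impl ComponentBucket {
    /// Whether this bucket has grown past the configured threshold and should
    /// be split before the next write lands in it.
    fn needs_split(&self, config: &DataStoreConfig) -> bool {
        self.total_size_bytes >= config.component_bucket_size_bytes
    }
}

fn main() {
    let config = DataStoreConfig { component_bucket_size_bytes: 32 * 1024 * 1024 };
    let bucket = ComponentBucket { total_size_bytes: 40 * 1024 * 1024 };
    assert!(bucket.needs_split(&config));
}
```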
I think we can change this easily enough later, but do we get a significant benefit to sharing component tables across entities vs doing something like:
components: HashMap<(ComponentName, EntityPathHash), ComponentTable>,
My thought is that long-term, if we're dealing with disk- or remote-backed storage, we may only want to load the data relevant to a subset of the object tree, so preserving that partitioning could be beneficial.
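To make the comparison concrete, a rough sketch of the two layouts under discussion; the type names are stand-ins for illustration, not the store's real definitions.

```rust
#![allow(dead_code)]
use std::collections::HashMap;

// Stand-ins for the real store types, just to show the shape of the two maps.
type ComponentName = String;
type EntityPathHash = u64;
struct ComponentTable;

// Today: one table per component, shared across every entity and timeline.
struct SharedByComponent {
    components: HashMap<ComponentName, ComponentTable>,
}

// Suggestion: partition per (component, entity) pair, so a disk- or
// remote-backed store could load only the tables for the subtree of interest.
struct PartitionedByEntity {
    components: HashMap<(ComponentName, EntityPathHash), ComponentTable>,
}

fn main() {}
```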
I think we can change this easily enough later, but do we get a significant benefit to sharing component tables across entities
Right now we definitely don't, since deduplication is done synchronously at write-time, and at the moment you can only write to multiple timelines at once, not multiple entities.
I.e. today you get deduplication across timelines (only within the same insert, though!) but never across entities.
Now I guess the question is: do we want to offer the possibility of writing to multiple entities at once at some point, like we do for timelines?
From an SDK standpoint, it could either mean changing the API so that users can pass in a sequence of entity paths, or automagically coalesce things behind the scenes (whether in the SDK, or on the server-side... or both?!).
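Purely as a strawman for the API question above: none of these names exist in the SDK today, this is only an assumption about what a multi-entity write could look like.

```rust
// Hypothetical SDK surface; every name here is made up for illustration only.
struct EntityPath(String);
#[derive(Debug)]
struct TimePoint;
struct Bundle; // stand-in for the components being logged for one entity

/// One possible shape: the caller hands over several (entity, bundle) pairs and
/// the SDK (or the server) coalesces them into a single insert, which would let
/// deduplication work across entities the same way it already does across timelines.
fn log_entities(time_point: &TimePoint, writes: &[(EntityPath, Bundle)]) {
    // A real implementation would build one batched message out of all the
    // pairs instead of sending one message per entity.
    for (path, _bundle) in writes {
        println!("would log to {:?} at {:?}", path.0, time_point);
    }
}

fn main() {
    log_entities(
        &TimePoint,
        &[
            (EntityPath("points".into()), Bundle),
            (EntityPath("boxes".into()), Bundle),
        ],
    );
}
```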
Being able to batch writes of multiple entities from the SDK would be great, and would involve sending multiple schemas across. This is easy using Flight, but we already have a lot of overhead with the message serialization.
Niiiice
crates/re_arrow_store/src/store.rs (outdated)
impl DataStoreConfig {
    pub const fn const_default() -> Self {
        Self {
            component_bucket_size_bytes: 32 * 1024 * 1024, // 32MiB
Extract into something like `const COMPONENT_BUCKET_SIZE_BYTES_DEFAULT: u64`
I disagree - that just adds one extra level of indirection imho
I agree with @emilk here, the source of truth is the default impl
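For reference, the pattern being alluded to: the const constructor stays the single source of truth and `Default` just delegates to it. Sketched here with a trimmed-down struct, not the real one.

```rust
pub struct DataStoreConfig {
    pub component_bucket_size_bytes: u64,
}

impl DataStoreConfig {
    // The literal lives in exactly one place...
    pub const fn const_default() -> Self {
        Self {
            component_bucket_size_bytes: 32 * 1024 * 1024, // 32 MiB
        }
    }
}

// ...and the trait impl simply forwards to it, so there is no second constant
// to keep in sync.
impl Default for DataStoreConfig {
    fn default() -> Self {
        Self::const_default()
    }
}

fn main() {
    let config = DataStoreConfig::default();
    assert_eq!(config.component_bucket_size_bytes, 32 * 1024 * 1024);
}
```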
/// Returns the size of the data stored across this bucket, in bytes.
pub fn total_size_bytes(&self) -> u64 {
    arrow2::compute::aggregate::estimated_bytes_size(&*self.data) as u64
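As a quick standalone illustration of what `estimated_bytes_size` reports (assuming the relevant arrow2 compute feature is enabled; the exact figure is version-dependent):

```rust
use arrow2::array::Int64Array;
use arrow2::compute::aggregate::estimated_bytes_size;

fn main() {
    // 1024 i64 values are backed by an 8 KiB buffer; with no validity bitmap
    // the estimate should land right around that figure.
    let array = Int64Array::from_vec((0..1024).collect());
    let size_bytes = estimated_bytes_size(&array) as u64;
    println!("estimated size: {size_bytes} bytes");
}
```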
General typing question: should we use `usize` instead of `u64` for this?
I've been asking this myself during development and never actually addressed it...
- `usize` pros: no wasted space or performance on 32-bit systems
- `usize` cons: all code paths have to properly deal with the very real possibility of overflow, and it can get excruciatingly complex
- `u64` pros: just panic on overflow, everywhere, all the time
- `u64` cons: wasted space and performance on 32-bit systems

I can't really see any reason to run database software on 32-bit these days, and even if one absolutely *had to* for some obscure reason, they'd still be enough of a minority that it's not worth holding back every other platform for their use case.
`u64` everywhere it is!
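A tiny sketch of that trade-off, under the assumption that size accounting is just summing byte counts:

```rust
// With usize, every addition on a 32-bit target is a plausible overflow, so the
// result has to be threaded back to the caller as a fallible value.
fn add_size_usize(total: usize, delta: usize) -> Option<usize> {
    total.checked_add(delta)
}

// With u64, overflowing would require accounting for 16+ exabytes of data, so
// "just panic on overflow" is a perfectly reasonable policy everywhere.
fn add_size_u64(total: u64, delta: u64) -> u64 {
    total.checked_add(delta).expect("component size overflowed u64")
}

fn main() {
    assert_eq!(add_size_usize(8, 8), Some(16));
    assert_eq!(add_size_u64(8, 8), 16);
}
```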
I mean, unless you were crazy enough to build something that was running inside of wasm32… ahem.
Not gonna lie: completely forgot about that one lol.
But then again the expected use case is for the viewer to run in the browser, not the database (at least in 99% of cases)... right? 😬
lgtm
This PR lays out the foundations for bucketing in general, and implements bucketing for components specifically.
Importantly, it specifies, in code, precisely what we mean when we ask "what is the size of this Arrow array?" through a series of tests: look for `test_arrow_estimated_size_bytes`.
It also computes and exposes the first statistics out of the datastore, as we need those for splitting the buckets to start with.
Integration benchmarks:
(Note: benchmarks are pretty much irrelevant at the moment, but I'd like us to get into the habit of always publishing them when modifying store internals, so here ya go)
Closes #437
Partially unblocks #470