re_datastore: component bucketing & statistics #468
Conversation
TODO:
- add before/after benchmarks to this PR: while this is irrelevant for now, it's a good habit to start asap
/// The maximum size of a component bucket before triggering a split.
///
/// ⚠ When configuring this threshold, do keep in mind that component tables are shared
/// across all timelines and all entities!
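For illustration, here is a minimal sketch of the split check this doc comment implies. Only the `component_bucket_size_bytes` field name comes from the diff; the other names and the check itself are assumptions, not the actual implementation.

```rust
struct DataStoreConfig {
    component_bucket_size_bytes: u64,
}

struct ComponentBucket {
    total_size_bytes: u64, // in the real store this is computed from the arrow data
}

impl ComponentBucket {
    /// Whether this bucket has grown past the configured threshold and should
    /// be split before the next write lands in it.
    fn needs_split(&self, config: &DataStoreConfig) -> bool {
        self.total_size_bytes >= config.component_bucket_size_bytes
    }
}

fn main() {
    let config = DataStoreConfig { component_bucket_size_bytes: 32 * 1024 * 1024 };
    let bucket = ComponentBucket { total_size_bytes: 40 * 1024 * 1024 };
    assert!(bucket.needs_split(&config));
}
```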
I think we can change this easily enough later, but do we get a significant benefit to sharing component tables across entities vs doing something like:
components: HashMap<(ComponentName, EntityPathHash), ComponentTable>,
My thought is that long-term, if we're dealing with disk- or remote-backed storage, we may only want to load the data relevant to a subset of the object tree, so preserving that partitioning could be beneficial.
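To make the comparison concrete, a rough sketch of the two layouts under discussion; the type names are stand-ins for illustration, not the store's real definitions.

```rust
#![allow(dead_code)]
use std::collections::HashMap;

// Stand-ins for the real store types, just to show the shape of the two maps.
type ComponentName = String;
type EntityPathHash = u64;
struct ComponentTable;

// Today: one table per component, shared across every entity and timeline.
struct SharedByComponent {
    components: HashMap<ComponentName, ComponentTable>,
}

// Suggestion: partition per (component, entity) pair, so a disk- or
// remote-backed store could load only the tables for the subtree of interest.
struct PartitionedByEntity {
    components: HashMap<(ComponentName, EntityPathHash), ComponentTable>,
}

fn main() {}
```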
I think we can change this easily enough later, but do we get a significant benefit to sharing component tables across entities
Right now we definitely don't, since deduplication is done synchronously at write-time, and at the moment you can only write to multiple timelines at once, not multiple entities.
I.e. today you get deduplication across timelines (only within the same insert, though!) but never across entities.
Now I guess the question is: do we want to offer the possibility of writing to multiple entities at once at some point, like we do for timelines?
From an SDK standpoint, it could either mean changing the API so that users can pass in a sequence of entity paths, or automagically coalesce things behind the scenes (whether in the SDK, or on the server-side... or both?!).
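Purely as a strawman for the API question above: none of these names exist in the SDK today, this is only an assumption about what a multi-entity write could look like.

```rust
// Hypothetical SDK surface; every name here is made up for illustration only.
struct EntityPath(String);
#[derive(Debug)]
struct TimePoint;
struct Bundle; // stand-in for the components being logged for one entity

/// One possible shape: the caller hands over several (entity, bundle) pairs and
/// the SDK (or the server) coalesces them into a single insert, which would let
/// deduplication work across entities the same way it already does across timelines.
fn log_entities(time_point: &TimePoint, writes: &[(EntityPath, Bundle)]) {
    // A real implementation would build one batched message out of all the
    // pairs instead of sending one message per entity.
    for (path, _bundle) in writes {
        println!("would log to {:?} at {:?}", path.0, time_point);
    }
}

fn main() {
    log_entities(
        &TimePoint,
        &[
            (EntityPath("points".into()), Bundle),
            (EntityPath("boxes".into()), Bundle),
        ],
    );
}
```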
Being able to batch writes of multiple entities from the SDK would be great, and would involve sending multiple schemas across. This is easy using Flight, but we already have a lot of overhead with the message serialization.
Niiiice
crates/re_arrow_store/src/store.rs (outdated)
impl DataStoreConfig {
    pub const fn const_default() -> Self {
        Self {
            component_bucket_size_bytes: 32 * 1024 * 1024, // 32MiB
Extract into something like `const COMPONENT_BUCKET_SIZE_BYTES_DEFAULT: u64`
I disagree - that just adds one extra level of indirection imho
I agree with @emilk here, the source of truth is the default impl
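For reference, the pattern being alluded to: the const constructor stays the single source of truth and `Default` just delegates to it. Sketched here with a trimmed-down struct, not the real one.

```rust
pub struct DataStoreConfig {
    pub component_bucket_size_bytes: u64,
}

impl DataStoreConfig {
    // The literal lives in exactly one place...
    pub const fn const_default() -> Self {
        Self {
            component_bucket_size_bytes: 32 * 1024 * 1024, // 32 MiB
        }
    }
}

// ...and the trait impl simply forwards to it, so there is no second constant
// to keep in sync.
impl Default for DataStoreConfig {
    fn default() -> Self {
        Self::const_default()
    }
}

fn main() {
    let config = DataStoreConfig::default();
    assert_eq!(config.component_bucket_size_bytes, 32 * 1024 * 1024);
}
```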
/// Returns the size of the data stored across this bucket, in bytes.
pub fn total_size_bytes(&self) -> u64 {
    arrow2::compute::aggregate::estimated_bytes_size(&*self.data) as u64
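As a quick standalone illustration of what `estimated_bytes_size` reports (assuming the relevant arrow2 compute feature is enabled; the exact figure is version-dependent):

```rust
use arrow2::array::Int64Array;
use arrow2::compute::aggregate::estimated_bytes_size;

fn main() {
    // 1024 i64 values are backed by an 8 KiB buffer; with no validity bitmap
    // the estimate should land right around that figure.
    let array = Int64Array::from_vec((0..1024).collect());
    let size_bytes = estimated_bytes_size(&array) as u64;
    println!("estimated size: {size_bytes} bytes");
}
```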
General typing question: should we use `usize` instead of `u64` for this?
I've been asking this myself during development and never actually addressed it...
- `usize` pros: no wasted space or performance on 32-bit systems
- `usize` cons: all code paths have to properly deal with the very real possibility of overflow, and it can get excruciatingly complex
- `u64` pros: just panic on overflow, everywhere, all the time
- `u64` cons: wasted space and performance on 32-bit systems

I can't really see any reason to run database software on 32-bit these days, and even if one absolutely *had to* for some obscure reason, they'd still be enough of a minority that it's not worth holding back every other platform for their use case.
`u64` everywhere it is!
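A tiny sketch of that trade-off, under the assumption that size accounting is just summing byte counts:

```rust
// With usize, every addition on a 32-bit target is a plausible overflow, so the
// result has to be threaded back to the caller as a fallible value.
fn add_size_usize(total: usize, delta: usize) -> Option<usize> {
    total.checked_add(delta)
}

// With u64, overflowing would require accounting for 16+ exabytes of data, so
// "just panic on overflow" is a perfectly reasonable policy everywhere.
fn add_size_u64(total: u64, delta: u64) -> u64 {
    total.checked_add(delta).expect("component size overflowed u64")
}

fn main() {
    assert_eq!(add_size_usize(8, 8), Some(16));
    assert_eq!(add_size_u64(8, 8), 16);
}
```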
I mean, unless you were crazy enough to build something that was running inside of wasm32… ahem.
Not gonna lie: completely forgot about that one lol.
But then again the expected use case is for the viewer to run in the browser, not the database (at least in 99% of cases)... right? 😬
lgtm
This PR lays out the foundations for bucketing in general, and implements bucketing for components specifically.
Importantly, it specifies, in code, precisely what we mean when we ask "what is the size of this Arrow array?" through a series of tests: look for `test_arrow_estimated_size_bytes`.
It also computes and exposes the first statistics out of the datastore, as we need those for splitting the buckets to start with.
Integration benchmarks:
(Note: benchmarks are pretty much irrelevant at the moment, but I'd like us to get into the habit of always publishing them when modifying store internals, so here ya go)
Closes #437
Partially unblocks #470