re_datastore: component chunks & streamlining batches #584
@@ -836,14 +836,50 @@ pub struct ComponentBucket {
    /// The offset of this bucket in the global table.
    pub(crate) row_offset: RowIndex,

    /// Has this bucket been retired yet?
    ///
    /// At any given moment, all buckets except the currently active one have to be retired.
    pub(crate) retired: bool,

    /// The time ranges (plural!) covered by this bucket.
    /// Buckets are never sorted over time, so these time ranges can grow arbitrarily large.
    ///
    /// These are only used for garbage collection.
    pub(crate) time_ranges: HashMap<Timeline, TimeRange>,

    /// All the data for this bucket. This is a single column!
    pub(crate) data: Box<dyn Array>,
    /// All the data for this bucket: many rows of a single column.
    ///
    /// Each chunk is a list of list of components, i.e. `ListArray<ListArray<StructArray>>`:
Suggested change
would match the actual type a bit closer, confusing me slightly less :)

The type of

I agree with Emil here: a chunk is (and should be) still just a single
The array index corresponds to the different rows.

Oh yeah, no, re-reading this now I can see you're definitely both right; got my head all messed up earlier. Will fix all these docs first thing tomorrow, thanks 👍
    /// - the first list layer corresponds to the different rows,
    /// - the second list layer corresponds to the different instances within a single row,
    /// - and finally the struct layer is the component itself.
    /// E.g.:
    /// ```ignore
    /// [
    ///     [{x: 8.687487, y: 1.9590926}, {x: 2.0559108, y: 0.1494348}, {x: 7.09219, y: 0.9616637}],
    ///     [{x: 7.158843, y: 0.68897724}, {x: 8.934421, y: 2.8420508}],
    /// ]
    /// ```
    ///
    /// During the active lifespan of the bucket, this can contain any number of chunks,
    /// depending on how the data was inserted (e.g. single insertions vs. batches).
    /// All of these chunks get compacted into one contiguous array when the bucket is retired,
    /// i.e. when the bucket is full and a new one is created.
    ///
    /// Note that, as of today, we do not actually support batched insertion nor do we support
    /// chunks of non-unit length (batches are inserted on a per-row basis internally).
    /// As a result, chunks always contain one and only one row's worth of data, at least until
    /// the bucket is retired and compacted.
    /// See also #589.
    pub(crate) chunks: Vec<Box<dyn Array>>,

    /// The total number of rows present in this bucket, across all chunks.
    pub(crate) total_rows: u64,
    /// The size of this bucket in bytes, across all chunks.
    ///
    /// Accurately computing the size of arrow arrays is surprisingly costly, which is why we
    /// cache this.
    pub(crate) total_size_bytes: u64,
}

impl std::fmt::Display for ComponentBucket {
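The doc comment on `chunks` describes two things: the nesting (rows, then instances within a row, then the component itself) and the compaction of all chunks into one contiguous array on retirement. The sketch below models both with plain `Vec`s instead of arrow arrays, so it is only an approximation of the idea. `Point2D`, `BucketModel`, `push_row` and `retire` are hypothetical names that do not appear in the PR.

```rust
/// Hypothetical stand-in for a component value (the real code stores arrow arrays).
#[derive(Debug, Clone, PartialEq)]
pub struct Point2D {
    pub x: f32,
    pub y: f32,
}

/// One row: all the instances of the component for that row.
pub type Row = Vec<Point2D>;
/// One chunk: a list of rows.
pub type Chunk = Vec<Row>;

/// Hypothetical, simplified model of a bucket; not the actual `ComponentBucket`.
#[derive(Debug, Default)]
pub struct BucketModel {
    pub retired: bool,
    pub chunks: Vec<Chunk>,
}

impl BucketModel {
    /// Appends a unit-length chunk, mirroring the per-row insertion described in the doc comment.
    pub fn push_row(&mut self, row: Row) {
        self.chunks.push(vec![row]);
    }

    /// Retires the bucket: every chunk is merged into a single contiguous chunk.
    pub fn retire(&mut self) {
        let compacted: Chunk = std::mem::take(&mut self.chunks)
            .into_iter()
            .flatten()
            .collect();
        self.chunks = vec![compacted];
        self.retired = true;
    }
}
```

Before `retire` is called the model holds one chunk per inserted row. Afterwards it holds a single chunk with the same rows in insertion order.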
@@ -862,14 +898,16 @@ impl std::fmt::Display for ComponentBucket {
            // - all buckets that follow are lazily instantiated when data get inserted
            //
            // TODO(#439): is that still true with deletion?
            // TODO(#589): support for non-unit-length chunks
            self.row_offset.as_u64()
                + self
                    .data
                    .chunks
                    .len()
                    .checked_sub(1)
                    .expect("buckets are never empty") as u64,
        ))?;

        f.write_fmt(format_args!("retired: {}\n", self.retired))?;
        f.write_str("time ranges:\n")?;
        for (timeline, time_range) in &self.time_ranges {
            f.write_fmt(format_args!(
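The hunk above computes the last row covered by the bucket as `row_offset + chunks.len() - 1`. That only holds while every chunk contains exactly one row, which is what `TODO(#589)` flags. A minimal sketch of the same arithmetic follows. The free function `last_row_index` is a hypothetical helper, and it returns `Option` where the diff uses `expect`, purely to make the empty case visible.

```rust
/// Returns the index of the last row in a bucket, or `None` if the bucket has no chunks.
///
/// Assumes unit-length chunks (one row per chunk), so the chunk count equals the row count.
/// `last_row_index` is a hypothetical helper, not a function from the PR.
pub fn last_row_index(row_offset: u64, num_chunks: usize) -> Option<u64> {
    // `checked_sub` avoids an underflow when there are no chunks at all.
    let last_local = num_chunks.checked_sub(1)?;
    Some(row_offset + last_local as u64)
}
```

The PR panics on the empty case because buckets are documented as never empty. Returning `None` is just a different trade-off for illustration.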
@@ -878,7 +916,12 @@ impl std::fmt::Display for ComponentBucket {
            ))?;
        }

        let chunk = Chunk::new(vec![self.data()]);
        let rows = {
            use arrow2::compute::concatenate::concatenate;
            let chunks = self.chunks.iter().map(|chunk| &**chunk).collect::<Vec<_>>();
            vec![concatenate(&chunks).unwrap()]
        };
        let chunk = Chunk::new(rows);
        f.write_str(&arrow2::io::print::write(&[chunk], &[self.name.as_str()]))?;

        Ok(())
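The `map(|chunk| &**chunk)` step in the hunk above turns each `&Box<dyn Array>` into a `&dyn Array`, so the chunks can be passed to `concatenate` as a slice of borrowed trait objects without cloning. The sketch below reproduces that borrowing pattern with the standard library only. The `Array` trait, `Int64Chunk`, and this toy `concatenate` are hypothetical stand-ins and not the arrow2 items of the same name.

```rust
/// Hypothetical minimal trait standing in for arrow2's `Array`.
pub trait Array {
    fn values(&self) -> Vec<i64>;
}

/// Hypothetical concrete array type used only for this sketch.
pub struct Int64Chunk(pub Vec<i64>);

impl Array for Int64Chunk {
    fn values(&self) -> Vec<i64> {
        self.0.clone()
    }
}

/// Toy `concatenate`: takes borrowed trait objects, like the call in the diff.
pub fn concatenate(arrays: &[&dyn Array]) -> Vec<i64> {
    arrays.iter().flat_map(|array| array.values()).collect()
}

/// Concatenates boxed chunks by first borrowing each one as `&dyn Array`.
pub fn concat_chunks(chunks: &[Box<dyn Array>]) -> Vec<i64> {
    // `chunk` is a `&Box<dyn Array>`: the first `*` reaches the `Box`, the second
    // reaches the `dyn Array` inside it, and `&` re-borrows that as `&dyn Array`.
    let borrowed = chunks.iter().map(|chunk| &**chunk).collect::<Vec<_>>();
    concatenate(&borrowed)
}
```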
@@ -888,12 +931,12 @@ impl std::fmt::Display for ComponentBucket {
impl ComponentBucket {
    /// Returns the number of rows stored across this bucket.
    pub fn total_rows(&self) -> u64 {
        self.data.len() as u64
        self.total_rows
    }

    /// Returns the size of the data stored across this bucket, in bytes.
    pub fn total_size_bytes(&self) -> u64 {
        arrow2::compute::aggregate::estimated_bytes_size(&*self.data) as u64
        self.total_size_bytes
    }
}
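The accessors above now return cached counters instead of recomputing the row count and byte size on every call. One way to keep such counters correct is to update them at insertion time, as in the rough sketch below. The PR excerpt does not show where the real counters are updated, so this is an assumption. `CachedBucket`, `push_chunk` and `recompute_size_bytes` are hypothetical names, and byte vectors stand in for arrow arrays.

```rust
/// Hypothetical model of a bucket that caches its row count and byte size.
#[derive(Debug, Default)]
pub struct CachedBucket {
    chunks: Vec<Vec<u8>>,
    total_rows: u64,
    total_size_bytes: u64,
}

impl CachedBucket {
    /// Inserts one unit-length chunk and updates both cached counters in O(1).
    pub fn push_chunk(&mut self, chunk: Vec<u8>) {
        self.total_rows += 1;
        self.total_size_bytes += chunk.len() as u64;
        self.chunks.push(chunk);
    }

    /// Cheap accessor: reads the cached row count.
    pub fn total_rows(&self) -> u64 {
        self.total_rows
    }

    /// Cheap accessor: reads the cached byte size.
    pub fn total_size_bytes(&self) -> u64 {
        self.total_size_bytes
    }

    /// Slow path that walks every chunk; only useful to check the cache.
    pub fn recompute_size_bytes(&self) -> u64 {
        self.chunks.iter().map(|chunk| chunk.len() as u64).sum()
    }
}
```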
This needs a description of what `retired` actually means.
From the PR description it sounds like it means that it is full, or that it is full and has also been compacted.
If so, perhaps `full`, `compacted` or `!active` is a better name? "Retired" makes me think it is no longer in use. Or is this common DB vernacular?

"Retired" in this case means that it's been archived and is now read-only, so it'll only be used for the read path from now on.
Being retired implies a bunch of things, currently:
I definitely need to improve the doc there. As for the name... maybe `archived` then?

Added something along those lines in the doc-comment 🤞
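To make the "archived and read-only" reading of `retired` concrete, here is a minimal sketch in which a retired bucket rejects writes but still serves reads. It illustrates the semantics described in the reply above and is not the PR's implementation. `ArchivableBucket`, `InsertError` and the `u32` row type are hypothetical.

```rust
/// Hypothetical error returned when writing to a retired bucket.
#[derive(Debug, PartialEq, Eq)]
pub enum InsertError {
    BucketRetired,
}

/// Hypothetical bucket that becomes read-only once retired.
#[derive(Debug, Default)]
pub struct ArchivableBucket {
    retired: bool,
    rows: Vec<u32>,
}

impl ArchivableBucket {
    /// Write path: only allowed while the bucket is still active.
    pub fn insert(&mut self, row: u32) -> Result<(), InsertError> {
        if self.retired {
            return Err(InsertError::BucketRetired);
        }
        self.rows.push(row);
        Ok(())
    }

    /// Marks the bucket as retired; from now on only the read path is used.
    pub fn retire(&mut self) {
        self.retired = true;
    }

    /// Read path: available both before and after retirement.
    pub fn rows(&self) -> &[u32] {
        &self.rows
    }
}
```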