Client-side chunks 2: introduce `TransportChunk` #6439

teh-cmc · 2024-05-27T15:28:56Z

A TransportChunk is a Chunk that is ready for transport and/or storage.
It is very cheap to go from Chunk to a TransportChunk and vice-versa.

A TransportChunk maps 1:1 to a native Arrow RecordBatch. It has a stable ABI, and can be cheaply send across process boundaries.
arrow2 has no RecordBatch type; we will get one once we migrate to arrow-rs.

A TransportChunk is self-describing: it contains all the data and metadata needed to index it into storage.

We rely heavily on chunk-level and field-level metadata to communicate Rerun-specific semantics over the wire, e.g. whether some columns are already properly sorted.

The Arrow metadata system is fairly limited -- it's all untyped strings --, but for now that seems good enough. It will be trivial to switch to something else later, if need be.

Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):

Checklist

I have read and agree to Contributor Guide and the Code of Conduct
I've included a screenshot or gif (if applicable)
I have tested the web demo (if applicable):
- Using examples from latest main build: rerun.io/viewer
- Using full set of examples from nightly build: rerun.io/viewer
The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
If applicable, add a new check to the release checklist!

To run all checks from main, comment on the PR with @rerun-bot full-check.

jleibs · 2024-05-30T16:06:42Z

crates/re_chunk/src/transport.rs

+    /// The marker used to identify whether a column is sorted in field-level [`ArrowSchema`] metadata.
+    ///
+    /// The associated value is irrelevant -- if this marker is present, then it is true.
+    pub const FIELD_METADATA_MARKER_IS_SORTED: &'static str = Self::CHUNK_METADATA_MARKER_IS_SORTED;


I don't understand how this would be used in practice. All columns must have the same sort order within a chunk. This value seems like it would always have to match CHUNK_METADATA_MARKER_IS_SORTED

Ahh, I think I understand, but could probably use an improved name / comment.

It seems like:

CHUNK_METADATA_MARKER_IS_SORTED => This chunk is sorted by row-id

FIELD_METADATA_MARKER_IS_SORTED => This chunk is sorted on this timeline

If all timelines are monotically increasing they all might be set, but it's possible to be sorted by row-id, but not sorted by timeline when out-of-order logging shows up.

Yes, that's exactly what it is. I'll see if i can improve the naming/docs.

This new and improved `re_format_arrow` ™️ brings two major improvements: - It is now designed to format standard Arrow dataframes (aka chunks or batches), i.e. a `Schema` and a `Chunk`. In particular: chunk-level and field-level schema metadata will now be rendered properly with the rest of the table. - Tables larger than your terminal will now do their best to fit in, while making sure to still show just enough data. E.g. here's an excerpt of a real-world Rerun dataframe from our `helix` example: ``` cargo r -p rerun-cli --no-default-features --features native_viewer -- print helix.rrd --verbose ``` before (`main`): ![image](https://github.com/rerun-io/rerun/assets/2910679/99169b2a-d972-439d-900a-8f122a4d5ca3) and after: ![image](https://github.com/rerun-io/rerun/assets/2910679/3fe7acce-d646-4ff2-bfae-eb5073d17741) --- Part of a PR series to implement our new chunk-based data model on the client-side (SDKs): - #6437 - #6438 - #6439 - #6440 - #6441

…6438) Introduces the new `re_chunk` crate: > A chunk of Rerun data, encoded using Arrow. Used for logging, transport, storage and compute. Specifically, it introduces the `Chunk` type itself, and all methods and helpers related to sorting. A `Chunk` is self-describing: it contains all the data _and_ metadata needed to index it into storage. There are a lot of things that need to be sorted within a `Chunk`, and as such we must make sure to keep track of what is or isn't sorted at all times, to avoid needlessly re-sorting things everytime a chunk changes hands. This necessitates a bunch of sanity checking all over the place to make sure we never end up in undefined states. `Chunk` is not about transport, it's about providing a nice-to-work with representation when manipulating a chunk in memory. Transporting a `Chunk` happens in the next PR. - Fixes #1981 --- Part of a PR series to implement our new chunk-based data model on the client-side (SDKs): - #6437 - #6438 - #6439 - #6440 - #6441

This is a fork of the old `DataTable` batcher, and works very similarly. Like before, this batcher will micro-batch using both space and time thresholds. There are two main differences: - This batcher maintains a dataframe per-entity, as opposed to the old one which worked globally. - Once a threshold is reached, this batcher further splits the incoming batch in order to fulfill these invariants: ```rust /// In particular, a [`Chunk`] cannot: /// * contain data for more than one entity path /// * contain rows with different sets of timelines /// * use more than one datatype for a given component /// * contain more rows than a pre-configured threshold if one or more timelines are unsorted ``` Most of the code is the same, the real interesting piece is `PendingRow::many_into_chunks`, as well as the newly added tests. - Fixes #4431 --- Part of a PR series to implement our new chunk-based data model on the client-side (SDKs): - #6437 - #6438 - #6439 - #6440 - #6441

Integrate the new chunk batcher in all SDKs, and get rid of the old one. On the backend, we make sure to deserialize incoming chunks into the old `DataTable`s, so business can continue as usual. Although the new batcher has a much more complicated task with all these sub-splits to manage, it is somehow already more performant than the old one 🤷‍♂️: ```bash # this branch cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual' Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual Time (mean ± σ): 4.499 s ± 0.117 s [User: 5.544 s, System: 1.836 s] Range (min … max): 4.226 s … 4.640 s 15 runs # main cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual' Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual Time (mean ± σ): 4.407 s ± 0.773 s [User: 8.423 s, System: 0.880 s] Range (min … max): 2.997 s … 6.148 s 15 runs ``` Notice the massive difference in user time. --- Part of a PR series to implement our new chunk-based data model on the client-side (SDKs): - #6437 - #6438 - #6439 - #6440 - #6441

teh-cmc added 🏹 arrow concerning arrow ⛃ re_datastore affects the datastore itself do-not-merge Do not merge this PR include in changelog 🪵 Log & send APIs Affects the user-facing API for all languages labels May 27, 2024

This was referenced May 27, 2024

Client-side chunks 0: improved arrow chunk formatters #6437

Merged

Client-side chunks 1: introduce Chunk and its suffle/sort routines #6438

Merged

Client-side chunks 3: micro-batching #6440

Merged

Client-side chunks 4: integrations #6441

Merged

teh-cmc force-pushed the cmc/dense_chunks_2_transport branch from ced515c to 8342ebb Compare May 27, 2024 16:45

teh-cmc marked this pull request as ready for review May 27, 2024 16:46

teh-cmc force-pushed the cmc/dense_chunks_1_intro branch from 7276a7c to 94a5e7f Compare May 29, 2024 07:31

teh-cmc force-pushed the cmc/dense_chunks_2_transport branch from 8342ebb to 4a9b5cd Compare May 29, 2024 07:33

jleibs approved these changes May 30, 2024

View reviewed changes

teh-cmc force-pushed the cmc/dense_chunks_2_transport branch from 4a9b5cd to 05cdde7 Compare May 31, 2024 07:46

teh-cmc removed the do-not-merge Do not merge this PR label May 31, 2024

teh-cmc force-pushed the cmc/dense_chunks_1_intro branch from 1981f31 to e22ff58 Compare May 31, 2024 08:27

Base automatically changed from cmc/dense_chunks_1_intro to main May 31, 2024 08:39

teh-cmc added 4 commits May 31, 2024 10:42

add MyPoint64 example component

51afbc4

Tuid::from_u128

3f11f85

implement TransportChunk

6bf04a8

review

3be1f77

teh-cmc force-pushed the cmc/dense_chunks_2_transport branch from 05cdde7 to 3be1f77 Compare May 31, 2024 08:42

teh-cmc merged commit b4b7ec4 into main May 31, 2024
24 of 28 checks passed

teh-cmc deleted the cmc/dense_chunks_2_transport branch May 31, 2024 08:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Client-side chunks 2: introduce `TransportChunk` #6439

Client-side chunks 2: introduce `TransportChunk` #6439

teh-cmc commented May 27, 2024 •

edited by github-actions bot

Loading

jleibs May 30, 2024

jleibs May 30, 2024 •

edited

Loading

teh-cmc May 31, 2024

Client-side chunks 2: introduce TransportChunk #6439

Client-side chunks 2: introduce TransportChunk #6439

Conversation

teh-cmc commented May 27, 2024 • edited by github-actions bot Loading

Checklist

jleibs May 30, 2024

Choose a reason for hiding this comment

jleibs May 30, 2024 • edited Loading

Choose a reason for hiding this comment

teh-cmc May 31, 2024

Choose a reason for hiding this comment

Client-side chunks 2: introduce `TransportChunk` #6439

Client-side chunks 2: introduce `TransportChunk` #6439

teh-cmc commented May 27, 2024 •

edited by github-actions bot

Loading

jleibs May 30, 2024 •

edited

Loading