Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client-side chunks 3: micro-batching #6440

Merged
merged 2 commits into from
May 31, 2024
Merged

Conversation

teh-cmc
Copy link
Member

@teh-cmc teh-cmc commented May 27, 2024

This is a fork of the old DataTable batcher, and works very similarly.

Like before, this batcher will micro-batch using both space and time thresholds.
There are two main differences:

  • This batcher maintains a dataframe per-entity, as opposed to the old one which worked globally.
  • Once a threshold is reached, this batcher further splits the incoming batch in order to fulfill these invariants:
    /// In particular, a [`Chunk`] cannot:
    /// * contain data for more than one entity path
    /// * contain rows with different sets of timelines
    /// * use more than one datatype for a given component
    /// * contain more rows than a pre-configured threshold if one or more timelines are unsorted
    

Most of the code is the same, the real interesting piece is PendingRow::many_into_chunks, as well as the newly added tests.


Part of a PR series to implement our new chunk-based data model on the client-side (SDKs):

Checklist

  • I have read and agree to Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested the web demo (if applicable):
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
  • If applicable, add a new check to the release checklist!

To run all checks from main, comment on the PR with @rerun-bot full-check.

@teh-cmc teh-cmc added 🚀 performance Optimization, memory use, etc do-not-merge Do not merge this PR include in changelog 🔩 data model 🪵 Log & send APIs Affects the user-facing API for all languages labels May 27, 2024
@teh-cmc teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 07fdd69 to 9d0d4bc Compare May 27, 2024 15:39
@teh-cmc teh-cmc force-pushed the cmc/dense_chunks_2_transport branch from ced515c to 8342ebb Compare May 27, 2024 16:45
@teh-cmc teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 9d0d4bc to 69895e0 Compare May 27, 2024 16:45
@teh-cmc teh-cmc marked this pull request as ready for review May 27, 2024 16:53
@teh-cmc teh-cmc force-pushed the cmc/dense_chunks_2_transport branch from 8342ebb to 4a9b5cd Compare May 29, 2024 07:33
@teh-cmc teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 69895e0 to 7b87a78 Compare May 29, 2024 07:35
flush_tick: Duration::MAX,
flush_num_bytes: u64::MAX,
flush_num_rows: u64::MAX,
max_chunk_rows_if_unsorted: 256,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interaction between this and "never" seems a bit odd. Is this also considered one of the built-in invariants?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes -- in general, only global time and space act as batching thresholds, everything else just splits the result further down into smaller pieces.

crates/re_chunk/src/batcher.rs Show resolved Hide resolved
config(&acc.pending_rows);
}

if acc.pending_rows.len() as u64 >= config.flush_num_rows {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nice side-effect of this refactor is the potential for per-entity-flushing config in the future.

@teh-cmc teh-cmc force-pushed the cmc/dense_chunks_2_transport branch from 4a9b5cd to 05cdde7 Compare May 31, 2024 07:46
@teh-cmc teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 7b87a78 to 08999d7 Compare May 31, 2024 07:56
@teh-cmc teh-cmc removed the do-not-merge Do not merge this PR label May 31, 2024
teh-cmc added a commit that referenced this pull request May 31, 2024
This new and improved `re_format_arrow` ™️ brings two major
improvements:
- It is now designed to format standard Arrow dataframes (aka chunks or
batches), i.e. a `Schema` and a `Chunk`.
In particular: chunk-level and field-level schema metadata will now be
rendered properly with the rest of the table.
- Tables larger than your terminal will now do their best to fit in,
while making sure to still show just enough data.

E.g. here's an excerpt of a real-world Rerun dataframe from our `helix`
example:
```
cargo r -p rerun-cli --no-default-features --features native_viewer -- print helix.rrd --verbose
```

before (`main`):

![image](https://github.com/rerun-io/rerun/assets/2910679/99169b2a-d972-439d-900a-8f122a4d5ca3)

and after:

![image](https://github.com/rerun-io/rerun/assets/2910679/3fe7acce-d646-4ff2-bfae-eb5073d17741)


---

Part of a PR series to implement our new chunk-based data model on the
client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
teh-cmc added a commit that referenced this pull request May 31, 2024
…6438)

Introduces the new `re_chunk` crate:
> A chunk of Rerun data, encoded using Arrow. Used for logging,
transport, storage and compute.

Specifically, it introduces the `Chunk` type itself, and all methods and
helpers related to sorting.
A `Chunk` is self-describing: it contains all the data _and_ metadata
needed to index it into storage.

There are a lot of things that need to be sorted within a `Chunk`, and
as such we must make sure to keep track of what is or isn't sorted at
all times, to avoid needlessly re-sorting things everytime a chunk
changes hands.
This necessitates a bunch of sanity checking all over the place to make
sure we never end up in undefined states.

`Chunk` is not about transport, it's about providing a nice-to-work with
representation when manipulating a chunk in memory.
Transporting a `Chunk` happens in the next PR.

- Fixes #1981

---

Part of a PR series to implement our new chunk-based data model on the
client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
@teh-cmc teh-cmc force-pushed the cmc/dense_chunks_2_transport branch from 05cdde7 to 3be1f77 Compare May 31, 2024 08:42
Base automatically changed from cmc/dense_chunks_2_transport to main May 31, 2024 08:42
teh-cmc added a commit that referenced this pull request May 31, 2024
A `TransportChunk` is a `Chunk` that is ready for transport and/or
storage.
It is very cheap to go from `Chunk` to a `TransportChunk` and
vice-versa.

A `TransportChunk` maps 1:1 to a native Arrow `RecordBatch`. It has a
stable ABI, and can be cheaply send across process boundaries.
`arrow2` has no `RecordBatch` type; we will get one once we migrate to
`arrow-rs`.

A `TransportChunk` is self-describing: it contains all the data _and_
metadata needed to index it into storage.

We rely heavily on chunk-level and field-level metadata to communicate
Rerun-specific semantics over the wire, e.g. whether some columns are
already properly sorted.

The Arrow metadata system is fairly limited -- it's all untyped strings
--, but for now that seems good enough. It will be trivial to switch to
something else later, if need be.

- Fixes #1760
- Fixes #1692
- Fixes #3360 
- Fixes #1696

---

Part of a PR series to implement our new chunk-based data model on the
client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
@teh-cmc teh-cmc force-pushed the cmc/dense_chunks_3_batching branch from 08999d7 to 22f7e61 Compare May 31, 2024 08:44
@teh-cmc teh-cmc merged commit fde4a87 into main May 31, 2024
27 of 28 checks passed
@teh-cmc teh-cmc deleted the cmc/dense_chunks_3_batching branch May 31, 2024 08:46
teh-cmc added a commit that referenced this pull request May 31, 2024
Integrate the new chunk batcher in all SDKs, and get rid of the old one.

On the backend, we make sure to deserialize incoming chunks into the old
`DataTable`s, so business can continue as usual.


Although the new batcher has a much more complicated task with all these
sub-splits to manage, it is somehow already more performant than the old
one 🤷‍♂️:
```bash
# this branch
cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual'
Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual
  Time (mean ± σ):      4.499 s ±  0.117 s    [User: 5.544 s, System: 1.836 s]
  Range (min … max):    4.226 s …  4.640 s    15 runs

# main
cargo b -p log_benchmark --release && hyperfine --runs 15 './target/release/log_benchmark --benchmarks points3d_many_individual'
Benchmark 1: ./target/release/log_benchmark --benchmarks points3d_many_individual
  Time (mean ± σ):      4.407 s ±  0.773 s    [User: 8.423 s, System: 0.880 s]
  Range (min … max):    2.997 s …  6.148 s    15 runs
```
Notice the massive difference in user time.

---

Part of a PR series to implement our new chunk-based data model on the
client-side (SDKs):
- #6437
- #6438
- #6439
- #6440
- #6441
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🔩 data model include in changelog 🪵 Log & send APIs Affects the user-facing API for all languages 🚀 performance Optimization, memory use, etc
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Batcher should sort data in addition to batching
2 participants