
Tracking issue: end-to-end batching #1619

Closed
14 of 18 tasks
teh-cmc opened this issue Mar 20, 2023 · 2 comments · Fixed by #1985

Comments


teh-cmc commented Mar 20, 2023

  • Will create individual issues as the need arises.
  • Most likely an evolving document.

RFC


  • Move DataStore sanity checks and formatting tools to separate files
    store.rs is supposed to be the place where one can get an overview of all the data structures involved in the store, except it has slowly become a mess over time and is now pretty much unreadable.

  • Implement all the needed tests & benchmarks
    We need to be able to check for regressions at every step, so make sure we have all the tests and benchmarks we need for that.
    We should already be 95% of the way there at this point.

  • Replace MsgBundle & ComponentBundle with the new types (DataCell, DataRow, DataTable, EventId, BatchId...)
    No actual batching features nor behavior changes of any kind: just define the new types and use them everywhere.

  • Pass entity path as a column rather than as metadata
    Replace the current entity_path that is passed in the metadata map with an actual column. This will also require us to make EntityPath a proper arrow datatype (..datatype, not component!!).

  • Make sure implicit instance counts have been wiped everywhere #1892
    Issue created; not blocking for batching.

  • Eliminate legacy splats #1893
    Issue created; not blocking for batching.

  • Get rid of component buckets altogether
    Update the store implementation to remove component tables, remove the get APIs, introduce slicing on the write path, etc. Still no batching in sight!

  • SDK-side log batching #1880

  • Implement the coalescing/accumulation logic in the SDK
    Add the required logic/thread/timers/whatever-else in the SDKs to accumulate data and just send it all as many LogMsgs (i.e. no batching yet). A rough sketch of the idea follows this list.

  • Implement full-on batching
    End-to-end: transport, storage, the whole shebang.

  • Sort the batch before sending (by (event_id, entity_path))
    Keep that in its own PR to keep track of the benchmarks. (The sketch after this list includes the sort step.)

  • Implement new GC
    The complete implementation; should close all existing GC issues. (A second sketch after this list outlines the approach.)

  • Dump directly from the store into an rrd file
    No rebatching yet, just dump every event in its own LogMsg.

  • Remove LogMsgs from LogDb
    We shouldn't need to keep track of events outside the store past this point: clean it all up.
    Reminder: the timeline widget keeps track of timepoints directly, not events.

  • Rebatch aggressively while dumping the store to a stream of LogMsg #1894
    Issue created; not blocking for batching.

  • Make log_time column implicit and potentially introduce ingest_time #1891
    Issue created; not blocking for batching.

  • A Component's DataType should embed its metadata #1696
    Issue created; not blocking for batching.

  • re_datastore: replace anyhow::Error usage with a thiserror derived Error type #527
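
A minimal sketch of the SDK-side accumulation step mentioned above, using std-only stand-ins (`DataRow`, `Batcher`, and `send` are illustrative names here, not the actual SDK API). The real implementation would run this off a dedicated thread with proper timers; this version only checks the deadline on push, for brevity:

```rust
use std::time::{Duration, Instant};

struct DataRow {
    event_id: u64, // stand-in for the real 16-byte id
    entity_path: String,
    // component cells elided
}

struct Batcher {
    pending: Vec<DataRow>,
    last_flush: Instant,
    max_rows: usize,
    max_latency: Duration,
}

impl Batcher {
    fn new(max_rows: usize, max_latency: Duration) -> Self {
        Self {
            pending: Vec::new(),
            last_flush: Instant::now(),
            max_rows,
            max_latency,
        }
    }

    /// Called on every log call; flushes once either threshold is hit.
    fn push(&mut self, row: DataRow) {
        self.pending.push(row);
        if self.pending.len() >= self.max_rows || self.last_flush.elapsed() >= self.max_latency {
            self.flush();
        }
    }

    /// Sorts by `(event_id, entity_path)`, then ships everything as one batch.
    fn flush(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        self.pending
            .sort_by(|a, b| (a.event_id, &a.entity_path).cmp(&(b.event_id, &b.entity_path)));
        let batch = std::mem::take(&mut self.pending);
        self.last_flush = Instant::now();
        send(batch);
    }
}

fn send(batch: Vec<DataRow>) {
    // Stand-in for serializing the whole batch into a single `LogMsg`.
    println!("sending {} rows", batch.len());
}
```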
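
And an equally hypothetical sketch of the "precise" part of the new GC (again, illustrative names only): since the store tracks each row's exact byte size at insertion time, the collector can drop rows oldest-first until it is back under budget, with no estimation involved:

```rust
use std::collections::VecDeque;

struct StoredRow {
    event_id: u64,
    num_bytes: u64, // exact size, tracked at insertion time
}

/// Drops the oldest rows until `used` fits within `budget`;
/// returns the ids of everything that was dropped.
fn collect_garbage(rows: &mut VecDeque<StoredRow>, used: &mut u64, budget: u64) -> Vec<u64> {
    let mut dropped = Vec::new();
    while *used > budget {
        let Some(row) = rows.pop_front() else { break };
        *used -= row.num_bytes;
        dropped.push(row.event_id);
    }
    dropped
}
```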

@teh-cmc teh-cmc added enhancement New feature or request 🐍 Python API Python logging API 🏹 arrow concerning arrow 🦀 Rust API Rust logging API 🎄 tracking issue issue that tracks a bunch of subissues ⛃ re_datastore affects the datastore itself 🚀 performance Optimization, memory use, etc labels Mar 20, 2023
@teh-cmc teh-cmc self-assigned this Mar 20, 2023

teh-cmc commented Apr 17, 2023

Copy-pasting the Discord thread regarding Data{Cell,Row,Table} and the new datastore, for posterity.

===

Hey folks, quick update on the data front: we've been putting a lot of effort into redesigning and reimplementing some of our core data structures and pipelines lately. The goal is to make them align better with the user-facing data model that we've been refining for the past year.

In practice, this already translates into very significant compute & memory performance improvements across the stack starting today, and paves the way for even more of those in the future (ingestion speed, query speed, memory usage, network bandwidth, garbage collection throughput & latency...).

These changes are available right now on latest main (starting with 925f531), and should ship as part of the next (0.5.0) release.
Note: this breaks compatibility for .rrd files, you'll have to regenerate those!


The first big chunk of this work was the introduction of new core data types to abstract over raw Arrow data: DataCell, DataRow & DataTable (#1634, #1636, #1673, #1679).

These new abstractions make it much more manageable to work efficiently with raw Arrow data across the entire stack (SDK, transport, datastore, query layer... all the way from the clients up to the renderer!), as well as guard against common Arrow pitfalls.
It is now easier to implement new data-centric features, one example of which is micro-batching: an upcoming feature (#1619) for our SDKs that will significantly improve network bandwidth and ingestion speeds.
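
For readers catching up, the three types nest as cell, then row, then table. A rough, self-contained sketch of the layering; these are plain Rust stand-ins, not the actual definitions, which live in re_log_types and wrap arrow2 arrays:

```rust
/// One component's worth of data for one row,
/// e.g. all the positions of a point cloud at one timestamp.
struct DataCell {
    component_name: String,
    values: Vec<u8>, // stand-in for a type-erased arrow2 array
}

/// One event: a single entity path and timepoint,
/// with one cell per logged component.
struct DataRow {
    row_id: u64, // the real id is 16 bytes (see the comment below)
    entity_path: String,
    num_instances: u32,
    cells: Vec<DataCell>,
}

/// A batch of rows, serialized and shipped as a single unit.
struct DataTable {
    table_id: u64,
    rows: Vec<DataRow>,
}
```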


Then comes the new datastore itself (#1727, #1735, #1739, #1785, #1791, #1795, #1801), which builds upon these new types and gets the store's internals closer to the overarching data model.

The result is much faster query speeds and drastically reduced memory usage.
This new store also comes with a precise garbage collector that should never miss a single byte, meaning you can now use our memory limit feature (https://www.rerun.io/docs/howto/limit-ram) to stream in never-ending workloads.

Applications that put the most stress on the store will of course be the ones benefiting the most from these changes.
Since the store's performance scales with the number of events (i.e. log calls) being stored rather than with the size of the data, applications that use a lot of scalars/plots, text logs, range queries (e.g. the Visible History feature), and other workloads of that nature (i.e. many small events rather than a few large ones) will see the most drastic improvements.


To demonstrate all of this we can use our official clocks example, coupled with the Visible History feature, which is the ultimate stress test for our datastore.

Running the simulation for 50'000 frames, then replaying it at 180x speed with 1000 frames of visible history buffer for the minute hand of the clock 👇

Before: ~15ms per frame / ~4.5GiB of RAM required:

23-04-13_144446.patched.mp4

After: ~7ms per frame / ~920MiB of RAM required:

23-04-13_144915.patched.mp4

So, roughly a ~2x improvement in frame times and ~5x in memory usage!


emilk commented Apr 17, 2023

The win from not logging log_time is quite small. The RowId is 16 bytes, and the log_time column is 8 bytes, so even for zero-sized components the memory wins will be at most 33%. I'm not sure that justifies the added complexity at this point.
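(Spelling out that bound: a zero-sized component still costs 16 + 8 = 24 bytes of fixed overhead per row, so removing the log_time column saves at most 8 / 24 ≈ 33%, and proportionally less once actual component data is attached.)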
