The recent performance improvement got the C++ SDK a lot faster. But compared to Rust we're still behind for individual log calls (like time series scalars!).
An obvious candidate to improve is not building & sending the schema every time: Right now on every log call we convert the schema to C FFI and then create a Rust/arrow2 representation from it. Add a simple lazy schema registry/handle system for this!
We should do a bit more profiling though to get an idea of where the time goes. E.g. there are likely many other needless allocs along the way.
…ing a component type registry (#4296)
### What
* Fixes #4287
* Follow-up to #4273
As expected, not doing the C++ datatype -> C FFI schema -> Rust datatype
roundtrip for each log call helps perf quite a bit, especially when we
do a lot of smaller log calls.
The registry is a single `RwLock`-protected `Vec` (we never deregister), which
is exposed via a single C entry point.
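To illustrate the shape of such a registry, here is a minimal sketch. It is written in C++ with `std::shared_mutex` standing in for the Rust `RwLock`, so it is not the code from this PR; `ComponentType`, `ComponentTypeHandle`, and `ComponentTypeRegistry` are hypothetical names chosen for the example.

```cpp
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a registered component type. In the real SDK this
// would carry the converted datatype; here it is only a name.
struct ComponentType {
    std::string name;
};

// A handle is simply the index into the registry's vector.
using ComponentTypeHandle = std::uint32_t;

class ComponentTypeRegistry {
  public:
    // Appends a type and returns its index as the handle.
    // Entries are never removed, so a handle stays valid for the process lifetime.
    ComponentTypeHandle register_type(ComponentType type) {
        std::unique_lock<std::shared_mutex> lock(mutex_); // exclusive, but rare
        types_.push_back(std::move(type));
        return static_cast<ComponentTypeHandle>(types_.size() - 1);
    }

    // Returns a copy of the registered type. Only a shared lock is taken, so
    // concurrent readers do not block each other.
    ComponentType lookup(ComponentTypeHandle handle) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return types_.at(handle); // throws std::out_of_range on a bad handle
    }

  private:
    mutable std::shared_mutex mutex_;
    std::vector<ComponentType> types_;
};
```

Because nothing is ever deregistered, a plain index is enough as a handle and no generation counter or free list is needed.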
On the C++ side we use the local `static` variable mechanism for
threadsafe lazy registration (slight codegen adjustment).
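The local `static` mechanism can be sketched as follows. This is an illustration under assumptions, not the generated code: `register_component_type` is a hypothetical stand-in for the C entry point, and `ExampleComponent` and `g_register_calls` exist only for the example.

```cpp
#include <atomic>
#include <cstdint>

using ComponentTypeHandle = std::uint32_t;

// Counts how often the (hypothetical) registration entry point is called.
static std::atomic<int> g_register_calls{0};

// Hypothetical stand-in for the single C entry point of the registry.
// It hands out consecutive handles starting at 0.
static ComponentTypeHandle register_component_type(const char* /*name*/) {
    return static_cast<ComponentTypeHandle>(g_register_calls.fetch_add(1));
}

struct ExampleComponent {
    // Returns the handle for this component type, registering it on first use.
    static ComponentTypeHandle type_handle() {
        // A function-local static is initialized exactly once, on the first
        // call, and that initialization is thread-safe since C++11.
        static const ComponentTypeHandle handle =
            register_component_type("ExampleComponent");
        return handle;
    }
};
```

Every later call to `ExampleComponent::type_handle()` only reads the cached handle, so the schema conversion cost is paid once per type rather than once per log call.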
Indicator components had some special handling before and were
refactored to fit in this system - in the process I made their arrow
array shared across all instantiations, further cutting down on per-log
work.
---
Benchmark results:
* large point cloud: `0.15s` -> `0.14s`
* many points: `7.52s` -> `4.52s`
* large images: `0.57s` -> `0.51s`
Old values from previous PR. New values are median over three runs,
single executable run (this makes more and more of a difference with all
these registries!), timings without prepare step, same M1 macbook.
A quick look over the profiler for running `log_benchmark
points3d_many_individual` in isolation tells us that of the actual
benchmark running time we spend:
* 35% of the time in `rr_recording_stream_log` (of which in turn
20%, so 7% overall, is still arrow FFI translation of the array!!)
* 30% in the various `to_data_cell` methods
* 10% in exporting arrow arrays to C FFI
* 6% in setting the time
* the rest in various allocations along the way
(taken via `Instruments` on my Mac)
<img width="969" alt="image"
src="https://github.com/rerun-io/rerun/assets/1220815/5632589f-52b1-4e92-b7a0-1482e69528ad">
---
### Checklist
* [x] I have read and agree to [Contributor
Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and
the [Code of
Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)
* [x] I've included a screenshot or gif (if applicable)
* [x] I have tested [demo.rerun.io](https://demo.rerun.io/pr/4296) (if
applicable)
* [x] The PR title and labels are set such as to maximize their
usefulness for the next release's CHANGELOG
- [PR Build Summary](https://build.rerun.io/pr/4296)
- [Docs
preview](https://rerun.io/preview/8bf1ee59d9a2bc5e192c1c8169c98dd40b621100/docs)
<!--DOCS-PREVIEW-->
- [Examples
preview](https://rerun.io/preview/8bf1ee59d9a2bc5e192c1c8169c98dd40b621100/examples)
<!--EXAMPLES-PREVIEW-->
- [Recent benchmark results](https://build.rerun.io/graphs/crates.html)
- [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)