# Migrate from re_arrow2 to arrow #3741
This PR introduces a new crate: `re_types_core`.

`re_types_core` only contains the fundamental traits and types that make up Rerun's data model. It is split off from the existing `re_types`. This makes it possible to work with our data model abstractions without having to depend on the `re_types` behemoth.

This is more than a DX improvement: since so many things depend directly or indirectly on `re_types`, it is very easy to end up with unsolvable dependency cycles. This helps with that in some cases (though certainly not all). In particular, `re_tuid` (and by extension `re_format`) is now completely free of `re_types`.

For convenience, `re_types` re-exports all of `re_types_core`, so the public API looks unchanged.

In a handful of instances (`re_arrow_store`, `re_data_store`, `re_log_types`, `re_query`), I've gone the extra mile and started porting these crates towards raw `re_types_core` rather than relying on the re-exports. The reason is that, upon closer inspection, these crates are very close to being able to live free of `re_types`. In the future, the custom crate and custom module attributes coming with #3741 might allow us to make these independent.

Similarly, the codegen now uses `re_types_core` directly, as that makes the life of the upcoming "serde-codegen" work much easier.
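As a rough sketch of the re-export arrangement described above (illustrative; not the literal contents of the crate):

```rust
// Hypothetical, simplified view of the split described above.
// In crates/re_types/src/lib.rs: pull everything from the new core crate back
// into the `re_types` root, so existing `use re_types::…` paths keep working,
// while crates that only need the data-model traits can depend on
// `re_types_core` directly.
pub use re_types_core::*;
```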
**Commit by commit**

This is necessary refactoring work for the upcoming `attr.rust.custom_crate` attribute, itself necessary for the upcoming serde-codegen support, itself necessary for the upcoming blueprint experimentations as well as #3741.

### Changes

1. The `CodeGenerator` trait as well as all post-processing helpers (gitattributes, orphan detection...) are now I/O-free.
   ```rust
   pub type GeneratedFiles = std::collections::BTreeMap<camino::Utf8PathBuf, String>;

   pub trait CodeGenerator {
       fn generate(
           &mut self,
           reporter: &crate::Reporter,
           objects: &crate::Objects,
           arrow_registry: &crate::ArrowRegistry,
       ) -> GeneratedFiles;
   }
   ```
2. All post-processing helpers are now agnostic to the output location. This is very important as it makes it possible to generate e.g. Rust code outside of the `re_types` crate without everything crumbling down. A side-effect is that gitattributes files are now finer-grained.
3. The Rust codegen pass is now crate-agnostic: it is driven by the workspace path rather than a specific crate path. Necessary for the upcoming `attr.rust.custom_crate`.
4. All codegen passes now follow the exact same 4-step structure:
   ```
   // 1. Generate in-memory code files.
   let mut gen = MyGenerator::new();
   let mut files = gen.generate(reporter, objects, arrow_registry);

   // 2. Generate in-memory attribute files.
   generate_gitattributes_for_generated_files(&mut files);

   // 3. Write all in-memory files to disk.
   write_files(&gen.pkg_path, &gen.testing_pkg_path, &files);

   // 4. Remove orphaned files.
   crate::codegen::common::remove_orphaned_files(reporter, &files);
   ```
5. The documentation codegen pass now removes its orphans, which is why some `.md` files were removed in this PR.

---

- Unblocks #3741
- Unblocks #3495
Using that, we can start this migration piece-wise. It would double our dependencies for a transitional period, leading to longer compilation times and a bigger .wasm binary, but I think that is an ok tradeoff.

Potential roadmap:

After de-chunkification:
As of 2024-07-08, there are only around 300 lines of Rust referencing the string `arrow2{-convert}` (ignored paths: `crates/re_types/**`, `crates/re_types_core/src/archetypes/**`, `crates/re_types_core/src/datatypes/**`, `crates/re_types_core/src/components/**`, `crates/re_types_blueprint/src/blueprint/components/**`, `crates/re_types_blueprint/src/blueprint/archetypes/**`).
It doesn't make any sense for a `ComponentBatch` to have any say in what the final `ArrowField` should look like. An `ArrowField` is a `Chunk`/`RecordBatch`/`Schema`-level concern that only makes sense during IO/transport/FFI/storage/etc, and which requires external context that a single `ComponentBatch` on its own has no idea of.

---

Part of a lot of cleanup I want to do while we head towards:
* #7245
* #3741
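To make that separation concrete, here is a minimal sketch (the names `ComponentBatchLike` and `to_record_batch` are made up for illustration, and the arrow-rs `arrow` crate is assumed): the batch only produces data, and the `Field`s are assembled at the `RecordBatch`/`Schema` level, where the transport context actually lives.

```rust
use std::sync::Arc;

use arrow::array::ArrayRef;
use arrow::datatypes::{Field, Schema};
use arrow::record_batch::RecordBatch;

/// Hypothetical stand-in for a `ComponentBatch`: it only knows how to
/// serialize its own data, and has no say in the final `Field`.
trait ComponentBatchLike {
    fn to_arrow(&self) -> ArrayRef;
    fn name(&self) -> &str;
}

/// Field-level concerns (names, nullability, metadata) are decided here,
/// where the `RecordBatch`/`Schema`-level context is known.
fn to_record_batch(batches: &[&dyn ComponentBatchLike]) -> arrow::error::Result<RecordBatch> {
    let (fields, columns): (Vec<Field>, Vec<ArrayRef>) = batches
        .iter()
        .map(|b| {
            let column = b.to_arrow();
            let field = Field::new(b.name(), column.data_type().clone(), true);
            (field, column)
        })
        .unzip();

    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```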
New blocker:
Also added a regression test.

* Part of #3741

---------

Co-authored-by: Clement Rey <cr.rey.clement@gmail.com>
### Related
* Part of #3741

### Details
Adds crate `re_arrow_util`.

Adds two traits for downcasting `arrow` and `arrow2` arrays in such a way that we cannot accidentally cast one into the other. This will be very important for the arrow migration. It also makes the code shorter.
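A minimal sketch, with assumed names, of the idea behind such a pair of traits: one extension trait per arrow flavor, so a call site has to spell out which library's array it expects and the two can never be mixed up silently.

```rust
/// Downcasting helper that only exists for `arrow` (arrow-rs) arrays.
pub trait ArrowArrayDowncastRef {
    fn downcast_array_ref<T: arrow::array::Array + 'static>(&self) -> Option<&T>;
}

impl ArrowArrayDowncastRef for dyn arrow::array::Array {
    fn downcast_array_ref<T: arrow::array::Array + 'static>(&self) -> Option<&T> {
        self.as_any().downcast_ref::<T>()
    }
}

/// The equivalent helper for `arrow2` arrays lives on a *different* trait with
/// a different method name, so an `arrow` array cannot accidentally be
/// downcast as an `arrow2` one (or vice versa).
pub trait Arrow2ArrayDowncastRef {
    fn downcast_array2_ref<T: arrow2::array::Array + 'static>(&self) -> Option<&T>;
}

impl Arrow2ArrayDowncastRef for dyn arrow2::array::Array {
    fn downcast_array2_ref<T: arrow2::array::Array + 'static>(&self) -> Option<&T> {
        self.as_any().downcast_ref::<T>()
    }
}
```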
* Part of #3741

This will make it easier to switch out `TransportChunk` for `RecordBatch`.
* Part of #3741

Will make it easier to switch out the `TransportChunk` for a `RecordBatch` and still have the same test output.
* Part of #3741
* [x] Tested that it does not regress #8668

This makes `TransportChunk` a wrapper around an arrow `RecordBatch`.

### Future work
* Remove `TransportChunk` and replace it with an extension trait on `RecordBatch`
* Simplify the dataframe API to always return a full `RecordBatch` (adding a schema to the rows is basically free now)
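A minimal sketch of the wrapper shape described in the comment above (not the actual implementation; the field and method names here are assumptions): the wrapped `RecordBatch` is the single source of truth, and accessors borrow from it instead of keeping separate arrow2 columns around.

```rust
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;

/// Hypothetical, heavily simplified `TransportChunk`.
pub struct TransportChunk {
    batch: RecordBatch,
}

impl TransportChunk {
    pub fn new(batch: RecordBatch) -> Self {
        Self { batch }
    }

    /// The schema (and its metadata) lives on the wrapped batch.
    pub fn schema(&self) -> SchemaRef {
        self.batch.schema()
    }

    pub fn num_rows(&self) -> usize {
        self.batch.num_rows()
    }

    /// Once this shape is in place, dropping the wrapper in favor of an
    /// extension trait on `RecordBatch` (as listed under "Future work") is a
    /// mechanical change.
    pub fn into_record_batch(self) -> RecordBatch {
        self.batch
    }
}
```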
* Part of #3741

Just one small piece at a time.
### Blockers
* `shrink_to_fit` / `toArray`: apache/arrow-rs#6360 (for lowering memory use)
* `DataType::Extension` (for `Tuid`)

Multiple end-goals:
* `RERUN:component_name` (Clean up Arrow extension hell, implement `RERUN:component_name` #3360)
* `half` for `f16`
TODO (split into sub-issues as needed):
* `arrow1` `SizeBytes` to own crate
* `fn arrow_ui` to `re_ui`
* `PendingRow` to `arrow-rs` #8617
* `TimeColumn` / `TransportChunk` to `arrow-rs` #8700
* dataplatform repo
* `re_types_core`
* `re_dataframe`
* `arrow2::DataType` in `re_types_builder`
* `Chunk`
* `ChunkBuilder`
* `arrow2` (codegen, data{cell,row,table}, `ArrowBuffer`, etc)
* `ArrowBuffer` wrapper to expose `arrow-rs` buffers rather than `arrow2` #2978
* `RERUN:component_name` (Clean up Arrow extension hell, implement `RERUN:component_name` #3360); see the field-metadata sketch after this list
* `DataCell::component_name`
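Regarding the `RERUN:component_name` items above: a hedged sketch of one way such a tag can travel without arrow2-style extension datatypes is plain key/value metadata on the arrow-rs `Field`. The key string is taken from the list above; the helper below is purely illustrative, not the actual scheme.

```rust
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field};

/// Illustrative helper: attach the component name as field metadata instead of
/// encoding it in an extension datatype.
fn tagged_field(component_name: &str, data_type: DataType) -> Field {
    Field::new(component_name, data_type, true).with_metadata(HashMap::from([(
        "RERUN:component_name".to_owned(),
        component_name.to_owned(),
    )]))
}
```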
On the way there we might hit a few bumps because we have a lot of redundant ad-hoc code that integrates with `polars` (which is built on top of `arrow2`).

The solution to this is to make sure we only integrate with `polars` in one single place: the `Data{Cell,Row,Table}` layer (#1692).

Once that's done, we can remove all ad-hoc polars code everywhere and just build a `Data{Row,Cell,Table}` anytime we want a `polars::Series` / `polars::DataFrame` (#1759).

Internally, the conversion from `DataTable` to `polars::DataFrame` will require a zero-copy tri-stage conversion from `arrow1` -> `arrow2` -> `polars`.

`arrow2` does _not_ refcount schema metadata #1805