# Migrate from re_arrow2 to arrow #3741
This PR introduces a new crate: `re_types_core`.

`re_types_core` only contains the fundamental traits and types that make up Rerun's data model. It is split off from the existing `re_types`. This makes it possible to work with our data model abstractions without having to depend on the `re_types` behemoth.

This is more than a DX improvement: since so many things depend directly or indirectly on `re_types`, it is very easy to end up with unsolvable dependency cycles. This helps with that in some cases (though certainly not all). In particular, `re_tuid` (and by extension `re_format`) is now completely free of `re_types`.

For convenience, `re_types` re-exports all of `re_types_core`, so the public API looks unchanged.

In a handful of instances (`re_arrow_store`, `re_data_store`, `re_log_types`, `re_query`), I've gone the extra mile and started porting these crates towards raw `re_types_core` rather than relying on the re-exports. The reason is that, upon closer inspection, these crates are very close to being able to live free of `re_types`. In the future, the custom crate and custom module attributes coming with #3741 might allow us to make these independent.

Similarly, the codegen now uses `re_types_core` directly, as that makes the life of the upcoming "serde-codegen" work much easier.
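As a rough sketch of the re-export arrangement described above (illustrative; not the literal contents of the crate):

```rust
// Hypothetical, simplified view of the split described above.
// In crates/re_types/src/lib.rs: pull everything from the new core crate back
// into the `re_types` root, so existing `use re_types::…` paths keep working,
// while crates that only need the data-model traits can depend on
// `re_types_core` directly.
pub use re_types_core::*;
```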
**Commit by commit**

This is necessary refactoring work for the upcoming `attr.rust.custom_crate` attribute, itself necessary for the upcoming serde-codegen support, itself necessary for the upcoming blueprint experimentations as well as #3741.

### Changes

1. The `CodeGenerator` trait as well as all post-processing helpers (gitattributes, orphan detection...) are now I/O-free.
   ```rust
   pub type GeneratedFiles = std::collections::BTreeMap<camino::Utf8PathBuf, String>;

   pub trait CodeGenerator {
       fn generate(
           &mut self,
           reporter: &crate::Reporter,
           objects: &crate::Objects,
           arrow_registry: &crate::ArrowRegistry,
       ) -> GeneratedFiles;
   }
   ```
2. All post-processing helpers are now agnostic to the output location. This is very important as it makes it possible to generate e.g. Rust code outside of the `re_types` crate without everything crumbling down. A side-effect is that gitattributes files are now finer-grained.
3. The Rust codegen pass is now crate-agnostic: it is driven by the workspace path rather than a specific crate path. Necessary for the upcoming `attr.rust.custom_crate`.
4. All codegen passes now follow the exact same 4-step structure:
   ```
   // 1. Generate in-memory code files.
   let mut gen = MyGenerator::new();
   let mut files = gen.generate(reporter, objects, arrow_registry);

   // 2. Generate in-memory attribute files.
   generate_gitattributes_for_generated_files(&mut files);

   // 3. Write all in-memory files to disk.
   write_files(&gen.pkg_path, &gen.testing_pkg_path, &files);

   // 4. Remove orphaned files.
   crate::codegen::common::remove_orphaned_files(reporter, &files);
   ```
5. The documentation codegen pass now removes its orphans, which is why some `.md` files were removed in this PR.

---

- Unblocks #3741
- Unblocks #3495
Using that, we can start this migration piece-wise. It would double our dependencies for a transitional period, leading to longer compilation times and a bigger .wasm binary, but I think that is an ok tradeoff.

Potential roadmap:

After de-chunkification:
As of 2024-07-08, there are only around 300 lines of Rust referencing the string `arrow2{-convert}` (ignored paths: `crates/re_types/**`, `crates/re_types_core/src/archetypes/**`, `crates/re_types_core/src/datatypes/**`, `crates/re_types_core/src/components/**`, `crates/re_types_blueprint/src/blueprint/components/**`, `crates/re_types_blueprint/src/blueprint/archetypes/**`).
It doesn't make any sense for a `ComponentBatch` to have any say in what the final `ArrowField` should look like. An `ArrowField` is a `Chunk`/`RecordBatch`/`Schema`-level concern that only makes sense during IO/transport/FFI/storage/etc, and which requires external context that a single `ComponentBatch` on its own has no idea of.

---

Part of a lot of cleanup I want to do while we head towards:
* #7245
* #3741
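To make that separation concrete, here is a minimal sketch (the names `ComponentBatchLike` and `to_record_batch` are made up for illustration, and the arrow-rs `arrow` crate is assumed): the batch only produces data, and the `Field`s are assembled at the `RecordBatch`/`Schema` level, where the transport context actually lives.

```rust
use std::sync::Arc;

use arrow::array::ArrayRef;
use arrow::datatypes::{Field, Schema};
use arrow::record_batch::RecordBatch;

/// Hypothetical stand-in for a `ComponentBatch`: it only knows how to
/// serialize its own data, and has no say in the final `Field`.
trait ComponentBatchLike {
    fn to_arrow(&self) -> ArrayRef;
    fn name(&self) -> &str;
}

/// Field-level concerns (names, nullability, metadata) are decided here,
/// where the `RecordBatch`/`Schema`-level context is known.
fn to_record_batch(batches: &[&dyn ComponentBatchLike]) -> arrow::error::Result<RecordBatch> {
    let (fields, columns): (Vec<Field>, Vec<ArrayRef>) = batches
        .iter()
        .map(|b| {
            let column = b.to_arrow();
            let field = Field::new(b.name(), column.data_type().clone(), true);
            (field, column)
        })
        .unzip();

    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```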
New blocker:
Also added a regression test.

* Part of #3741

---------

Co-authored-by: Clement Rey <cr.rey.clement@gmail.com>
### Related
* Part of #3741

### Details
Adds crate `re_arrow_util`.

Adds two traits for downcasting `arrow` and `arrow2` arrays in such a way that we cannot accidentally cast one into the other. This will be very important for the arrow migration. It also makes the code shorter.
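A minimal sketch, with assumed names, of the idea behind such a pair of traits: one extension trait per arrow flavor, so a call site has to spell out which library's array it expects and the two can never be mixed up silently.

```rust
/// Downcasting helper that only exists for `arrow` (arrow-rs) arrays.
pub trait ArrowArrayDowncastRef {
    fn downcast_array_ref<T: arrow::array::Array + 'static>(&self) -> Option<&T>;
}

impl ArrowArrayDowncastRef for dyn arrow::array::Array {
    fn downcast_array_ref<T: arrow::array::Array + 'static>(&self) -> Option<&T> {
        self.as_any().downcast_ref::<T>()
    }
}

/// The equivalent helper for `arrow2` arrays lives on a *different* trait with
/// a different method name, so an `arrow` array cannot accidentally be
/// downcast as an `arrow2` one (or vice versa).
pub trait Arrow2ArrayDowncastRef {
    fn downcast_array2_ref<T: arrow2::array::Array + 'static>(&self) -> Option<&T>;
}

impl Arrow2ArrayDowncastRef for dyn arrow2::array::Array {
    fn downcast_array2_ref<T: arrow2::array::Array + 'static>(&self) -> Option<&T> {
        self.as_any().downcast_ref::<T>()
    }
}
```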
* Part of #3741

This will make it easier to switch out `TransportChunk` for `RecordBatch`.
* Part of #3741

Will make it easier to switch out the `TransportChunk` for a `RecordBatch` and still have the same test output.
* Part of #3741
* [x] Tested that it does not regress #8668

This makes `TransportChunk` a wrapper around an arrow `RecordBatch`.

### Future work
* Remove `TransportChunk` and replace it with an extension trait on `RecordBatch`
* Simplify the dataframe API to always return a full `RecordBatch` (adding a schema to the rows is basically free now)
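A minimal sketch of the wrapper shape described in the comment above (not the actual implementation; the field and method names here are assumptions): the wrapped `RecordBatch` is the single source of truth, and accessors borrow from it instead of keeping separate arrow2 columns around.

```rust
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;

/// Hypothetical, heavily simplified `TransportChunk`.
pub struct TransportChunk {
    batch: RecordBatch,
}

impl TransportChunk {
    pub fn new(batch: RecordBatch) -> Self {
        Self { batch }
    }

    /// The schema (and its metadata) lives on the wrapped batch.
    pub fn schema(&self) -> SchemaRef {
        self.batch.schema()
    }

    pub fn num_rows(&self) -> usize {
        self.batch.num_rows()
    }

    /// Once this shape is in place, dropping the wrapper in favor of an
    /// extension trait on `RecordBatch` (as listed under "Future work") is a
    /// mechanical change.
    pub fn into_record_batch(self) -> RecordBatch {
        self.batch
    }
}
```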
* Part of #3741

Just one small piece at a time.
### Blockers
* `shrink_to_fit` / `toArray`: apache/arrow-rs#6360 (for lowering memory use)
* `DataType::Extension` (for `Tuid`)

Multiple end-goals:
* `RERUN:component_name` (Clean up Arrow extension hell, implement `RERUN:component_name` #3360)
* `half` for `f16`
TODO (split into sub-issues as needed):
* `arrow1` `SizeBytes` to own crate
* `fn arrow_ui` to `re_ui`
* `PendingRow` to `arrow-rs` #8617
* `TimeColumn` / `TransportChunk` to `arrow-rs` #8700
* dataplatform repo
* `re_types_core`
* `re_dataframe`
* `arrow2::DataType` in `re_types_builder`
* `Chunk`
* `ChunkBuilder`
* `arrow2` (codegen, data{cell,row,table}, `ArrowBuffer`, etc)
* `ArrowBuffer` wrapper to expose `arrow-rs` buffers rather than `arrow2` #2978
* `RERUN:component_name` (Clean up Arrow extension hell, implement `RERUN:component_name` #3360); see the field-metadata sketch after this list
* `DataCell::component_name`
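Regarding the `RERUN:component_name` items above: a hedged sketch of one way such a tag can travel without arrow2-style extension datatypes is plain key/value metadata on the arrow-rs `Field`. The key string is taken from the list above; the helper below is purely illustrative, not the actual scheme.

```rust
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field};

/// Illustrative helper: attach the component name as field metadata instead of
/// encoding it in an extension datatype.
fn tagged_field(component_name: &str, data_type: DataType) -> Field {
    Field::new(component_name, data_type, true).with_metadata(HashMap::from([(
        "RERUN:component_name".to_owned(),
        component_name.to_owned(),
    )]))
}
```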
On the way there we might hit a few bumps because we have a lot of redundant ad-hoc code that integrates with `polars` (which is built on top of `arrow2`).

The solution to this is to make sure we only integrate with `polars` in one single place: the `Data{Cell,Row,Table}` layer (#1692).

Once that's done, we can remove all ad-hoc polars code everywhere and just build a `Data{Row,Cell,Table}` anytime we want a `polars::Series` / `polars::DataFrame` (#1759).

Internally, the conversion from `DataTable` to `polars::DataFrame` will require a zero-copy tri-stage conversion from `arrow1` -> `arrow2` -> `polars`.

`arrow2` does _not_ refcount schema metadata #1805