Add a How-to guide for the dataframe API (#7727)
### What

This PR introduces a How-to guide for the dataframe API which includes:

- a reference-style coverage of all dataframe API features
- recipes to ingest data in a pyarrow table, pandas df, polars df, or
duckdb relation

Needs this to pass CI:
- #7720 

### Checklist
* [x] I have read and agree to [Contributor
Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and
the [Code of
Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)
* [x] I've included a screenshot or gif (if applicable)
* [x] I have tested the web demo (if applicable):
* Using examples from latest `main` build:
[rerun.io/viewer](https://rerun.io/viewer/pr/7727?manifest_url=https://app.rerun.io/version/main/examples_manifest.json)
* Using full set of examples from `nightly` build:
[rerun.io/viewer](https://rerun.io/viewer/pr/7727?manifest_url=https://app.rerun.io/version/nightly/examples_manifest.json)
* [x] The PR title and labels are set such as to maximize their
usefulness for the next release's CHANGELOG
* [x] If applicable, add a new check to the [release
checklist](https://github.com/rerun-io/rerun/blob/main/tests/python/release_checklist)!
* [x] I have noted any breaking changes to the log API in
`CHANGELOG.md` and the migration guide

- [PR Build Summary](https://build.rerun.io/pr/7727)
- [Recent benchmark results](https://build.rerun.io/graphs/crates.html)
- [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)

To run all checks from `main`, comment on the PR with `@rerun-bot
full-check`.

---------

Co-authored-by: Andreas Reich <r_andreas2@web.de>
abey79 and Wumpf authored Oct 15, 2024
1 parent 8fccf39 commit 6535073
Showing 4 changed files with 247 additions and 4 deletions.
1 change: 1 addition & 0 deletions docs/content/howto.md
@@ -16,3 +16,4 @@ Guides for using Rerun in more advanced ways.
- [By logging custom data](howto/extend/custom-data.md)
- [By implementing custom visualizations (Rust only)](howto/extend/extend-ui.md)
- [Efficiently log time series data using `send_columns`](howto/send_columns.md)
- [Get data out from Rerun with code](howto/dataframe-api.md)
228 changes: 228 additions & 0 deletions docs/content/howto/dataframe-api.md
@@ -0,0 +1,228 @@
---
title: Get data out from Rerun with code
order: 1600
---

Rerun comes with a Dataframe API, which enables getting data out of Rerun from code. This page provides an overview of the API, as well as recipes to load the data in popular packages such as [Pandas](https://pandas.pydata.org), [Polars](https://pola.rs), and [DuckDB](https://duckdb.org).

<!-- TODO(#7499): add links to the Python SDK documentation where appropriate -->

## The dataframe API

### Loading a recording

A recording can be loaded from an RRD file using the `load_recording()` function:

```python
import rerun as rr

recording = rr.dataframe.load_recording("/path/to/file.rrd")
```

Although RRD files generally contain a single recording, they may occasionally contain 2 or more. This can happen, for example, if the RRD includes a blueprint, which is stored as a recording that is separate from the data.

For such RRDs, the `load_archive()` function can be used:


<!-- NOLINT_START -->
```python
import rerun as rr

archive = rr.dataframe.load_archive("/path/to/file.rrd")

print(f"The archive contains {archive.num_recordings()} recordings.")

for recording in archive.all_recordings():
    ...
```
<!-- NOLINT_END -->

The overall content of the recording can be inspected using the `schema()` method:

```python
schema = recording.schema()
schema.index_columns() # list of all index columns (timelines)
schema.component_columns() # list of all component columns
```


### Creating a view

The first step for getting data out of a recording is to create a view, which requires specifying an index column and what content to include.

As of Rerun 0.19, views must have exactly one index column, which can be any of the recording timelines.
Each row of the view will correspond to a unique value of the index column.
If a row has a `null` in the returned index (time) column, it means that the data was static.
In the future, it will be possible to use other kinds of columns as the index, and to have more than a single index column.

The `contents` define which columns are included in the view and can be flexibly specified as an entity expression,
optionally providing a corresponding list of components.

These are all valid ways to specify view content:

```python
# everything in the recording
view = recording.view(index="frame_nr", contents="/**")

# everything in the recording, except the /world/robot subtree
view = recording.view(index="frame_nr", contents="/**\n- /world/robot/**")

# all `Scalar` components in the recording
view = recording.view(index="frame_nr", contents={"/**": ["Scalar"]})

# some components in an entity subtree and a specific component
# of a specific entity
view = recording.view(index="frame_nr", contents={
    "/world/robot/**": ["Position3D", "Color"],
    "/world/scene": ["Text"],
})
```

### Filtering rows in a view

A view has several APIs to further filter the rows it will return.

<!-- TODO(rerun-io/landing#521): change these headers to h4 when these are properly supported -->

**Filtering by time range**

Rows may be filtered to keep only a given range of values from the view's index column:

```python
# only keep rows for frames 0 to 10
view = view.filter_range_sequence(0, 10)
```

This API exists for both temporal and sequence timelines, and for various units:
- `view.filter_range_sequence(start_frame, end_frame)` (takes `int` arguments)
- `view.filter_range_seconds(start_second, end_second)` (takes `float` arguments)
- `view.filter_range_nanos(start_nano, end_nano)` (takes `int` arguments)

(All ranges include both the start and end values.)
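
The inclusive bounds can be illustrated with a minimal pure-Python sketch. It does not call Rerun: `filter_range_inclusive` is a hypothetical helper that only mimics, on a plain list of index values, the behavior described above.

```python
def filter_range_inclusive(index_values, start, end):
    """Keep the index values in [start, end], with both bounds included."""
    return [v for v in index_values if start <= v <= end]


# Hypothetical frame numbers present in a recording.
frames = list(range(0, 21, 5))  # 0, 5, 10, 15, 20

kept = filter_range_inclusive(frames, 0, 10)
print(kept)  # [0, 5, 10]: both frame 0 and frame 10 are kept
```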

**Filtering by index value**

Rows may be filtered to keep only those whose index corresponds to a specific set of values:

```python
view = view.filter_index_values([0, 5, 10])
```

Note that a precise match is required.
Since Rerun internally stores times as `int64`, this method is only available for integer arguments (nanos or sequence number).
Floating point seconds would risk false mismatch due to numerical conversion.
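
The mismatch risk can be seen without Rerun. The sketch below is an illustration only: `seconds_to_nanos` is a hypothetical helper (not part of the Rerun API) that rounds float seconds to the nearest integer nanosecond, which is one possible way to build integer index values from times expressed in seconds.

```python
def seconds_to_nanos(seconds: float) -> int:
    """Hypothetical helper: round float seconds to the nearest integer nanosecond."""
    return round(seconds * 1_000_000_000)


a = 0.1 + 0.2
b = 0.3

# The two floats differ in their last bits, so an exact float match fails...
print(a == b)  # False

# ...while the integer nanosecond values are identical.
print(seconds_to_nanos(a) == seconds_to_nanos(b))  # True

# Integer index values built from times given in seconds.
index_values = [seconds_to_nanos(s) for s in (0.0, 0.5, 1.5)]
print(index_values)  # [0, 500000000, 1500000000]
```

Integers of this kind are what `filter_index_values()` expects on a temporal timeline.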


**Filtering by column not null**

Rows where a specific column has null values may be filtered out using the `filter_is_not_null()` method. When using this method, only rows for which a logging event exists for the provided column are returned.

```python
# only keep rows where a position is available for the robot
view = view.filter_is_not_null(rr.dataframe.ComponentColumnSelector("/world/robot", "Position3D"))
```

### Specifying rows

Instead of filtering rows based on the existing data, it is possible to specify exactly which rows must be returned by the view using the `using_index_values()` method:

```python
# resample the first second of data at every millisecond
view = view.using_index_values(range(0, 1_000_000_000, 1_000_000))
```

In this case, the view will return rows in multiples of 1e6 nanoseconds (i.e. for each millisecond) over a period of one second.
A precise match on the index value is required for data to be produced on the row.
For this reason, a floating point version of this method is not provided for this feature.

Note that this feature is typically used in conjunction with `fill_latest_at()` (see the next section) to enable arbitrary resampling of the original data.
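
The exact-match requirement can be sketched in plain Python. Nothing here calls Rerun: `index_values_nanos` and the `logged` dictionary are hypothetical stand-ins, used only to show that a requested row carries data when, and only when, something was logged at exactly that index value.

```python
def index_values_nanos(duration_s: int, rate_hz: int) -> range:
    """Evenly spaced integer nanosecond values covering [0, duration_s)."""
    step = 1_000_000_000 // rate_hz
    return range(0, duration_s * 1_000_000_000, step)


# Hypothetical logged data for one column: index value (nanos) -> value.
logged = {0: "a", 2_000_000: "b", 2_500_000: "c"}

# One requested row per millisecond over the first second.
requested = index_values_nanos(duration_s=1, rate_hz=1000)
rows = [(t, logged.get(t)) for t in requested]

print(len(rows))  # 1000
print(rows[:4])   # [(0, 'a'), (1000000, None), (2000000, 'b'), (3000000, None)]
```

The value logged at 2.5 ms never appears in the output because that index value was not requested, and most requested rows are empty because nothing was logged at exactly those times.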


### Filling empty values with latest-at data

By default, the rows returned by the view may be sparse and contain values only for the columns where a logging event actually occurred at the corresponding index value.
The view can optionally replace these empty cells using a latest-at query. This means that, for each such empty cell, the view traces back to find the last logged value and uses it instead. This is enabled by calling the `fill_latest_at()` method:

```python
view = view.fill_latest_at()
```
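
As a rough illustration of latest-at semantics, the following pure-Python sketch fills empty cells with the most recent value logged at or before each index value. It is not how Rerun implements the query: `latest_at` and the sample data are hypothetical and only model the behavior described above for a single column.

```python
from bisect import bisect_right


def latest_at(times, values, query_time):
    """Return the value logged at or before query_time, or None if there is none."""
    i = bisect_right(times, query_time)
    return values[i - 1] if i > 0 else None


# Hypothetical logged data for one column, sorted by index value.
times = [2, 5, 9]
values = ["a", "b", "c"]
exact = dict(zip(times, values))

queries = [0, 2, 3, 5, 8, 10]
sparse = [exact.get(t) for t in queries]                   # without filling
filled = [latest_at(times, values, t) for t in queries]    # with latest-at filling

print(sparse)  # [None, 'a', None, 'b', None, None]
print(filled)  # [None, 'a', 'a', 'b', 'b', 'c']
```

The first cell stays empty in both cases because nothing was logged at or before index value 0.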

### Reading the data

Once the view is fully set up (possibly using the filtering features previously described), its content can be read using the `select()` method. This method optionally allows specifying which subset of columns should be produced:


```python
# select all columns
record_batches = view.select()

# select only the specified columns
record_batches = view.select(
    [
        rr.dataframe.IndexColumnSelector("frame_nr"),
        rr.dataframe.ComponentColumnSelector("/world/robot", "Position3D"),
    ],
)
```

The `select()` method returns a [`pyarrow.RecordBatchReader`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html), which is essentially an iterator over a stream of [`pyarrow.RecordBatch`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow-recordbatch)es containing the actual data. See the [PyArrow documentation](https://arrow.apache.org/docs/python/index.html) for more information.

For the rest of this page, we explore how these `RecordBatch`es can be ingested in some of the popular data science packages.


## Load data to a PyArrow `Table`

The `RecordBatchReader` provides a [`read_all()`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html#pyarrow.RecordBatchReader.read_all) method which directly produces a [`pyarrow.Table`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table):

```python
import rerun as rr

recording = rr.dataframe.load_recording("/path/to/file.rrd")
view = recording.view(index="frame_nr", contents="/**")

table = view.select().read_all()
```


## Load data to a Pandas dataframe

The `RecordBatchReader` provides a [`read_pandas()`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html#pyarrow.RecordBatchReader.read_pandas) method which returns a [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html):


```python
import rerun as rr

recording = rr.dataframe.load_recording("/path/to/file.rrd")
view = recording.view(index="frame_nr", contents="/**")

df = view.select().read_pandas()
```

## Load data to a Polars dataframe

A [Polars dataframe](https://docs.pola.rs/api/python/stable/reference/dataframe/index.html) can be created from a PyArrow table:

```python
import rerun as rr
import polars as pl

recording = rr.dataframe.load_recording("/path/to/file.rrd")
view = recording.view(index="frame_nr", contents="/**")

df = pl.from_arrow(view.select().read_all())
```


## Load data to a DuckDB relation

A [DuckDB](https://duckdb.org) relation can be created directly using the `pyarrow.RecordBatchReader` returned by `select()`:

```python
import rerun as rr
import duckdb

recording = rr.dataframe.load_recording("/path/to/file.rrd")
view = recording.view(index="frame_nr", contents="/**")

rel = duckdb.arrow(view.select())
```
18 changes: 15 additions & 3 deletions rerun_py/rerun_bindings/rerun_bindings.pyi
@@ -43,15 +43,27 @@ class RecordingView:
"""

def filter_range_sequence(self, start: int, end: int) -> RecordingView:
"""Filter the view to only include data between the given index sequence numbers."""
"""
Filter the view to only include data between the given index sequence numbers.
This is including both the value at the start and the value at the end.
"""
...

def filter_range_seconds(self, start: float, end: float) -> RecordingView:
"""Filter the view to only include data between the given index time values."""
"""
Filter the view to only include data between the given index time values.
This is including both the value at the start and the value at the end.
"""
...

def filter_range_nanos(self, start: int, end: int) -> RecordingView:
"""Filter the view to only include data between the given index time values."""
"""
Filter the view to only include data between the given index time values.
This is including both the value at the start and the value at the end.
"""
...

def filter_index_values(self, values: IndexValuesLike) -> RecordingView:
4 changes: 3 additions & 1 deletion scripts/lint.py
@@ -688,8 +688,10 @@ def lint_workspace_lints(cargo_file_content: str) -> str | None:
"ML",
"Numpy",
"nuScenes",
"Pixi",
"Pandas",
"PDF",
"Pixi",
"Polars",
"Python",
"Q1",
"Q2",