Proper viewer support for dataframes #8443

Open
jleibs opened this issue Dec 12, 2024 · 0 comments
Labels
feat-dataframe-view Everything related to the dataframe view 📺 re_viewer affects re_viewer itself

Comments

jleibs commented Dec 12, 2024

Context

We now have two main types of data:

Recording-like

  • Recording-like data is our historical Rerun "logged" concept.
  • Data in a single recording is divided into "Rerun Chunks".
  • Unlike an Arrow-IPC Stream, each Chunk in a recording stream is allowed to have a different schema.
  • Rerun chunks depend on the existence of certain required columns related to rows and indexes:
    • -> Every Rerun Chunk is a valid RecordBatch, but not every RecordBatch is a valid Rerun Chunk
  • A Rerun ChunkStore indexes these chunks and allows for flexible querying operations that return chunks.

Table-like

  • Table-like data has started showing up in our new APIs where we want to map things to a single dataframe.
    • Query Results
    • Catalog
  • This exactly matches the traditional Arrow table concept: a single schema shared by every batch
  • Any Table-like data exposed as a user-facing python API should map to a pa.RecordBatchReader

Improved Viewer Support

In principle, any dataframe can be converted to an equivalent set of Rerun chunks by:

  • Injecting a row-id based timeline if one doesn't exist already
  • Splitting apart columns that belong to separate entities
  • Wrapping any non-list types as arrow list arrays.

However, the question is where to apply this transformation. Doing it on the viewer-ingest side (rather than the send side) would simplify both the logging code and the data-platform implementation.

Proposal

For incoming client streams (e.g. TCP, notebook, etc.)

  • Some header in the stream (maybe part of StoreInfo) should determine whether a stream is a "RerunChunk" stream or a "Dataframe" stream.

For gRPC responses, we have the ability to type these more directly.

In the short term, if the stream is a dataframe stream, each DataframePart read from the stream should be converted to 1 or more Rerun chunks, which are injected into a Store.

Longer-term we might introduce an alternative to the ChunkStore for working with these Dataframe stores more directly.

@jleibs jleibs added feat-dataframe-view Everything related to the dataframe view 📺 re_viewer affects re_viewer itself labels Dec 12, 2024