
added PyarrowTableResult #830

Merged
merged 1 commit into main from feat/pyarrow-result-builder on Apr 22, 2024

Conversation

@zilto (Collaborator) commented Apr 17, 2024

You can pass to.SAVER(dependencies=["NODE_NAME"], combine=PyarrowTableResult()) to convert the specified node's output to a pyarrow.Table before materialization. The initial motivation was to support more than pd.DataFrame and pyarrow.Table with the dlt DataSaver plugin. More generally, it can be useful for platform teams that want a "single way to store parquet files" that is independent of the specific API of any one library (e.g., pandas, polars).

see #829 for more details

Changes

  • added h_pyarrow and tests
  • updated the dlt plugin example notebook

How I tested this

  • added 2 tests

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Comment on lines +15 to +21
for example:
- pandas
- polars
- dask
- vaex
- ibis
- duckdb results
@skrawcz (Collaborator) commented Apr 22, 2024

it would be nice to be stricter on types...

@skrawcz (Collaborator) commented Apr 22, 2024

e.g.

def input_types(self) -> List[Type[Type]]:
    """Gives the applicable types to this result builder.
    This is optional for backwards compatibility, but is recommended.

    :return: A list of types that this can apply to.
    """
    _types = []
    try:
        import pandas as pd  # illustrative; repeat for polars, dask, etc.
        _types.append(pd.DataFrame)
    except ImportError:
        pass
    return _types

@zilto (Collaborator, Author) commented Apr 22, 2024

In that case, the real check is whether the object implements __dataframe__(), which is done through pyarrow.interchange.from_dataframe() inside build_result(). The PyarrowTableResult serves a slightly different role of "universal adapter", which helps us avoid maintaining an explicit list of types (which is bound to grow). I opted not to include input_types() if it was only going to return Any.

@skrawcz skrawcz merged commit 26bc1cc into main Apr 22, 2024
23 checks passed
@skrawcz skrawcz deleted the feat/pyarrow-result-builder branch April 22, 2024 03:32