feat(datasets): add pandas.DeltaSharingDataset #832
Signed-off-by: Hugo Carvalho <hugodanielsilvacarvalho.hc@gmail.com>
Hi @hugodscarvalho, thank you so much for this contribution! Leaving a minor comment.
Would you mind also adding this to the release notes and to the API docs .rst file here: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/docs/source/api/kedro_datasets_experimental.rst
        Raises:
            NotImplementedError: Saving to Delta Sharing shared tables is not supported.
        """
        raise NotImplementedError("Saving to Delta Sharing shared tables is not supported.")
This should be DatasetError, which can be imported from kedro.io.core. I see that other datasets where one of the operations is not supported do the same.
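A minimal sketch of the pattern this comment suggests, assuming DatasetError is importable from kedro.io.core as stated above. The class name ReadOnlyDatasetSketch and the method name save are hypothetical, and the fallback exception exists only so the snippet runs where Kedro is not installed.

```python
try:
    # Import path as given in the review comment above.
    from kedro.io.core import DatasetError
except ImportError:
    # Stand-in so the sketch still runs without Kedro installed (assumption).
    class DatasetError(Exception):
        """Placeholder for kedro.io.core.DatasetError."""


class ReadOnlyDatasetSketch:
    """Hypothetical stand-in for the read-only part of the dataset."""

    def save(self, data) -> None:
        # Signal the unsupported operation with DatasetError
        # instead of NotImplementedError.
        raise DatasetError(
            "Saving to Delta Sharing shared tables is not supported."
        )
```

With this change, callers that already catch DatasetError for other datasets would handle the unsupported save in the same way.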
Love what Delta is doing in the Python space, and thank you for contributing the dataset. While we don't require fully tested datasets for experimental datasets, is it possible to share a runnable example that can be copied and pasted? It would really help and make the review process easier; I am not able to get it to run, so it's tricky to review the code. It's looking very good though!
@@ -290,7 +291,8 @@ experimental = [
     "netcdf4>=1.6.4",
     "xarray>=2023.1.0",
     "rioxarray",
-    "torch"
+    "torch",
+    "delta-sharing>=1.1.1"
Why version 1.1.1?
>>> from kedro_datasets import DeltaSharingDataset
>>> import pandas as pd
>>>
>>> credentials = {
...     "profile_file": "conf/local/config.share"
... }
>>> load_args = {
...     "version": 1,
...     "limit": 10,
...     "use_delta_format": True
... }
>>> dataset = DeltaSharingDataset(
...     share="example_share",
...     schema="example_schema",
...     table="example_table",
...     credentials=credentials,
...     load_args=load_args
... )
>>> data = dataset.load()
>>> print(data)
"""
Is it possible to create an example that can run locally, or is it expected to connect to a remote server?
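One way to exercise the loading logic without a live Delta Sharing server is to replace the network call with a stub. The sketch below is an assumption about how such a local check could look, not code from this PR: load_table and fake_loader are hypothetical names, the injected loader stands in for delta_sharing.load_as_pandas, and the table address is assumed to follow the "<profile_file>#<share>.<schema>.<table>" form.

```python
from unittest import mock


def load_table(loader, profile_file, share, schema, table, **load_args):
    """Hypothetical loader: builds the table URL and delegates to `loader`."""
    url = f"{profile_file}#{share}.{schema}.{table}"
    return loader(url, **load_args)


# Stub standing in for delta_sharing.load_as_pandas; no server is contacted.
fake_loader = mock.Mock(return_value=[{"id": 1}, {"id": 2}])

rows = load_table(
    fake_loader,
    "conf/local/config.share",
    "example_share",
    "example_schema",
    "example_table",
    limit=10,
)

# Verify the stub received the composed URL and the load arguments.
fake_loader.assert_called_once_with(
    "conf/local/config.share#example_share.example_schema.example_table",
    limit=10,
)
```

A stub like this only checks how arguments are passed along; confirming behaviour against real shared tables would still need a reachable sharing server and a valid profile file.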
@hugodscarvalho, thank you for the contribution, it looks great!
Agree with the point about testing - it would be helpful for us to get some notes on how to test it locally if possible.
Happy to approve once the open nit comments are resolved 🙂
Overview
This PR introduces a new dataset called DeltaSharingDataset, designed to load data from Delta Sharing shared tables into Pandas DataFrames. Delta Sharing is an open protocol that allows organizations to securely exchange large datasets in real time, independent of the computing platforms they use. The dataset supports read-only operations and provides a way to integrate Delta Sharing data into Kedro workflows for data analysis and processing.
Features
DeltaSharingDataset is built using the Delta Sharing open protocol, enabling secure real-time data exchange.
[...] delta_sharing.load_as_pandas function, allowing for easy data manipulation and analysis within Kedro pipelines.
[...] use_delta_format argument.
Example Usage
YAML API:
Python API:
Key Configuration Parameters
share: The Delta Sharing share name.
schema: The schema name within the share.
table: The table name to load data from.
credentials.profile_file: Path to the Delta Sharing profile file.
load_args.version: The version of the table snapshot to load. If not provided, the latest version is loaded.
load_args.limit: Maximum number of rows to load. Useful for data previews.
load_args.use_delta_format: Whether to use Delta format for loading data. Defaults to False.
Limitations
DeltaSharingDataset is read-only and does not support saving data back to Delta Sharing tables.
Impact
This new dataset offers a simple, cost-effective way to incorporate Delta Sharing data into Kedro projects. It is especially useful in environments where shared data is accessed frequently for analysis, enabling users to leverage Delta Sharing's protocol for data interoperability without the need for heavy compute resources.
Why Delta Sharing?
By adding this dataset, users can connect to Delta Sharing shared tables and manage large datasets in Pandas for data science tasks, making Kedro more versatile in handling modern data-sharing use cases.
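The parameters listed under Key Configuration Parameters can be related to the underlying Delta Sharing call roughly as follows. This is a sketch, not the PR's implementation: it assumes the Delta Sharing table address form "<profile_file>#<share>.<schema>.<table>" and that load_args are forwarded as keyword arguments to delta_sharing.load_as_pandas. The helper names build_table_url and load_shared_table are hypothetical.

```python
from typing import Any, Dict, Optional


def build_table_url(profile_file: str, share: str, schema: str, table: str) -> str:
    """Compose the '<profile_file>#<share>.<schema>.<table>' address."""
    return f"{profile_file}#{share}.{schema}.{table}"


def load_shared_table(
    profile_file: str,
    share: str,
    schema: str,
    table: str,
    load_args: Optional[Dict[str, Any]] = None,
):
    """Load a shared table into a pandas DataFrame (needs a reachable server)."""
    # Imported lazily so the URL helper works without delta-sharing installed.
    import delta_sharing

    url = build_table_url(profile_file, share, schema, table)
    # load_args may carry version, limit and use_delta_format (assumption).
    return delta_sharing.load_as_pandas(url, **(load_args or {}))


url = build_table_url(
    "conf/local/config.share", "example_share", "example_schema", "example_table"
)
print(url)
```

Calling load_shared_table with the same values and load_args such as {"limit": 10} would then attempt the actual read, which requires a valid profile file and network access.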
Future Improvements