- Supported passing `database` to `ibis.TableDataset` for load and save operations.
- Added functionality to save pandas DataFrames directly to Snowflake, facilitating seamless `.csv` ingestion.
- Added Python 3.9, 3.10 and 3.11 support for `snowflake.SnowflakeTableDataset`.
- Enabled connection sharing between `ibis.FileDataset` and `ibis.TableDataset` instances, thereby allowing nodes to save data loaded by one to the other (as long as they share the same connection configuration).
- Added the following new experimental datasets:
| Type | Description | Location |
|---|---|---|
| `databricks.ExternalTableDataset` | A dataset for accessing external tables in Databricks. | `kedro_datasets_experimental.databricks` |
| `safetensors.SafetensorsDataset` | A dataset for securely saving and loading files in the SafeTensors format. | `kedro_datasets_experimental.safetensors` |
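The connection-sharing behaviour described above can be sketched generically. This is an illustrative pattern only, not the actual kedro-datasets implementation: connections are cached under a key derived from the configuration, so two datasets built with identical configuration reuse one client.

```python
# Illustrative sketch of connection sharing keyed on configuration
# (not the actual kedro-datasets implementation).
_connection_cache: dict = {}

def get_connection(config: dict, factory):
    """Return a cached connection for this config, creating it on first use."""
    key = tuple(sorted(config.items()))  # hashable key derived from the config
    if key not in _connection_cache:
        _connection_cache[key] = factory(config)
    return _connection_cache[key]

def make_client(config: dict) -> object:
    """Stand-in for a real backend client constructor."""
    return object()

# Two datasets with identical connection configuration share one client.
a = get_connection({"backend": "duckdb", "database": "db.ddb"}, make_client)
b = get_connection({"backend": "duckdb", "database": "db.ddb"}, make_client)
```

A node saving through one dataset is then visible to a node loading through the other, because both talk to the same underlying connection.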
- Delayed backend connection for `pandas.GBQTableDataset`. In practice, this means that a dataset's connection details aren't used (or validated) until the dataset is accessed. On the plus side, the cost of connection isn't incurred until the dataset is actually used. Furthermore, this makes the dataset object serializable (e.g. for use with `ParallelRunner`), because the unserializable client isn't part of it.
- Removed the unused BigQuery client created in `pandas.GBQQueryDataset`. This makes the dataset object serializable (e.g. for use with `ParallelRunner`) by removing the unserializable object.
- Implemented Snowflake's local testing framework for testing purposes.
- Improved the dependency management for Spark-based datasets by refactoring the Spark and Databricks utility functions used across the datasets.
- Added deprecation warnings for `tracking.MetricsDataset` and `tracking.JSONDataset`.
- Moved `kedro-catalog` JSON schemas from Kedro core to `kedro-datasets`.
- Demoted `video.VideoDataset` from a core to an experimental dataset.
- Removed file handling capabilities from `ibis.TableDataset`. Use `ibis.FileDataset` to load and save files with an Ibis backend instead.
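The delayed-connection behaviour described above can be sketched generically. The names below are illustrative, not the real `pandas.GBQTableDataset` code: the point is that constructing the dataset stores the credentials without validating them, and the (unpicklable) client only comes into existence on first use.

```python
from functools import cached_property

class LazyConnectionDataset:
    """Illustrative sketch of delayed backend connection (not the real
    pandas.GBQTableDataset code): the client is built on first access,
    so constructing the dataset is cheap and the object stays
    serializable until a connection is actually needed."""

    def __init__(self, credentials: dict):
        self._credentials = credentials  # stored, but not validated yet

    @cached_property
    def _client(self):
        # In the real dataset this is where e.g. a BigQuery client
        # would be created and the credentials validated.
        return {"connected_with": self._credentials}

    def load(self):
        return self._client  # the connection happens here, on first use

ds = LazyConnectionDataset({"project": "demo"})
# No client exists yet; it is created only when load() is first called.
```

Because the client is absent until first access, the dataset object can be pickled and shipped to worker processes, which is exactly what `ParallelRunner` needs.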
Many thanks to the following Kedroids for contributing PRs to this release:
- Added the following new core datasets:

| Type | Description | Location |
|---|---|---|
| `ibis.FileDataset` | A dataset for loading and saving files using Ibis's backends. | `kedro_datasets.ibis` |
- Changed Ibis datasets to connect to an in-memory DuckDB database if connection configuration is not provided.
- Removed support for Python 3.9.
- Added the following new experimental datasets:

| Type | Description | Location |
|---|---|---|
| `pytorch.PyTorchDataset` | A dataset for securely saving and loading PyTorch models. | `kedro_datasets_experimental.pytorch` |
| `prophet.ProphetModelDataset` | A dataset for Meta's Prophet model for time series forecasting. | `kedro_datasets_experimental.prophet` |
- Added the following new core datasets:

| Type | Description | Location |
|---|---|---|
| `plotly.HTMLDataset` | A dataset for saving a Plotly figure as HTML. | `kedro_datasets.plotly` |
- Refactored all datasets to set `fs_args` defaults in the same way as `load_args` and `save_args`, rather than hardcoding values in the save methods.
- Fixed a bug related to loading/saving models from/to remote storage using `TensorFlowModelDataset`.
- Fixed deprecated load and save approaches of `GBQTableDataset` and `GBQQueryDataset` by invoking save and load directly via the `pandas-gbq` library.
- Fixed an incorrect `pandas` optional dependency.
- Exposed `load` and `save` publicly for each dataset. This requires Kedro version 0.19.7 or higher.
- Replaced `geopandas.GeoJSONDataset` with `geopandas.GenericDataset` to support the Parquet and Feather file formats.
Many thanks to the following Kedroids for contributing PRs to this release:
- Brandon Meek
- yury-fedotov
- gitgud5000
- janickspirig
- Galen Seilis
- Mariusz Wojakowski
- harm-matthias-harms
- Felix Scherz
- Improved the `partitions.PartitionedDataset` representation when printing.
- Updated `ibis.TableDataset` to make sure credentials are not printed in interactive environments.
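The credential-hiding behaviour above can be sketched with a minimal `__repr__`. The class and fields below are hypothetical, not the actual `ibis.TableDataset` implementation:

```python
class SafeReprDataset:
    """Illustrative sketch (not the actual ibis.TableDataset code) of
    keeping credentials out of interactive output."""

    def __init__(self, table_name: str, credentials: dict):
        self.table_name = table_name
        self._credentials = credentials

    def __repr__(self):
        # Echo identifying fields only; never print secrets to the REPL.
        return f"SafeReprDataset(table_name={self.table_name!r})"

ds = SafeReprDataset("orders", {"password": "s3cret"})
```

In a notebook or Jupyter session, displaying the dataset then shows the table name but never the connection secrets.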
- Added the following new experimental datasets:

| Type | Description | Location |
|---|---|---|
| `langchain.ChatAnthropicDataset` | A dataset for loading a ChatAnthropic LangChain model. | `kedro_datasets_experimental.langchain` |
| `langchain.ChatCohereDataset` | A dataset for loading a ChatCohere LangChain model. | `kedro_datasets_experimental.langchain` |
| `langchain.OpenAIEmbeddingsDataset` | A dataset for loading an OpenAIEmbeddings LangChain model. | `kedro_datasets_experimental.langchain` |
| `langchain.ChatOpenAIDataset` | A dataset for loading a ChatOpenAI LangChain model. | `kedro_datasets_experimental.langchain` |
| `rioxarray.GeoTIFFDataset` | A dataset for loading and saving GeoTIFF raster data. | `kedro_datasets_experimental.rioxarray` |
| `netcdf.NetCDFDataset` | A dataset for loading and saving `*.nc` files. | `kedro_datasets_experimental.netcdf` |
- Added the following new core datasets:

| Type | Description | Location |
|---|---|---|
| `dask.CSVDataset` | A dataset for loading CSV files using Dask. | `kedro_datasets.dask` |
- Extended the preview feature to `yaml.YAMLDataset`.
- Added a `metadata` parameter for a few datasets.
- `netcdf.NetCDFDataset` moved from `kedro_datasets` to `kedro_datasets_experimental`.
Many thanks to the following Kedroids for contributing PRs to this release:
- Removed the arbitrary upper bound for `s3fs`.
- Added support for NetCDF4 via `engine="netcdf4"` and `engine="h5netcdf"` to `netcdf.NetCDFDataset`.
Many thanks to the following Kedroids for contributing PRs to this release:
- Added the following new datasets:

| Type | Description | Location |
|---|---|---|
| `netcdf.NetCDFDataset` | A dataset for loading and saving `*.nc` files. | `kedro_datasets.netcdf` |
| `ibis.TableDataset` | A dataset for loading and saving using Ibis's backends. | `kedro_datasets.ibis` |
- Added support for Python 3.12.
- Normalised optional dependency names for datasets to follow PEP 685: the `.` characters have been replaced with `-`. Note that this might be breaking for some users. For example, users should now install the optional dependencies for `pandas.ParquetDataset` from `kedro-datasets` like this: `pip install kedro-datasets[pandas-parquetdataset]`
- Removed `setup.py` and moved completely to `pyproject.toml` for `kedro-datasets`.
- If using MSSQL, `load_args:params` will be typecast as a tuple.
- Fixed a bug with loading datasets from Hugging Face; parameters can now be passed to the `load_dataset` function.
- Made the `connection_args` argument optional when calling `create_connection()` in `sql_dataset.py`.
Many thanks to the following Kedroids for contributing PRs to this release:
- Added the following new datasets:

| Type | Description | Location |
|---|---|---|
| `matlab.MatlabDataset` | A dataset which uses `scipy` to save and load `.mat` files. | `kedro_datasets.matlab` |
- Extended the preview feature for matplotlib, Plotly and tracking datasets.
- Allowed additional parameters for the SQLAlchemy engine when using SQL datasets.
- Removed Windows-specific conditions in the `pandas.HDFDataset` extra dependencies.
Many thanks to the following Kedroids for contributing PRs to this release:
- Added the following new datasets:

| Type | Description | Location |
|---|---|---|
| `huggingface.HFDataset` | A dataset to load Hugging Face datasets using the `datasets` library. | `kedro_datasets.huggingface` |
| `huggingface.HFTransformerPipelineDataset` | A dataset to load pretrained Hugging Face transformers using the `transformers` library. | `kedro_datasets.huggingface` |
- Removed dataset classes ending with "DataSet"; use the "Dataset" spelling instead.
- Removed support for Python 3.7 and 3.8.
- Added `databricks-connect>=13.0` support for Spark- and Databricks-based datasets.
- Bumped `s3fs` to the latest calendar-versioned release.
- `PartitionedDataset` and `IncrementalDataset` now both support versioning of the underlying dataset.
- Fixed a bug with loading models saved with `TensorFlowModelDataset`.
- Made dataset parameters keyword-only.
- Corrected `pandas-gbq` as a Python 3.11 dependency.
Many thanks to the following Kedroids for contributing PRs to this release:
- Added the following new datasets:

| Type | Description | Location |
|---|---|---|
| `polars.LazyPolarsDataset` | A `LazyPolarsDataset` using Polars's Lazy API. | `kedro_datasets.polars` |
- Moved `PartitionedDataSet` and `IncrementalDataSet` from the core Kedro repo to `kedro-datasets` and renamed them to `PartitionedDataset` and `IncrementalDataset`.
- Renamed `polars.GenericDataSet` to `polars.EagerPolarsDataset` to better reflect the difference between the two dataset classes.
- Added a deprecation warning when using `polars.GenericDataSet` or `polars.GenericDataset`, noting that these have been renamed to `polars.EagerPolarsDataset`.
- Delayed backend connection for `pandas.SQLTableDataset`, `pandas.SQLQueryDataset`, and `snowflake.SnowparkTableDataset`. In practice, this means that a dataset's connection details aren't used (or validated) until the dataset is accessed. On the plus side, the cost of connection isn't incurred until the dataset is actually used.
- Fixed an erroneous warning when using a cloud protocol file path with `SparkDataSet` on Databricks.
- Updated `PickleDataset` to explicitly mention `cloudpickle` support.
Many thanks to the following Kedroids for contributing PRs to this release:
- Pinned the `tables` version on `kedro-datasets` for Python < 3.8.
- Renamed dataset and error classes in accordance with the Kedro lexicon. Dataset classes ending with "DataSet" are deprecated and will be removed in 2.0.0.
- Added the following new datasets:

| Type | Description | Location |
|---|---|---|
| `polars.GenericDataSet` | A `GenericDataSet` backed by Polars, a lightning-fast dataframe package built entirely using Rust. | `kedro_datasets.polars` |
- Fixed broken links in docstrings.
- Reverted PySpark pin to <4.0.
Many thanks to the following Kedroids for contributing PRs to this release:
- Added support for Python 3.11.
- Made `databricks.ManagedTableDataSet` read-only by default.
  - The user needs to specify `write_mode` to allow `save` on the data set.
- Fixed an issue on `api.APIDataSet` where the sent data was doubly converted to a JSON string (once by us and once by the `requests` library).
- Fixed problematic `kedro-datasets` optional dependencies; reverted to `setup.py`.
- Fixed problematic `kedro-datasets` optional dependencies.
- Fixed problematic docstrings in `pandas.DeltaTableDataSet` causing Read the Docs builds on Kedro to fail.
- Added the following new datasets:

| Type | Description | Location |
|---|---|---|
| `pandas.DeltaTableDataSet` | A dataset to work with delta tables. | `kedro_datasets.pandas` |

- Implemented lazy loading of dataset subpackages and classes.
  - Suppose that SQLAlchemy, a Python SQL toolkit, is installed in your Python environment. With this change, the SQLAlchemy library will not be loaded (for `pandas.SQLQueryDataSet` or `pandas.SQLTableDataSet`) if you load a different pandas dataset (e.g. `pandas.CSVDataSet`).
- Added automatic inference of the file format for `pillow.ImageDataSet` to be passed to `save()`.
- Improved error messages for missing dataset dependencies.
  - Suppose that SQLAlchemy, a Python SQL toolkit, is not installed in your Python environment. Previously, `from kedro_datasets.pandas import SQLQueryDataSet` or `from kedro_datasets.pandas import SQLTableDataSet` would result in `ImportError: cannot import name 'SQLTableDataSet' from 'kedro_datasets.pandas'`. Now, the same imports raise the more helpful and intuitive `ModuleNotFoundError: No module named 'sqlalchemy'`.
Many thanks to the following Kedroids for contributing PRs to this release:
- Fixed the documentation of `GeoJSONDataSet` and `SparkStreamingDataSet`.
- Fixed problematic docstrings causing Read the Docs builds on Kedro to fail.
- Fixed missing `pickle.PickleDataSet` extras in `setup.py`.
- Added the following new datasets:

| Type | Description | Location |
|---|---|---|
| `spark.SparkStreamingDataSet` | A dataset to work with PySpark Streaming DataFrames. | `kedro_datasets.spark` |
- Fixed problematic docstrings of `APIDataSet`.
- Added the following new datasets:

| Type | Description | Location |
|---|---|---|
| `databricks.ManagedTableDataSet` | A dataset to access managed delta tables in Databricks. | `kedro_datasets.databricks` |
- Added pandas 2.0 support.
- Added SQLAlchemy 2.0 support (and dropped support for versions below 1.4).
- Added a save method to `APIDataSet`.
- Reduced the constructor arguments for `APIDataSet` by replacing most arguments with a single constructor argument `load_args`. This makes it more consistent with other Kedro datasets and the underlying `requests` API, and automatically enables the full configuration domain: stream, certificates, proxies, and more.
- Relaxed the Kedro version pin to `>=0.16`.
- Added a `metadata` attribute to all existing datasets. This is ignored by Kedro, but may be consumed by users or external plugins.
- Relaxed the `delta-spark` upper bound to allow compatibility with Spark 3.1.x and 3.2.x.
- Upgraded the required `polars` version to 0.17.
- Renamed `TensorFlowModelDataset` to `TensorFlowModelDataSet` to be consistent with all other plugins in Kedro-Datasets.
Many thanks to the following Kedroids for contributing PRs to this release:
- Added `fsspec` resolution in `SparkDataSet` to support more filesystems.
- Added the `_preview` method to the pandas `ExcelDataSet` and `CSVDataSet` classes.
- Fixed a docstring in the pandas `SQLQueryDataSet` as part of the Sphinx revamp on Kedro.
- Fixed problematic docstrings causing Read the Docs builds on Kedro to fail.
- Added the following new datasets:

| Type | Description | Location |
|---|---|---|
| `polars.CSVDataSet` | A `CSVDataSet` backed by Polars, a lightning-fast dataframe package built entirely using Rust. | `kedro_datasets.polars` |
| `snowflake.SnowparkTableDataSet` | Work with Snowpark DataFrames from tables in Snowflake. | `kedro_datasets.snowflake` |
- Added an `mssql` backend to the `SQLQueryDataSet` using the `pyodbc` library.
- Added a warning when the user tries to use `SparkDataSet` on Databricks without specifying a file path with the `/dbfs/` prefix.
- Changed references to the `kedro.pipeline.Pipeline` object throughout the test suite to the `kedro.modular_pipeline.pipeline` factory.
- Relaxed the PyArrow range in line with pandas.
- Fixed outdated links to the `dill` package documentation.
- Fixed docstring formatting in `VideoDataSet` that was causing the documentation builds to fail.
First official release of Kedro-Datasets.
Datasets are Kedro’s way of dealing with input and output in a data and machine-learning pipeline. Kedro supports numerous datasets out of the box to allow you to process different data formats including Pandas, Plotly, Spark and more.
The datasets have always been part of the core Kedro Framework project inside `kedro.extras`. In Kedro 0.19.0, we will remove datasets from Kedro to reduce the breaking changes associated with dataset dependencies. Instead, users will need to use the datasets from the `kedro-datasets` repository.
- Changed `pandas.ParquetDataSet` to load data using `pandas` instead of `parquet`.
The initial release of Kedro-Datasets.
We are also grateful to everyone who advised and supported us, filed issues or helped resolve them, asked and answered questions and were part of inspiring discussions.