RELEASE.md

File metadata and controls

executable file
·
503 lines (315 loc) · 23.7 KB

Upcoming Release

Major features and improvements

Bug fixes and other changes

Breaking Changes

Community contributions

Release 6.0.0

Major features and improvements

  • Supported passing database to ibis.TableDataset for load and save operations.
  • Added functionality to save pandas DataFrames directly to Snowflake, facilitating seamless .csv ingestion.
  • Added Python 3.9, 3.10 and 3.11 support for snowflake.SnowflakeTableDataset.
  • Enabled connection sharing between ibis.FileDataset and ibis.TableDataset instances, thereby allowing nodes to save data loaded by one to the other (as long as they share the same connection configuration).
  • Added the following new experimental datasets:
| Type | Description | Location |
| --- | --- | --- |
| databricks.ExternalTableDataset | A dataset for accessing external tables in Databricks. | kedro_datasets_experimental.databricks |
| safetensors.SafetensorsDataset | A dataset for securely saving and loading files in the SafeTensors format. | kedro_datasets_experimental.safetensors |

Bug fixes and other changes

  • Delayed backend connection for pandas.GBQTableDataset. In practice, this means that a dataset's connection details aren't used (or validated) until the dataset is accessed. On the plus side, the cost of connecting is incurred only if and when the dataset is actually used. Furthermore, this makes the dataset object serializable (e.g. for use with ParallelRunner), because the unserializable client is no longer part of it.
  • Removed the unused BigQuery client created in pandas.GBQQueryDataset. This makes the dataset object serializable (e.g. for use with ParallelRunner) by removing the unserializable object.
  • Implemented Snowflake's local testing framework for testing purposes.
  • Improved the dependency management for Spark-based datasets by refactoring the Spark and Databricks utility functions used across the datasets.
  • Added deprecation warning for tracking.MetricsDataset and tracking.JSONDataset.
  • Moved kedro-catalog JSON schemas from Kedro core to kedro-datasets.
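
The serializability point in the two BigQuery items above can be illustrated with a minimal sketch. It assumes a backend client that cannot be pickled and uses threading.Lock as a stand-in for one. EagerDataset, LazyDataset and is_picklable are hypothetical names, not the kedro-datasets implementation.

```python
import pickle
import threading


class EagerDataset:
    """Hypothetical dataset that builds its client in the constructor."""

    def __init__(self, project):
        self.project = project
        # threading.Lock stands in for a backend client that cannot be pickled.
        self.client = threading.Lock()


class LazyDataset:
    """Hypothetical dataset that only builds a client when it is accessed."""

    def __init__(self, project):
        self.project = project  # only plain, picklable configuration is stored

    def load(self):
        client = threading.Lock()  # created on demand and never stored on self
        with client:
            return f"rows from {self.project}"


def is_picklable(obj):
    """Report whether pickle can serialise the object."""
    try:
        pickle.dumps(obj)
    except (TypeError, pickle.PicklingError):
        return False
    return True


print(is_picklable(EagerDataset("demo")))  # False
print(is_picklable(LazyDataset("demo")))   # True
```

Runners that hand datasets to worker processes rely on pickling, so keeping the client out of the object's state is what makes the lazy variant usable there.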

Breaking Changes

  • Demoted video.VideoDataset from core to experimental dataset.
  • Removed file handling capabilities from ibis.TableDataset. Use ibis.FileDataset to load and save files with an Ibis backend instead.

Community contributions

Many thanks to the following Kedroids for contributing PRs to this release:

Release 5.1.0

Major features and improvements

  • Added the following new core datasets:
| Type | Description | Location |
| --- | --- | --- |
| ibis.FileDataset | A dataset for loading and saving files using Ibis's backends. | kedro_datasets.ibis |

Bug fixes and other changes

  • Changed Ibis datasets to connect to an in-memory DuckDB database if connection configuration is not provided.

Release 5.0.0

Major features and improvements

  • Removed support for Python 3.9.
  • Added the following new experimental datasets:
| Type | Description | Location |
| --- | --- | --- |
| pytorch.PyTorchDataset | A dataset for securely saving and loading PyTorch models. | kedro_datasets_experimental.pytorch |
| prophet.ProphetModelDataset | A dataset for Meta's Prophet model for time series forecasting. | kedro_datasets_experimental.prophet |
  • Added the following new core datasets:
| Type | Description | Location |
| --- | --- | --- |
| plotly.HTMLDataset | A dataset for saving a plotly figure as HTML. | kedro_datasets.plotly |

Bug fixes and other changes

  • Refactored all datasets to set fs_args defaults in the same way as load_args and save_args and not have hardcoded values in the save methods.
  • Fixed bug related to loading/saving models from/to remote storage using TensorFlowModelDataset.
  • Fixed deprecated load and save approaches of GBQTableDataset and GBQQueryDataset by calling load and save directly through the pandas-gbq library.
  • Fixed incorrect pandas optional dependency.
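
The fs_args refactoring above can be pictured as class-level defaults merged with user overrides, the same pattern as load_args and save_args. The following is a minimal sketch under that assumption. SketchDataset and the default values shown are illustrative and not taken from kedro-datasets.

```python
class SketchDataset:
    """Hypothetical dataset; class and argument names are illustrative only."""

    DEFAULT_LOAD_ARGS = {}
    DEFAULT_SAVE_ARGS = {"index": False}
    DEFAULT_FS_ARGS = {"open_args_save": {"mode": "w"}}

    def __init__(self, *, load_args=None, save_args=None, fs_args=None):
        # All three follow one pattern: class-level defaults, then user overrides.
        self._load_args = {**self.DEFAULT_LOAD_ARGS, **(load_args or {})}
        self._save_args = {**self.DEFAULT_SAVE_ARGS, **(save_args or {})}
        self._fs_args = {**self.DEFAULT_FS_ARGS, **(fs_args or {})}


default = SketchDataset()
custom = SketchDataset(fs_args={"open_args_save": {"mode": "wb"}})

print(default._fs_args)  # {'open_args_save': {'mode': 'w'}}
print(custom._fs_args)   # {'open_args_save': {'mode': 'wb'}}
```

The merge here is shallow, so a user-supplied top-level key replaces the default for that key as a whole. With defaults declared on the class, the save method no longer needs a hardcoded value.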

Breaking Changes

  • Exposed load and save publicly for each dataset. This requires Kedro version 0.19.7 or higher.
  • Replaced the geopandas.GeoJSONDataset with geopandas.GenericDataset to support parquet and feather file formats.

Community contributions

Many thanks to the following Kedroids for contributing PRs to this release:

Release 4.1.0

Major features and improvements

  • Improved partitions.PartitionedDataset representation when printing.

Bug fixes and other changes

  • Updated ibis.TableDataset to make sure credentials are not printed in interactive environment.

Breaking Changes

Community contributions

Release 4.0.0

Major features and improvements

  • Added the following new experimental datasets:
| Type | Description | Location |
| --- | --- | --- |
| langchain.ChatAnthropicDataset | A dataset for loading a ChatAnthropic langchain model. | kedro_datasets_experimental.langchain |
| langchain.ChatCohereDataset | A dataset for loading a ChatCohere langchain model. | kedro_datasets_experimental.langchain |
| langchain.OpenAIEmbeddingsDataset | A dataset for loading an OpenAIEmbeddings langchain model. | kedro_datasets_experimental.langchain |
| langchain.ChatOpenAIDataset | A dataset for loading a ChatOpenAI langchain model. | kedro_datasets_experimental.langchain |
| rioxarray.GeoTIFFDataset | A dataset for loading and saving GeoTIFF raster data. | kedro_datasets_experimental.rioxarray |
| netcdf.NetCDFDataset | A dataset for loading and saving "*.nc" files. | kedro_datasets_experimental.netcdf |
  • Added the following new core datasets:
| Type | Description | Location |
| --- | --- | --- |
| dask.CSVDataset | A dataset for loading CSV files using Dask. | kedro_datasets.dask |
  • Extended preview feature to yaml.YAMLDataset.

Bug fixes and other changes

  • Added metadata parameter for a few datasets.

Breaking Changes

  • Moved netcdf.NetCDFDataset from kedro_datasets to kedro_datasets_experimental.

Community contributions

Many thanks to the following Kedroids for contributing PRs to this release:

Release 3.0.1

Bug fixes and other changes

  • Removed arbitrary upper bound for s3fs.
  • Added support for NetCDF4 via engine="netcdf4" and engine="h5netcdf" to netcdf.NetCDFDataset.

Community contributions

Many thanks to the following Kedroids for contributing PRs to this release:

Release 3.0.0

Major features and improvements

  • Added the following new datasets:
| Type | Description | Location |
| --- | --- | --- |
| netcdf.NetCDFDataset | A dataset for loading and saving *.nc files. | kedro_datasets.netcdf |
| ibis.TableDataset | A dataset for loading and saving using Ibis's backends. | kedro_datasets.ibis |
  • Added support for Python 3.12.
  • Normalised optional dependency names for datasets to follow PEP 685. The . characters have been replaced with - in the optional dependency names. Note that this might be breaking for some users. For example, users should now install optional dependencies for pandas.ParquetDataset from kedro-datasets like this:
pip install kedro-datasets[pandas-parquetdataset]
  • Removed setup.py and moved to pyproject.toml completely for kedro-datasets.
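
The PEP 685 normalisation mentioned above replaces runs of ., - and _ with a single - and lowercases the result. A short sketch of that rule follows; normalize_extra is a hypothetical helper name, not a function provided by kedro-datasets.

```python
import re


def normalize_extra(name):
    """Normalise an extra name following the rule given in PEP 685."""
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalize_extra("pandas.ParquetDataset"))  # pandas-parquetdataset
```

The result matches the extra name used in the pip command shown in the list above.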

Bug fixes and other changes

  • If using MSSQL, load_args:params will be typecast to a tuple.
  • Fixed bug with loading datasets from Hugging Face. Now allows passing parameters to the load_dataset function.
  • Made connection_args argument optional when calling create_connection() in sql_dataset.py.
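
The MSSQL item above is terse, so here is a minimal sketch of what casting load_args:params to a tuple could look like. prepare_load_args and the dialect strings are hypothetical; the actual logic in kedro-datasets may differ.

```python
def prepare_load_args(load_args, dialect):
    """Return a copy of load_args with params cast to a tuple for MSSQL."""
    prepared = dict(load_args)
    if dialect == "mssql" and "params" in prepared:
        prepared["params"] = tuple(prepared["params"])
    return prepared


print(prepare_load_args({"params": [2023, "EU"]}, "mssql"))
# {'params': (2023, 'EU')}
print(prepare_load_args({"params": [2023, "EU"]}, "postgresql"))
# {'params': [2023, 'EU']}
```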

Community contributions

Many thanks to the following Kedroids for contributing PRs to this release:

Release 2.1.0

Major features and improvements

  • Added the following new datasets:
| Type | Description | Location |
| --- | --- | --- |
| matlab.MatlabDataset | A dataset which uses scipy to save and load .mat files. | kedro_datasets.matlab |
  • Extended preview feature for matplotlib, plotly and tracking datasets.
  • Allowed additional parameters for sqlalchemy engine when using sql datasets.

Bug fixes and other changes

  • Removed Windows-specific conditions in pandas.HDFDataset extra dependencies.

Community contributions

Many thanks to the following Kedroids for contributing PRs to this release:

Release 2.0.0

Major features and improvements

  • Added the following new datasets:
| Type | Description | Location |
| --- | --- | --- |
| huggingface.HFDataset | A dataset to load Hugging Face datasets using the datasets library. | kedro_datasets.huggingface |
| huggingface.HFTransformerPipelineDataset | A dataset to load pretrained Hugging Face transformers using the transformers library. | kedro_datasets.huggingface |
  • Removed Dataset classes ending with "DataSet", use the "Dataset" spelling instead.
  • Removed support for Python 3.7 and 3.8.
  • Added databricks-connect>=13.0 support for Spark- and Databricks-based datasets.
  • Bumped s3fs to latest calendar-versioned release.
  • PartitionedDataset and IncrementalDataset now both support versioning of the underlying dataset.

Bug fixes and other changes

  • Fixed bug with loading models saved with TensorFlowModelDataset.
  • Made dataset parameters keyword-only.
  • Corrected pandas-gbq as py311 dependency.
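
Keyword-only dataset parameters mean that constructor arguments can no longer be passed positionally. A minimal sketch of the pattern follows. SketchCSVDataset is a hypothetical class rather than one from kedro-datasets.

```python
class SketchCSVDataset:
    """Hypothetical dataset whose constructor accepts keyword arguments only."""

    def __init__(self, *, filepath, load_args=None):
        # The bare * makes every following parameter keyword-only.
        self.filepath = filepath
        self.load_args = load_args or {}


dataset = SketchCSVDataset(filepath="data/cars.csv")
print(dataset.filepath)  # data/cars.csv

try:
    SketchCSVDataset("data/cars.csv")  # positional use is rejected
except TypeError as error:
    print(f"TypeError: {error}")
```

Code that instantiates datasets positionally needs to name its arguments after upgrading. Catalog entries written in YAML already pass arguments by name.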

Community contributions

Many thanks to the following Kedroids for contributing PRs to this release:

Release 1.8.0

Major features and improvements

  • Added the following new datasets:
| Type | Description | Location |
| --- | --- | --- |
| polars.LazyPolarsDataset | A LazyPolarsDataset using polars's Lazy API. | kedro_datasets.polars |
  • Moved PartitionedDataSet and IncrementalDataSet from the core Kedro repo to kedro-datasets and renamed to PartitionedDataset and IncrementalDataset.
  • Renamed polars.GenericDataSet to polars.EagerPolarsDataset to better reflect the difference between the two dataset classes.
  • Added a deprecation warning when using polars.GenericDataSet or polars.GenericDataset that these have been renamed to polars.EagerPolarsDataset.
  • Delayed backend connection for pandas.SQLTableDataset, pandas.SQLQueryDataset, and snowflake.SnowparkTableDataset. In practice, this means that a dataset's connection details aren't used (or validated) until the dataset is accessed. On the plus side, the cost of connecting is incurred only if and when the dataset is actually used.

Bug fixes and other changes

  • Fixed erroneous warning when using a cloud protocol file path with SparkDataSet on Databricks.
  • Updated PickleDataset to explicitly mention cloudpickle support.

Community contributions

Many thanks to the following Kedroids for contributing PRs to this release:

Release 1.7.1

Bug fixes and other changes

  • Pinned tables version on kedro-datasets for Python < 3.8.

Upcoming deprecations for Kedro-Datasets 2.0.0

  • Renamed dataset and error classes, in accordance with the Kedro lexicon. Dataset classes ending with "DataSet" are deprecated and will be removed in 2.0.0.
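
A common way to keep a deprecated class name importable while warning about it is a module-level __getattr__ (PEP 562). The sketch below shows that technique on a throwaway module object. It is an assumption about how such a deprecation could be wired up, and sketch_datasets, CSVDataset and _DEPRECATED_NAMES are hypothetical names.

```python
import types
import warnings


class CSVDataset:
    """Hypothetical dataset class using the new spelling."""


# A throwaway module object standing in for a dataset package.
sketch_module = types.ModuleType("sketch_datasets")
sketch_module.CSVDataset = CSVDataset

_DEPRECATED_NAMES = {"CSVDataSet": "CSVDataset"}


def _module_getattr(name):
    """Resolve deprecated names to their replacements, with a warning."""
    if name in _DEPRECATED_NAMES:
        new_name = _DEPRECATED_NAMES[name]
        warnings.warn(
            f"{name!r} has been renamed to {new_name!r} and will be removed.",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(sketch_module, new_name)
    raise AttributeError(f"module 'sketch_datasets' has no attribute {name!r}")


# Module-level __getattr__ (PEP 562) is consulted only when normal lookup fails.
sketch_module.__getattr__ = _module_getattr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_class = sketch_module.CSVDataSet

print(old_class is CSVDataset)      # True
print(caught[0].category.__name__)  # DeprecationWarning
```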

Release 1.7.0:

Major features and improvements

  • Added the following new datasets:
| Type | Description | Location |
| --- | --- | --- |
| polars.GenericDataSet | A GenericDataSet backed by polars, a lightning fast dataframe package built entirely using Rust. | kedro_datasets.polars |

Bug fixes and other changes

  • Fixed broken links in docstrings.
  • Reverted PySpark pin to <4.0.

Community contributions

Many thanks to the following Kedroids for contributing PRs to this release:

Release 1.6.0:

Major features and improvements

  • Added support for Python 3.11.

Release 1.5.3:

Bug fixes and other changes

  • Made databricks.ManagedTableDataSet read-only by default.
    • The user needs to specify write_mode to allow save on the data set.
  • Fixed an issue on api.APIDataSet where the sent data was doubly converted to json string (once by us and once by the requests library).
  • Fixed problematic kedro-datasets optional dependencies and reverted to setup.py.
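
The double JSON conversion described for api.APIDataSet above can be reproduced with the standard library alone. This sketch shows only the effect of encoding twice; it does not reproduce the dataset's code, and the payload is made up.

```python
import json

payload = {"name": "kedro"}

encoded_once = json.dumps(payload)
encoded_twice = json.dumps(encoded_once)  # the bug: encoding an already-encoded string

print(encoded_once)   # {"name": "kedro"}
print(encoded_twice)  # "{\"name\": \"kedro\"}"

# A receiver decoding the body gets a dict in the first case, a str in the second.
print(type(json.loads(encoded_once)).__name__)   # dict
print(type(json.loads(encoded_twice)).__name__)  # str
```

Because the requests library serialises whatever is passed through its json parameter, the caller should pass the original object rather than a string that is already encoded.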

Community contributions

Release 1.5.2:

Bug fixes and other changes

  • Fixed problematic kedro-datasets optional dependencies.

Release 1.5.1:

Bug fixes and other changes

  • Fixed problematic docstrings in pandas.DeltaTableDataSet causing Read the Docs builds on Kedro to fail.

Release 1.5.0

Major features and improvements

  • Added the following new datasets:
| Type | Description | Location |
| --- | --- | --- |
| pandas.DeltaTableDataSet | A dataset to work with delta tables. | kedro_datasets.pandas |
  • Implemented lazy loading of dataset subpackages and classes.
    • Suppose that SQLAlchemy, a Python SQL toolkit, is installed in your Python environment. With this change, the SQLAlchemy library will not be loaded (for pandas.SQLQueryDataSet or pandas.SQLTableDataSet) if you load a different pandas dataset (e.g. pandas.CSVDataSet).
  • Added automatic inference of file format for pillow.ImageDataSet to be passed to save().
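
Lazy loading of subpackages and classes can be approximated with a module-level __getattr__ (PEP 562) that imports a dependency only when an attribute is first requested. The notes do not state which mechanism kedro-datasets uses, so treat the following as a sketch of the general idea. lazy_package, _LAZY_ATTRS and import_log are hypothetical, and a standard-library module stands in for an optional dependency.

```python
import importlib
import types

# Attribute name -> module that provides it. A standard-library module stands in
# for an optional dependency; the last entry is deliberately not installed.
_LAZY_ATTRS = {
    "Fraction": "fractions",
    "MissingThing": "a_dependency_that_is_not_installed",
}

import_log = []  # records which modules this hook tried to import
lazy_package = types.ModuleType("lazy_package")


def _lazy_getattr(name):
    """Import the providing module only when the attribute is first requested."""
    if name not in _LAZY_ATTRS:
        raise AttributeError(f"module 'lazy_package' has no attribute {name!r}")
    import_log.append(_LAZY_ATTRS[name])
    module = importlib.import_module(_LAZY_ATTRS[name])
    value = getattr(module, name)
    setattr(lazy_package, name, value)  # cache so later lookups skip this hook
    return value


lazy_package.__getattr__ = _lazy_getattr

print(import_log)                   # []
print(lazy_package.Fraction(1, 2))  # 1/2
print(import_log)                   # ['fractions']

try:
    lazy_package.MissingThing
except ModuleNotFoundError as error:
    print(error)  # No module named 'a_dependency_that_is_not_installed'
```

The last lines show a side benefit of this approach: a missing dependency surfaces as a ModuleNotFoundError naming the package to install, rather than as a generic import failure.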

Bug fixes and other changes

  • Improved error messages for missing dataset dependencies.
    • Suppose that SQLAlchemy, a Python SQL toolkit, is not installed in your Python environment. Previously, from kedro_datasets.pandas import SQLQueryDataSet or from kedro_datasets.pandas import SQLTableDataSet would result in ImportError: cannot import name 'SQLTableDataSet' from 'kedro_datasets.pandas'. Now, the same imports raise the more helpful and intuitive ModuleNotFoundError: No module named 'sqlalchemy'.

Community contributions

Many thanks to the following Kedroids for contributing PRs to this release:

Release 1.4.2

Bug fixes and other changes

  • Fixed documentation of GeoJSONDataSet and SparkStreamingDataSet.
  • Fixed problematic docstrings causing Read the Docs builds on Kedro to fail.

Release 1.4.1:

Bug fixes and other changes

  • Fixed missing pickle.PickleDataSet extras in setup.py.

Release 1.4.0:

Major features and improvements

  • Added the following new datasets:
| Type | Description | Location |
| --- | --- | --- |
| spark.SparkStreamingDataSet | A dataset to work with PySpark Streaming DataFrame. | kedro_datasets.spark |

Bug fixes and other changes

  • Fixed problematic docstrings of APIDataSet.

Release 1.3.0:

Major features and improvements

  • Added the following new datasets:
| Type | Description | Location |
| --- | --- | --- |
| databricks.ManagedTableDataSet | A dataset to access managed delta tables in Databricks. | kedro_datasets.databricks |
  • Added pandas 2.0 support.
  • Added SQLAlchemy 2.0 support (and dropped support for versions below 1.4).
  • Added a save method to APIDataSet.
  • Reduced constructor arguments for APIDataSet by replacing most arguments with a single constructor argument load_args. This makes it more consistent with other Kedro DataSets and the underlying requests API, and automatically enables the full configuration domain: stream, certificates, proxies, and more.
  • Relaxed Kedro version pin to >=0.16.
  • Added metadata attribute to all existing datasets. This is ignored by Kedro, but may be consumed by users or external plugins.

Bug fixes and other changes

  • Relaxed delta-spark upper bound to allow compatibility with Spark 3.1.x and 3.2.x.
  • Upgraded required polars version to 0.17.
  • Renamed TensorFlowModelDataset to TensorFlowModelDataSet to be consistent with all other plugins in Kedro-Datasets.

Community contributions

Many thanks to the following Kedroids for contributing PRs to this release:

Release 1.2.0:

Major features and improvements

  • Added fsspec resolution in SparkDataSet to support more filesystems.
  • Added the _preview method to the Pandas ExcelDataSet and CSVDataSet classes.

Bug fixes and other changes

  • Fixed a docstring in the Pandas SQLQueryDataSet as part of the Sphinx revamp on Kedro.

Release 1.1.1:

Bug fixes and other changes

  • Fixed problematic docstrings causing Read the Docs builds on Kedro to fail.

Release 1.1.0:

Major features and improvements

  • Added the following new datasets:
| Type | Description | Location |
| --- | --- | --- |
| polars.CSVDataSet | A CSVDataSet backed by polars, a lightning fast dataframe package built entirely using Rust. | kedro_datasets.polars |
| snowflake.SnowparkTableDataSet | Work with Snowpark DataFrames from tables in Snowflake. | kedro_datasets.snowflake |

Bug fixes and other changes

  • Added mssql backend to SQLQueryDataSet using the pyodbc library.
  • Added a warning when the user tries to use SparkDataSet on Databricks without specifying a file path with the /dbfs/ prefix.

Release 1.0.2:

Bug fixes and other changes

  • Replaced references to the kedro.pipeline.Pipeline object throughout the test suite with the kedro.modular_pipeline.pipeline factory.
  • Relaxed PyArrow range in line with pandas.
  • Fixed outdated links to the dill package documentation.

Release 1.0.1:

Bug fixes and other changes

  • Fixed docstring formatting in VideoDataSet that was causing the documentation builds to fail.

Release 1.0.0:

First official release of Kedro-Datasets.

Datasets are Kedro’s way of dealing with input and output in a data and machine-learning pipeline. Kedro supports numerous datasets out of the box to allow you to process different data formats including Pandas, Plotly, Spark and more.

The datasets have always been part of the core Kedro Framework project inside kedro.extras. In Kedro 0.19.0, we will remove datasets from Kedro to reduce breaking changes associated with dataset dependencies. Users will then need to use the datasets from the kedro-datasets repository instead.

Major features and improvements

  • Changed pandas.ParquetDataSet to load data using pandas instead of parquet.

Release 0.1.0:

The initial release of Kedro-Datasets.

Thanks to our main contributors

We are also grateful to everyone who advised and supported us, filed issues or helped resolve them, asked and answered questions, and took part in inspiring discussions.