perf(datasets): lazily load datasets in init files (#277)
* perf(datasets): lazily load datasets in init files (api)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (pandas)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* fix(datasets): fix no name in module in api/pandas

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (biosequence)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (dask)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (databricks)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (email)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (geopandas)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (holoviews)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (json)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* fix(datasets): resolve "too few public attributes"

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (matplotlib)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (networkx)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (pickle)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (pillow)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (plotly)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (polars)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (redis)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (snowflake)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (spark)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (svmlight)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (tensorflow)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (text)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (tracking)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (video)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* perf(datasets): lazily load datasets in init files (yaml)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* Update RELEASE.md

---------

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
deepyaman authored Jul 31, 2023
1 parent fd4b2be commit 3aad425
Showing 28 changed files with 240 additions and 169 deletions.
18 changes: 11 additions & 7 deletions kedro-datasets/RELEASE.md
@@ -1,9 +1,13 @@
# Upcoming Release:

## Major features and improvements
* Added automatic inference of file format for `pillow.ImageDataSet` to be passed to `save()`
* Implemented lazy loading of dataset subpackages and classes.
* Suppose that SQLAlchemy, a Python SQL toolkit, is installed in your Python environment. With this change, the SQLAlchemy library will not be loaded (for `pandas.SQLQueryDataSet` or `pandas.SQLTableDataSet`) if you load a different pandas dataset (e.g. `pandas.CSVDataSet`).
* Added automatic inference of file format for `pillow.ImageDataSet` to be passed to `save()`.

## Bug fixes and other changes
* Improved error messages for missing dataset dependencies.
* Suppose that SQLAlchemy, a Python SQL toolkit, is not installed in your Python environment. Previously, `from kedro_datasets.pandas import SQLQueryDataSet` or `from kedro_datasets.pandas import SQLTableDataSet` would result in `ImportError: cannot import name 'SQLTableDataSet' from 'kedro_datasets.pandas'`. Now, the same imports raise the more helpful and intuitive `ModuleNotFoundError: No module named 'sqlalchemy'`.

## Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
@@ -12,7 +16,7 @@ Many thanks to the following Kedroids for contributing PRs to this release:

# Release 1.4.2
## Bug fixes and other changes
* Fixed documentations of `GeoJSONDataSet` and `SparkStreamingDataSet`
* Fixed documentations of `GeoJSONDataSet` and `SparkStreamingDataSet`.
* Fixed problematic docstrings causing Read the Docs builds on Kedro to fail.

# Release 1.4.1:
@@ -32,16 +36,16 @@ Many thanks to the following Kedroids for contributing PRs to this release:
## Major features and improvements
* Added pandas 2.0 support.
* Added SQLAlchemy 2.0 support (and dropped support for versions below 1.4).
* Added a save method to the APIDataSet
* Added a save method to `APIDataSet`.
* Reduced constructor arguments for `APIDataSet` by replacing most arguments with a single constructor argument `load_args`. This makes it more consistent with other Kedro DataSets and the underlying `requests` API, and automatically enables the full configuration domain: stream, certificates, proxies, and more.
* Relaxed Kedro version pin to `>=0.16`
* Relaxed Kedro version pin to `>=0.16`.
* Added `metadata` attribute to all existing datasets. This is ignored by Kedro, but may be consumed by users or external plugins.
* Added `ManagedTableDataSet` for managed delta tables on Databricks.

## Bug fixes and other changes
* Relaxed `delta-spark` upper bound to allow compatibility with Spark 3.1.x and 3.2.x.
* Upgraded required `polars` version to 0.17.
* Renamed `TensorFlowModelDataset` to `TensorFlowModelDataSet` to be consistent with all other plugins in kedro-datasets.
* Renamed `TensorFlowModelDataset` to `TensorFlowModelDataSet` to be consistent with all other plugins in Kedro-Datasets.

## Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
@@ -102,11 +106,11 @@ Datasets are Kedro’s way of dealing with input and output in a data and machin
The datasets have always been part of the core Kedro Framework project inside `kedro.extras`. In Kedro `0.19.0`, we will remove datasets from Kedro to reduce breaking changes associated with dataset dependencies. Instead, users will need to use the datasets from the `kedro-datasets` repository.

## Major features and improvements
* Changed `pandas.ParquetDataSet` to load data using pandas instead of parquet
* Changed `pandas.ParquetDataSet` to load data using pandas instead of parquet.

# Release 0.1.0:

The initial release of `kedro-datasets`.
The initial release of Kedro-Datasets.

## Thanks to our main contributors

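The `ModuleNotFoundError` behaviour described in the release notes falls out of PEP 562 module-level `__getattr__`: the import of the backing submodule is deferred until attribute access, so a missing optional dependency surfaces there with its own error. A minimal stdlib-only sketch (illustrative names, not the actual `lazy_loader` implementation):

```python
import importlib


def lazy_getattr(package_name, attr_to_submod):
    """Build a PEP 562 module-level __getattr__ that imports the backing
    submodule only when the attribute is first requested."""

    def __getattr__(name):
        if name not in attr_to_submod:
            raise AttributeError(
                f"module {package_name!r} has no attribute {name!r}"
            )
        # The import is deferred to this point, so a missing optional
        # dependency raises ModuleNotFoundError at access time instead of
        # ImportError when the package itself is imported.
        submodule = importlib.import_module(
            f"{package_name}.{attr_to_submod[name]}"
        )
        return getattr(submodule, name)

    return __getattr__
```

With this hook in place, accessing an attribute whose submodule (or its dependency) is absent raises `ModuleNotFoundError` naming the missing module, which is the more helpful message the release notes call out.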
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/api/__init__.py
@@ -2,10 +2,13 @@
and returns them either as a string or as a JSON dict.
It uses the Python requests library: https://requests.readthedocs.io/en/latest/
"""
from typing import Any

__all__ = ["APIDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
APIDataSet: Any

with suppress(ImportError):
from .api_dataset import APIDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"api_dataset": ["APIDataSet"]}
)
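`lazy.attach` returns the three module hooks (`__getattr__`, `__dir__`, `__all__`) that the diff assigns at module level. In rough terms it behaves like the following simplified stand-in (a sketch, not the real `lazy_loader` source):

```python
import importlib


def attach(package_name, submod_attrs):
    """Simplified stand-in for lazy_loader.attach: return the
    (__getattr__, __dir__, __all__) triple, deferring each submodule
    import until its attribute is first accessed."""
    # Invert {submodule: [attrs]} into {attr: submodule}.
    attr_to_mod = {
        attr: mod for mod, attrs in submod_attrs.items() for attr in attrs
    }

    def __getattr__(name):
        if name in attr_to_mod:
            submodule = importlib.import_module(
                f"{package_name}.{attr_to_mod[name]}"
            )
            return getattr(submodule, name)
        raise AttributeError(
            f"module {package_name!r} has no attribute {name!r}"
        )

    def __dir__():
        return sorted(attr_to_mod)

    return __getattr__, __dir__, sorted(attr_to_mod)
```

Assigning the triple to module globals, as the diff does, keeps `from kedro_datasets.api import APIDataSet` working while postponing the import of `api_dataset` (and its third-party dependencies) until `APIDataSet` is actually touched.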
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/biosequence/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementation to read/write from/to a sequence file."""
from typing import Any

__all__ = ["BioSequenceDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
BioSequenceDataSet: Any

with suppress(ImportError):
from .biosequence_dataset import BioSequenceDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"biosequence_dataset": ["BioSequenceDataSet"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/dask/__init__.py
@@ -1,8 +1,11 @@
"""Provides I/O modules using dask dataframe."""
from typing import Any

__all__ = ["ParquetDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
ParquetDataSet: Any

with suppress(ImportError):
from .parquet_dataset import ParquetDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"parquet_dataset": ["ParquetDataSet"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/databricks/__init__.py
@@ -1,8 +1,11 @@
"""Provides interface to Unity Catalog Tables."""
from typing import Any

__all__ = ["ManagedTableDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
ManagedTableDataSet: Any

with suppress(ImportError):
from .managed_table_dataset import ManagedTableDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"managed_table_dataset": ["ManagedTableDataSet"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/email/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementations for managing email messages."""
from typing import Any

__all__ = ["EmailMessageDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
EmailMessageDataSet: Any

with suppress(ImportError):
from .message_dataset import EmailMessageDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"message_dataset": ["EmailMessageDataSet"]}
)
15 changes: 9 additions & 6 deletions kedro-datasets/kedro_datasets/geopandas/__init__.py
@@ -1,8 +1,11 @@
"""``GeoJSONDataSet`` is an ``AbstractVersionedDataSet`` to save and load GeoJSON files.
"""
__all__ = ["GeoJSONDataSet"]
"""``GeoJSONDataSet`` is an ``AbstractVersionedDataSet`` to save and load GeoJSON files."""
from typing import Any

from contextlib import suppress
import lazy_loader as lazy

with suppress(ImportError):
from .geojson_dataset import GeoJSONDataSet
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
GeoJSONDataSet: Any

__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"geojson_dataset": ["GeoJSONDataSet"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/holoviews/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementation to save Holoviews objects as image files."""
from typing import Any

__all__ = ["HoloviewsWriter"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
HoloviewsWriter: Any

with suppress(ImportError):
from .holoviews_writer import HoloviewsWriter
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"holoviews_writer": ["HoloviewsWriter"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/json/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementation to load/save data from/to a JSON file."""
from typing import Any

__all__ = ["JSONDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
JSONDataSet: Any

with suppress(ImportError):
from .json_dataset import JSONDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"json_dataset": ["JSONDataSet"]}
)
10 changes: 6 additions & 4 deletions kedro-datasets/kedro_datasets/matplotlib/__init__.py
@@ -1,8 +1,10 @@
"""``AbstractDataSet`` implementation to save matplotlib objects as image files."""
from typing import Any

__all__ = ["MatplotlibWriter"]
import lazy_loader as lazy

from contextlib import suppress
MatplotlibWriter: Any

with suppress(ImportError):
from .matplotlib_writer import MatplotlibWriter
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"matplotlib_writer": ["MatplotlibWriter"]}
)
28 changes: 16 additions & 12 deletions kedro-datasets/kedro_datasets/networkx/__init__.py
@@ -1,15 +1,19 @@
"""``AbstractDataSet`` implementation to save and load NetworkX graphs in JSON
, GraphML and GML formats using ``NetworkX``."""
"""``AbstractDataSet`` implementation to save and load NetworkX graphs in JSON,
GraphML and GML formats using ``NetworkX``."""
from typing import Any

__all__ = ["GMLDataSet", "GraphMLDataSet", "JSONDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
GMLDataSet: Any
GraphMLDataSet: Any
JSONDataSet: Any

with suppress(ImportError):
from .gml_dataset import GMLDataSet

with suppress(ImportError):
from .graphml_dataset import GraphMLDataSet

with suppress(ImportError):
from .json_dataset import JSONDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__,
submod_attrs={
"gml_dataset": ["GMLDataSet"],
"graphml_dataset": ["GraphMLDataSet"],
"json_dataset": ["JSONDataSet"],
},
)
70 changes: 32 additions & 38 deletions kedro-datasets/kedro_datasets/pandas/__init__.py
@@ -1,42 +1,36 @@
"""``AbstractDataSet`` implementations that produce pandas DataFrames."""
from typing import Any

__all__ = [
"CSVDataSet",
"DeltaTableDataSet",
"ExcelDataSet",
"FeatherDataSet",
"GBQTableDataSet",
"GBQQueryDataSet",
"HDFDataSet",
"JSONDataSet",
"ParquetDataSet",
"SQLQueryDataSet",
"SQLTableDataSet",
"XMLDataSet",
"GenericDataSet",
]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
CSVDataSet: Any
DeltaTableDataSet: Any
ExcelDataSet: Any
FeatherDataSet: Any
GBQQueryDataSet: Any
GBQTableDataSet: Any
GenericDataSet: Any
HDFDataSet: Any
JSONDataSet: Any
ParquetDataSet: Any
SQLQueryDataSet: Any
SQLTableDataSet: Any
XMLDataSet: Any

with suppress(ImportError):
from .csv_dataset import CSVDataSet
with suppress(ImportError):
from .deltatable_dataset import DeltaTableDataSet
with suppress(ImportError):
from .excel_dataset import ExcelDataSet
with suppress(ImportError):
from .feather_dataset import FeatherDataSet
with suppress(ImportError):
from .gbq_dataset import GBQQueryDataSet, GBQTableDataSet
with suppress(ImportError):
from .hdf_dataset import HDFDataSet
with suppress(ImportError):
from .json_dataset import JSONDataSet
with suppress(ImportError):
from .parquet_dataset import ParquetDataSet
with suppress(ImportError):
from .sql_dataset import SQLQueryDataSet, SQLTableDataSet
with suppress(ImportError):
from .xml_dataset import XMLDataSet
with suppress(ImportError):
from .generic_dataset import GenericDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__,
submod_attrs={
"csv_dataset": ["CSVDataSet"],
"deltatable_dataset": ["DeltaTableDataSet"],
"excel_dataset": ["ExcelDataSet"],
"feather_dataset": ["FeatherDataSet"],
"gbq_dataset": ["GBQQueryDataSet", "GBQTableDataSet"],
"generic_dataset": ["GenericDataSet"],
"hdf_dataset": ["HDFDataSet"],
"json_dataset": ["JSONDataSet"],
"parquet_dataset": ["ParquetDataSet"],
"sql_dataset": ["SQLQueryDataSet", "SQLTableDataSet"],
"xml_dataset": ["XMLDataSet"],
},
)
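The payoff of the pandas change is observable in `sys.modules`: previously, importing `kedro_datasets.pandas` eagerly imported every installed backend; with the lazy hooks, a submodule is loaded only once one of its attributes is accessed. A small stdlib illustration of the same deferral, with `logging.handlers` standing in for a heavy optional backend such as SQLAlchemy:

```python
import importlib
import sys


def deferred(module_name):
    """Return a zero-argument loader that imports module_name on first call."""

    def load():
        return importlib.import_module(module_name)

    return load


# Building the loader imports nothing; the import happens inside load().
load_backend = deferred("logging.handlers")
module = load_backend()  # import occurs here, on first use
assert "logging.handlers" in sys.modules
```

The same principle is why, after this commit, loading `pandas.CSVDataSet` no longer drags in the SQL backends' dependencies.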
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/pickle/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementation to load/save data from/to a Pickle file."""
from typing import Any

__all__ = ["PickleDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
PickleDataSet: Any

with suppress(ImportError):
from .pickle_dataset import PickleDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"pickle_dataset": ["PickleDataSet"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/pillow/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementation to load/save image data."""
from typing import Any

__all__ = ["ImageDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
ImageDataSet: Any

with suppress(ImportError):
from .image_dataset import ImageDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"image_dataset": ["ImageDataSet"]}
)
15 changes: 9 additions & 6 deletions kedro-datasets/kedro_datasets/plotly/__init__.py
@@ -1,11 +1,14 @@
"""``AbstractDataSet`` implementations to load/save a plotly figure from/to a JSON
file."""
from typing import Any

__all__ = ["PlotlyDataSet", "JSONDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
JSONDataSet: Any
PlotlyDataSet: Any

with suppress(ImportError):
from .plotly_dataset import PlotlyDataSet
with suppress(ImportError):
from .json_dataset import JSONDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__,
submod_attrs={"json_dataset": ["JSONDataSet"], "plotly_dataset": ["PlotlyDataSet"]},
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/polars/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementations that produce pandas DataFrames."""
from typing import Any

__all__ = ["CSVDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
CSVDataSet: Any

with suppress(ImportError):
from .csv_dataset import CSVDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"csv_dataset": ["CSVDataSet"]}
)
11 changes: 7 additions & 4 deletions kedro-datasets/kedro_datasets/redis/__init__.py
@@ -1,8 +1,11 @@
"""``AbstractDataSet`` implementation to load/save data from/to a redis db."""
from typing import Any

__all__ = ["PickleDataSet"]
import lazy_loader as lazy

from contextlib import suppress
# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
PickleDataSet: Any

with suppress(ImportError):
from .redis_dataset import PickleDataSet
__getattr__, __dir__, __all__ = lazy.attach(
__name__, submod_attrs={"redis_dataset": ["PickleDataSet"]}
)