feat(datasets): Add NetCDFDataSet class (#360)
* initialize template and early additions

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* add placeholder for remote file system load

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* switch to versioned dataset

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* add initial remote -> local get for S3

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* further generalize remote retrieval

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* add in credentials

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* make temppath optional for remote datasets

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* add initial idea for multifile glob

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* style: Introduce `ruff` for linting in all plugins. (#354)

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* add suggested style changes

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* add temppath to attributes

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* more temppath fixes

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* more temppath updates

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* add better tempfile deletion and work on saving files

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* make __del__ flexible

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* formatting

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* feat(datasets): create custom `DeprecationWarning` (#356)

* feat(datasets): create custom `DeprecationWarning`

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* feat(datasets): use the custom deprecation warning

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): show Kedro's deprecation warnings

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* fix(datasets): remove unused imports in test files

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

---------

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* docs(datasets): add note about DataSet deprecation (#357)

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* test(datasets): skip `tensorflow` tests on Windows (#363)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* ci: Pin `tables` version (#370)

* Pin tables version

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Also fix kedro-airflow

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Revert trying to fix airflow

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

---------

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* build(datasets): Release `1.7.1` (#378)

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* docs: Update CONTRIBUTING.md and add one for `kedro-datasets` (#379)

Update CONTRIBUTING.md + add one for kedro-datasets

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* ci(datasets): Run tensorflow tests separately from other dataset tests (#377)

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* feat: Kedro-Airflow convert all pipelines option (#335)

* feat: kedro airflow convert --all option

Signed-off-by: Simon Brugman <sfbbrugman@gmail.com>

* docs: release docs

Signed-off-by: Simon Brugman <sfbbrugman@gmail.com>

---------

Signed-off-by: Simon Brugman <sfbbrugman@gmail.com>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* docs(datasets): blacken code in rst literal blocks (#362)

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* docs: cloudpickle is an interesting extension of the pickle functionality (#361)

Signed-off-by: H. Felix Wittmann <hfwittmann@gmail.com>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* fix(datasets): Fix secret scan entropy error (#383)

Fix secret scan entropy error

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* style: Rename mentions of `DataSet` to `Dataset` in `kedro-airflow` and `kedro-telemetry` (#384)

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* feat(datasets): Migrated `PartitionedDataSet` and `IncrementalDataSet` from main repository to kedro-datasets (#253)

Signed-off-by: Peter Bludau <ptrbld.dev@gmail.com>
Co-authored-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* fix: backwards compatibility for `kedro-airflow` (#381)

Signed-off-by: Simon Brugman <sfbbrugman@gmail.com>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* fix(datasets): Don't warn for SparkDataset on Databricks when using s3 (#341)

Signed-off-by: Alistair McKelvie <alistair.mckelvie@gmail.com>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* update docs API and release notes

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* add netcdf requirements to setup

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* lint

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* add initial tests

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* update dataset exists for multifile

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* Add full test suite for NetCDFDataSet

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* Add docstring examples

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* change xarray version req

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* update dask req

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* rename DataSet -> Dataset

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* Update xarray reqs for earlier python versions

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* fix setup

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* update test coverage

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* exclude init from test coverage

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* Sub in pathlib for os.remove

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* add metadata to dataset

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* add doctest for the new datasets

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>

* add patch for supporting http/https

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>

* Small fixes post-merge

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Lint

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Fix import

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Un-ignore NetCDF doctest

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Add fixture

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Mark problematic test as xfail

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Skip problematic test instead of making it fail

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Skip problematic tests and fix failing tests

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Remove comment

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

---------

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Simon Brugman <sfbbrugman@gmail.com>
Signed-off-by: H. Felix Wittmann <hfwittmann@gmail.com>
Signed-off-by: Peter Bludau <ptrbld.dev@gmail.com>
Signed-off-by: Alistair McKelvie <alistair.mckelvie@gmail.com>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok <nok.lam.chan@quantumblack.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Co-authored-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
Co-authored-by: Simon Brugman <sbrugman@users.noreply.github.com>
Co-authored-by: Felix Wittmann <hfwittmann@users.noreply.github.com>
Co-authored-by: PtrBld <7523956+PtrBld@users.noreply.github.com>
Co-authored-by: Merel Theisen <merel.theisen@quantumblack.com>
Co-authored-by: Alistair McKelvie <alistair.mckelvie@gmail.com>
Co-authored-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Co-authored-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
12 people authored Feb 28, 2024
1 parent db92892 commit 16dfbdd
Showing 8 changed files with 523 additions and 1 deletion.
4 changes: 4 additions & 0 deletions kedro-datasets/RELEASE.md
@@ -1,8 +1,12 @@
# Upcoming Release
## Major features and improvements
* Added `NetCDFDataset` for loading and saving `*.nc` files.

## Bug fixes and other changes
## Community contributions
Many thanks to the following Kedroids for contributing PRs to this release:
* [Riley Brady](https://github.com/riley-brady)


# Release 2.1.0
## Major features and improvements
1 change: 1 addition & 0 deletions kedro-datasets/docs/source/api/kedro_datasets.rst
@@ -23,6 +23,7 @@ kedro_datasets
   kedro_datasets.json.JSONDataset
   kedro_datasets.matlab.MatlabDataset
   kedro_datasets.matplotlib.MatplotlibWriter
   kedro_datasets.netcdf.NetCDFDataset
   kedro_datasets.networkx.GMLDataset
   kedro_datasets.networkx.GraphMLDataset
   kedro_datasets.networkx.JSONDataset
14 changes: 14 additions & 0 deletions kedro-datasets/kedro_datasets/netcdf/__init__.py
@@ -0,0 +1,14 @@
"""``NetCDFDataset`` is an ``AbstractDataset`` to save and load NetCDF files."""
from __future__ import annotations

from typing import Any

import lazy_loader as lazy

# https://github.com/pylint-dev/pylint/issues/4300#issuecomment-1043601901
NetCDFDataset: type[NetCDFDataset]
NetCDFDataset: Any

__getattr__, __dir__, __all__ = lazy.attach(
    __name__, submod_attrs={"netcdf_dataset": ["NetCDFDataset"]}
)
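
For context, ``lazy_loader.attach`` wires up the module-level ``__getattr__`` hook from PEP 562, so the ``netcdf_dataset`` submodule (and its heavy ``xarray`` import) is only loaded on first attribute access. A hand-rolled equivalent would look roughly like the sketch below; this is illustrative only and not part of the commit.

import importlib


def __getattr__(name: str):
    """Load ``netcdf_dataset`` lazily, only when the attribute is first accessed."""
    if name == "NetCDFDataset":
        # Resolves to kedro_datasets.netcdf.netcdf_dataset at first use.
        module = importlib.import_module(".netcdf_dataset", __name__)
        return module.NetCDFDataset
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")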
218 changes: 218 additions & 0 deletions kedro-datasets/kedro_datasets/netcdf/netcdf_dataset.py
@@ -0,0 +1,218 @@
"""NetCDFDataset loads and saves data to a local netcdf (.nc) file."""
import logging
from copy import deepcopy
from glob import glob
from pathlib import Path, PurePosixPath
from typing import Any

import fsspec
import xarray as xr
from kedro.io.core import (
    AbstractDataset,
    DatasetError,
    get_protocol_and_path,
)

logger = logging.getLogger(__name__)


class NetCDFDataset(AbstractDataset):
    """``NetCDFDataset`` loads/saves data from/to a NetCDF file using an underlying
    filesystem (e.g.: local, S3, GCS). It uses xarray to handle the NetCDF file.

    Example usage for the
    `YAML API <https://kedro.readthedocs.io/en/stable/data/\
    data_catalog_yaml_examples.html>`_:

    .. code-block:: yaml

        single-file:
          type: netcdf.NetCDFDataset
          filepath: s3://bucket_name/path/to/folder/data.nc
          save_args:
            mode: a
          load_args:
            decode_times: False

        multi-file:
          type: netcdf.NetCDFDataset
          filepath: s3://bucket_name/path/to/folder/data*.nc
          load_args:
            concat_dim: time
            combine: nested
            parallel: True

    Example usage for the
    `Python API <https://kedro.readthedocs.io/en/stable/data/\
    advanced_data_catalog_usage.html>`_:

    .. code-block:: pycon

        >>> from kedro_datasets.netcdf import NetCDFDataset
        >>> import xarray as xr
        >>> ds = xr.DataArray(
        ...     [0, 1, 2], dims=["x"], coords={"x": [0, 1, 2]}, name="data"
        ... ).to_dataset()
        >>> dataset = NetCDFDataset(
        ...     filepath="path/to/folder/data.nc",
        ...     save_args={"mode": "w"},
        ... )
        >>> dataset.save(ds)
        >>> reloaded = dataset.load()
    """

    DEFAULT_LOAD_ARGS: dict[str, Any] = {}
    DEFAULT_SAVE_ARGS: dict[str, Any] = {}

    def __init__(  # noqa
        self,
        *,
        filepath: str,
        temppath: str = None,
        load_args: dict[str, Any] = None,
        save_args: dict[str, Any] = None,
        fs_args: dict[str, Any] = None,
        credentials: dict[str, Any] = None,
        metadata: dict[str, Any] = None,
    ):
        """Creates a new instance of ``NetCDFDataset`` pointing to a concrete NetCDF
        file on a specific filesystem.

        Args:
            filepath: Filepath in POSIX format to a NetCDF file prefixed with a
                protocol like `s3://`. If prefix is not provided, `file` protocol
                (local filesystem) will be used. The prefix should be any protocol
                supported by ``fsspec``. It can also be a glob pattern, in which
                case multiple NetCDF files are read.
            temppath: Local temporary directory, used when reading from remote storage,
                since NetCDF files cannot be directly read from remote storage.
            load_args: Additional options for loading NetCDF file(s).
                All available arguments when reading a single file are listed at:
                https://xarray.pydata.org/en/stable/generated/xarray.open_dataset.html
                All available arguments when reading multiple files are listed at:
                https://xarray.pydata.org/en/stable/generated/xarray.open_mfdataset.html
                All defaults are preserved.
            save_args: Additional saving options for saving NetCDF file(s).
                All available arguments are listed at:
                https://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html
                All defaults are preserved.
            fs_args: Extra arguments to pass into underlying filesystem class
                constructor (e.g. `{"cache_regions": "us-east-1"}` for
                ``s3fs.S3FileSystem``).
            credentials: Credentials required to get access to the underlying filesystem.
                E.g. for ``GCSFileSystem`` it should look like `{"token": None}`.
            metadata: Any arbitrary metadata.
                This is ignored by Kedro, but may be consumed by users or external plugins.
        """
        self._fs_args = deepcopy(fs_args) or {}
        self._credentials = deepcopy(credentials) or {}
        self._temppath = Path(temppath) if temppath is not None else None
        protocol, path = get_protocol_and_path(filepath)
        if protocol == "file":
            self._fs_args.setdefault("auto_mkdir", True)
        elif self._temppath is None:
            raise ValueError(
                "Need to set temppath in catalog if NetCDF file exists on remote "
                + "filesystem"
            )
        self._protocol = protocol
        self._filepath = filepath

        self._storage_options = {**self._credentials, **self._fs_args}
        self._fs = fsspec.filesystem(self._protocol, **self._storage_options)

        self.metadata = metadata

        # Handle default load and save arguments.
        self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
        if load_args is not None:
            self._load_args.update(load_args)
        self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
        if save_args is not None:
            self._save_args.update(save_args)

        # Determine if multiple NetCDF files are being loaded in.
        self._is_multifile = "*" in PurePosixPath(self._filepath).stem

    def _load(self) -> xr.Dataset:
        load_path = self._filepath

        # If NetCDF(s) are on any type of remote storage, need to sync to local to open.
        # Kerchunk could be implemented here in the future for direct remote reading.
        if self._protocol != "file":
            logger.info("Syncing remote NetCDF file to local storage.")

            if self._is_multifile:
                load_path = sorted(self._fs.glob(load_path))

            self._fs.get(load_path, f"{self._temppath}/")
            load_path = f"{self._temppath}/{PurePosixPath(self._filepath).stem}.nc"

        if self._is_multifile:
            data = xr.open_mfdataset(str(load_path), **self._load_args)
        else:
            data = xr.open_dataset(load_path, **self._load_args)

        return data

    def _save(self, data: xr.Dataset):
        if self._is_multifile:
            raise DatasetError(
                "Globbed multifile datasets with '*' in filepath cannot be saved. "
                + "Create an alternate NetCDFDataset with a single .nc output file."
            )

        save_path = self._filepath
        # ``to_netcdf`` called without a target path returns the file contents as bytes.
        bytes_buffer = data.to_netcdf(**self._save_args)

        with self._fs.open(save_path, mode="wb") as fs_file:
            fs_file.write(bytes_buffer)

        self._invalidate_cache()

    def _describe(self) -> dict[str, Any]:
        return dict(
            filepath=self._filepath,
            protocol=self._protocol,
            load_args=self._load_args,
            save_args=self._save_args,
        )

    def _exists(self) -> bool:
        load_path = self._filepath

        if self._is_multifile:
            files = self._fs.glob(load_path)
            exists = bool(files)
        else:
            exists = self._fs.exists(load_path)

        return exists

    def _invalidate_cache(self):
        """Invalidate underlying filesystem caches."""
        self._fs.invalidate_cache(self._filepath)

    def __del__(self):
        """Cleanup temporary directory."""
        if self._temppath is not None:
            logger.info("Deleting local temporary files.")
            temp_filepath = self._temppath / PurePosixPath(self._filepath).stem
            if self._is_multifile:
                temp_files = glob(str(temp_filepath))
                for file in temp_files:
                    try:
                        Path(file).unlink()
                    except FileNotFoundError:  # pragma: no cover
                        pass  # pragma: no cover
            else:
                # Single file: remove the copy synced to ``{temppath}/{stem}.nc`` in ``_load``.
                temp_filepath = f"{self._temppath}/{PurePosixPath(self._filepath).stem}.nc"
                try:
                    Path(temp_filepath).unlink()
                except FileNotFoundError:
                    pass
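
For reference, a minimal local round-trip with the class above might look like the following sketch. It is not part of the commit, and it assumes the ``netcdf`` optional dependencies are installed along with a backend capable of in-memory writes (``to_netcdf()`` with no path returns bytes, which the ``scipy`` or ``h5netcdf`` backends can produce).

import tempfile
from pathlib import Path

import xarray as xr

from kedro_datasets.netcdf import NetCDFDataset

with tempfile.TemporaryDirectory() as tmp_dir:
    ds = xr.Dataset({"data": ("x", [0, 1, 2])}, coords={"x": [0, 1, 2]})
    dataset = NetCDFDataset(filepath=str(Path(tmp_dir) / "data.nc"))
    dataset.save(ds)  # serialized to bytes, then written through fsspec
    reloaded = dataset.load()
    assert reloaded.equals(ds)  # values and coordinates survive the round trip

Because ``filepath`` carries no remote protocol prefix here, the local ``file`` protocol is used and no ``temppath`` is required.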
2 changes: 1 addition & 1 deletion kedro-datasets/pyproject.toml
@@ -32,7 +32,7 @@ version = {attr = "kedro_datasets.__version__"}
fail_under = 100
show_missing = true
# temporarily ignore kedro_datasets/__init__.py in coverage report
-omit = ["tests/*", "kedro_datasets/holoviews/*", "kedro_datasets/snowflake/*", "kedro_datasets/tensorflow/*", "kedro_datasets/__init__.py", "kedro_datasets/conftest.py", "kedro_datasets/databricks/*"]
+omit = ["tests/*", "kedro_datasets/holoviews/*", "kedro_datasets/netcdf/*", "kedro_datasets/snowflake/*", "kedro_datasets/tensorflow/*", "kedro_datasets/__init__.py", "kedro_datasets/conftest.py", "kedro_datasets/databricks/*"]
exclude_lines = ["pragma: no cover", "raise NotImplementedError", "if TYPE_CHECKING:"]

[tool.pytest.ini_options]
10 changes: 10 additions & 0 deletions kedro-datasets/setup.py
@@ -31,6 +31,14 @@ def _collect_requirements(requires):
}
matplotlib_require = {"matplotlib.MatplotlibWriter": ["matplotlib>=3.0.3, <4.0"]}
matlab_require = {"matlab.MatlabDataset": ["scipy"]}
netcdf_require = {
    "netcdf.NetCDFDataset": [
        "h5netcdf>=1.2.0",
        "netcdf4>=1.6.4",
        "xarray<=0.20.2; python_version == '3.7'",
        "xarray>=2023.1.0; python_version >= '3.8'",
    ]
}
networkx_require = {"networkx.NetworkXDataset": ["networkx~=2.4"]}
pandas_require = {
    "pandas.CSVDataset": [PANDAS],
@@ -118,6 +126,7 @@ def _collect_requirements(requires):
    "huggingface": _collect_requirements(huggingface_require),
    "matlab": _collect_requirements(matlab_require),
    "matplotlib": _collect_requirements(matplotlib_require),
    "netcdf": _collect_requirements(netcdf_require),
    "networkx": _collect_requirements(networkx_require),
    "pandas": _collect_requirements(pandas_require),
    "pickle": _collect_requirements(pickle_require),
@@ -235,6 +244,7 @@ def _collect_requirements(requires):
    "tensorflow~=2.0; platform_system != 'Darwin' or platform_machine != 'arm64'",
    "triad>=0.6.7, <1.0",
    "trufflehog~=2.1",
    "xarray>=2023.1.0",
    "xlsxwriter~=1.0",
    # huggingface
    "datasets",
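
With the packaging above, users would typically pull in these optional dependencies via ``pip install "kedro-datasets[netcdf]"``. A quick import check, sketched below and not part of the commit, verifies the backends xarray needs are present; note the ``netcdf4`` distribution installs under the import name ``netCDF4``.

import importlib

# Backends pulled in by the ``netcdf`` extra defined above.
for module_name in ("xarray", "h5netcdf", "netCDF4"):
    importlib.import_module(module_name)  # raises ImportError if the extra is missing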