
feat(datasets): Add NetCDFDataSet class #360

Merged
merged 68 commits into kedro-org:main on Feb 28, 2024

Conversation

riley-brady
Contributor

@riley-brady riley-brady commented Sep 29, 2023

Description

There's a large community in geoscience, astrophysics, and beyond that leverages NetCDF for datasets stored with structured coordinates and metadata. A massive trove of existing climate and weather data is stored as NetCDF (.nc) files, for example.

See #165.

Development notes

I'd like to implement this with file syncing for load-from-remote, since that's the most straightforward approach. A future PR could use kerchunk to allow direct loading from remote storage. kerchunk is a really nice toolkit, but it requires managing a lot of generated JSON metadata files, which can sometimes be quite slow to produce. Implementing it nicely will take some tweaking, since the first run would need to generate and cache/store all of the reference JSONs so that future loads are much, much faster.
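A minimal sketch of that generate-once-then-cache flow, assuming the expensive step is delegated to a callable (in practice a kerchunk call such as `SingleHdf5ToZarr(...).translate()`); all names here are illustrative, not the PR's API:

```python
import json
from pathlib import Path
from typing import Callable


def load_or_build_refs(
    nc_uri: str, cache_dir: str, build_refs: Callable[[str], dict]
) -> dict:
    """Return the kerchunk-style reference JSON for ``nc_uri``.

    On first access, ``build_refs`` (the slow kerchunk step) is called and
    its output cached to disk; later calls read the cached JSON directly,
    making subsequent loads much faster.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    ref_file = cache / (Path(nc_uri).name + ".refs.json")
    if ref_file.exists():
        return json.loads(ref_file.read_text())
    refs = build_refs(nc_uri)  # expensive: scans the remote NetCDF file
    ref_file.write_text(json.dumps(refs))
    return refs
```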

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@riley-brady riley-brady reopened this Sep 29, 2023
@riley-brady riley-brady marked this pull request as draft September 29, 2023 19:06
@riley-brady riley-brady changed the title from "Add NetCDF and Zarr datasets" to "Add NetCDF dataset" Oct 2, 2023
@riley-brady riley-brady marked this pull request as ready for review October 2, 2023 21:13
log.info("Syncing remote NetCDF file to local storage.")

# `get_filepath_str` drops remote protocol prefix.
load_path = self._protocol + "://" + load_path
Contributor Author

Tested with AWS and GCS, but I'm not sure if this is generalized enough. `get_filepath_str` consistently drops the protocol prefix from object-storage URIs, which is problematic for the subsequent `fs.get`.
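For context, the fix under discussion amounts to re-attaching the protocol that `get_filepath_str` strips before handing the path to `fs.get`; a standalone sketch of that logic (function name illustrative, not the PR's API):

```python
def restore_protocol(protocol: str, stripped_path: str) -> str:
    """Re-attach the protocol prefix that ``get_filepath_str`` drops.

    Local filesystem paths are left untouched; remote protocols (s3, gcs,
    abfs, ...) are re-prefixed so fsspec can route the call correctly.
    """
    if protocol in ("", "file"):
        return stripped_path
    return f"{protocol}://{stripped_path}"
```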

Member

Yeah, filepath mangling in fsspec is tricky... we have a related open issue about this: kedro-org/kedro#3196

Member

@merelcht merelcht left a comment

Thank you for opening this PR as a contribution @riley-brady! I had a quick look and left some minor comments. The main one is that we've now renamed all our datasets to end in `Dataset` instead of `DataSet`, so that's something that needs to be changed here.

Would it be helpful if I post in our Slack channel to see if there's more interest from the community for this dataset?

kedro-datasets/kedro_datasets/netcdf/netcdf_dataset.py (3 review threads, outdated and resolved)
@merelcht merelcht mentioned this pull request Oct 11, 2023
@riley-brady
Contributor Author

Would it be helpful if I post in our Slack channel to see if there's more interest from the community for this dataset?

Thanks so much @merelcht! There's a lot of interest at #165, and I've been on Slack with @astrojuanlu. It would be great if you posted on Slack as well.

I have incorporated your feedback and finished and tested the PR locally. I'm going to add unit tests now, and then it should be ready for a thorough review. I'll post an update on that issue thread when the tests are added.

@riley-brady riley-brady mentioned this pull request Oct 13, 2023
@riley-brady
Contributor Author

@merelcht the PR is fully implemented with testing and is ready for final review.

riley-brady and others added 17 commits October 15, 2023 12:21
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
* feat(datasets): create custom `DeprecationWarning`
* feat(datasets): use the custom deprecation warning
* chore(datasets): show Kedro's deprecation warnings
* fix(datasets): remove unused imports in test files

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Riley Brady <riley_brady@mckinsey.com>
@astrojuanlu
Member

I spent a decent chunk of time today trying to understand what's breaking in the credentials test, and I still don't know what's going on. The test is literally copy-pasted from test_parquet_dataset.py, and yet it fails.

It's not at all clear what the test is doing, though. The "Failed while loading data from data set NetCDFDataset" message hides a rather puzzling error, which can be seen by enabling the logs:

$ pushd kedro-datasets && pytest -s --log-cli-level=DEBUG -vvv -k "pass_credentials and netcdf" --pdb; popd
...
DEBUG    s3fs:core.py:337 CALL: get_object - () - {'Bucket': 'test_bucket', 'Key': 'test.nc'}
DEBUG    Event before-parameter-build.s3.GetObject: calling handler <function sse_md5 at 0x136c36670>
...
DEBUG    botocore.parsers:parsers.py:239 Response headers: HTTPHeaderDict({'x-amz-request-id': 'C36M0ESFJX55TN4N', 'x-amz-id-2': 'b4c9yMZKWWSis5iwwE8suoOo3o5jO6LF/xxDnZqGFZwOinzNj+N+TIBHQQ3DaAc1Z6YkwPOo83I=', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'date': 'Thu, 15 Feb 2024 13:42:42 GMT', 'server': 'AmazonS3'})
DEBUG    botocore.parsers:parsers.py:240 Response body:
<coroutine object AioAWSResponse._content_prop at 0x2bdbd1f40>
DEBUG    s3fs:core.py:125 Nonretryable error: a bytes-like object is required, not 'coroutine'
FAILED

At what point in the stack the `create_client` method is supposed to be called, I have no idea.

Other datasets take a different approach and instead create a fake bucket; see, for example, the Spark and Video datasets:

import boto3
import pytest
from moto import mock_s3

# AWS_CREDENTIALS and S3_BUCKET_NAME are module-level constants in the test file.


@pytest.fixture
def mocked_s3_bucket():
    """Create a bucket for testing using moto."""
    with mock_s3():
        conn = boto3.client(
            "s3",
            region_name="us-east-1",
            aws_access_key_id=AWS_CREDENTIALS["key"],
            aws_secret_access_key=AWS_CREDENTIALS["secret"],
        )
        conn.create_bucket(Bucket=S3_BUCKET_NAME)
        yield conn

This PR has been around for 4 months and is almost there. Is there any reasonable way we can take over from @riley-brady and get it merged?

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
@astrojuanlu
Member

That test is literally copy-pasted from test_parquet_dataset.py, and yet it fails.

Well, and now I realize that it was marked as xfail in #463 and #459 after @merelcht spent a week trying to understand what to do with it. We just fell into the same trap again...

@ankatiyar did a fair share of debugging last week as well.

I'm marking the one on NetCDF as xfail too.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
@astrojuanlu
Member

Yeah, the tests are broken. I'm taking this to another branch to do some debugging.

@riley-brady
Contributor Author

Thanks for all the work on this, everyone.

@galenseilis
Contributor

Thanks for all the work on this, everyone.

I'd like to echo that sentiment. As a PyMC user, I often produce NetCDF files from xarray objects. I really appreciate the effort that has gone into this so far, and I'm looking forward to PyMC integrating better with Kedro. :)

@noklam
Contributor

noklam commented Feb 26, 2024

@galenseilis what would you like to see integrated better with kedro particularly?

@galenseilis
Contributor

@galenseilis what would you like to see integrated better with kedro particularly?

I just meant the existence of NetCDFDataset. NetCDF is the conventional choice for saving PyMC modelling results, so I'm excited about that! 🤩

@merelcht
Member

For transparency on why this PR isn't merged yet: there are some issues with the tests; they seem to just hang, and it's not clear why. @ankatiyar is currently investigating and looking into possibly rewriting the tests. Thank you so much for your work and patience @riley-brady, the Kedro team will make sure this gets merged asap.

ankatiyar and others added 2 commits February 28, 2024 15:43
Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
@ankatiyar
Contributor

I've made some changes and skipped the tests that load the dataset from S3, because those were getting stuck.
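Skipping those tests amounts to a plain pytest marker; a minimal sketch, with the test name hypothetical:

```python
import pytest


@pytest.mark.skip(reason="Loading the dataset from mocked S3 hangs; see discussion above")
def test_load_s3_versioned():
    """Placeholder for one of the S3-backed load tests that gets stuck."""
```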

- # `get_filepath_str` drops remote protocol prefix.
- save_path = self._protocol + "://" + save_path
+ save_path = self._filepath
Member

I love that this logic is simpler now 💯

Member

@astrojuanlu astrojuanlu left a comment

Left only one question @ankatiyar; if it's okay, proceed and merge.

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
@astrojuanlu astrojuanlu enabled auto-merge (squash) February 28, 2024 16:22
@astrojuanlu astrojuanlu merged commit 16dfbdd into kedro-org:main Feb 28, 2024
12 checks passed
@riley-brady
Contributor Author

Thanks everyone!!

tgoelles pushed a commit to tgoelles/kedro-plugins that referenced this pull request Jun 6, 2024
* initialize template and early additions
* add placeholder for remote file system load
* switch to versioned dataset
* add initial remote -> local get for S3
* further generalize remote retrieval
* add in credentials
* make temppath optional for remote datasets
* add initial idea for multifile glob
* style: Introduce `ruff` for linting in all plugins. (kedro-org#354)
* add suggested style changes
* add temppath to attributes
* more temppath fixes
* more temppath updates
* add better tempfile deletion and work on saving files
* make __del__ flexible
* formatting
* feat(datasets): create custom `DeprecationWarning` (kedro-org#356)
* docs(datasets): add note about DataSet deprecation (kedro-org#357)
* test(datasets): skip `tensorflow` tests on Windows (kedro-org#363)
* ci: Pin `tables` version (kedro-org#370)
* build(datasets): Release `1.7.1` (kedro-org#378)
* docs: Update CONTRIBUTING.md and add one for `kedro-datasets` (kedro-org#379)
* ci(datasets): Run tensorflow tests separately from other dataset tests (kedro-org#377)
* feat: Kedro-Airflow convert all pipelines option (kedro-org#335)
* docs(datasets): blacken code in rst literal blocks (kedro-org#362)
* docs: cloudpickle is an interesting extension of the pickle functionality (kedro-org#361)
* fix(datasets): Fix secret scan entropy error (kedro-org#383)
* style: Rename mentions of `DataSet` to `Dataset` in `kedro-airflow` and `kedro-telemetry` (kedro-org#384)
* feat(datasets): Migrated `PartitionedDataSet` and `IncrementalDataSet` from main repository to kedro-datasets (kedro-org#253)
* fix: backwards compatibility for `kedro-airflow` (kedro-org#381)
* fix(datasets): Don't warn for SparkDataset on Databricks when using s3 (kedro-org#341)
* update docs API and release notes
* add netcdf requirements to setup
* lint
* add initial tests
* update dataset exists for multifile
* Add full test suite for NetCDFDataSet
* Add docstring examples
* change xarray version req
* update dask req
* rename DataSet -> Dataset
* Update xarray reqs for earlier python versions
* fix setup
* update test coverage
* exclude init from test coverage
* Sub in pathlib for os.remove
* add metadata to dataset
* add doctest for the new datasets
* add patch for supporting http/https
* Small fixes post-merge
* Lint
* Fix import
* Un-ignore NetCDF doctest
* Add fixture
* Mark problematic test as xfail
* Skip problematic test instead of making it fail
* Skip problematic tests and fix failing tests
* Remove comment

---------

Signed-off-by: Riley Brady <riley_brady@mckinsey.com>
Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Simon Brugman <sfbbrugman@gmail.com>
Signed-off-by: H. Felix Wittmann <hfwittmann@gmail.com>
Signed-off-by: Peter Bludau <ptrbld.dev@gmail.com>
Signed-off-by: Alistair McKelvie <alistair.mckelvie@gmail.com>
Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Nok <nok.lam.chan@quantumblack.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Co-authored-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com>
Co-authored-by: Simon Brugman <sbrugman@users.noreply.github.com>
Co-authored-by: Felix Wittmann <hfwittmann@users.noreply.github.com>
Co-authored-by: PtrBld <7523956+PtrBld@users.noreply.github.com>
Co-authored-by: Merel Theisen <merel.theisen@quantumblack.com>
Co-authored-by: Alistair McKelvie <alistair.mckelvie@gmail.com>
Co-authored-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Co-authored-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: tgoelles <thomas.goelles@gmail.com>