feat(datasets): Add NetCDFDataSet class #360
Conversation
log.info("Syncing remote NetCDF file to local storage.") | ||
|
||
# `get_filepath_str` drops remote protocol prefix. | ||
load_path = self._protocol + "://" + load_path |
Tested with AWS and GCS, but not sure if this is generalized enough. get_filepath_str consistently drops the prefix of URIs from object storage, which is problematic for the subsequent fs.get.
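For illustration, a hedged sketch of the behaviour being described; the bucket path is made up and the calls are reconstructed from kedro.io.core and fsspec rather than copied from the PR:

from pathlib import PurePosixPath

import fsspec
from kedro.io.core import get_filepath_str, get_protocol_and_path

protocol, path = get_protocol_and_path("s3://my-bucket/data/file.nc")
# protocol == "s3"; path == "my-bucket/data/file.nc"

load_path = get_filepath_str(PurePosixPath(path), protocol)
# For object storage the "s3://" prefix is gone, so load_path is "my-bucket/data/file.nc".
# The PR re-adds the prefix so the subsequent fs.get receives a fully qualified remote path:
load_path = protocol + "://" + load_path

fs = fsspec.filesystem(protocol)
fs.get(load_path, "/tmp/file.nc")  # sync the remote NetCDF file to local storage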
Yeah, filepath mangling in fsspec is tricky... we have a related open issue about this: kedro-org/kedro#3196
Thank you for opening this PR as a contribution @riley-brady! I had a quick look and left some minor comments. The main one being that we've now renamed all our datasets to end in Dataset instead of DataSet, so that's something that needs to be changed here.
Would it be helpful if I posted in our Slack channel to see if there's more interest from the community in this dataset?
Thanks so much @merelcht! There's a lot of interest at #165 and I slacked with @astrojuanlu. It would be great if you posted on Slack as well. I have added your feedback and have finished and tested the PR locally. Going to add unit testing now, and it should be ready for a thorough review. I'll update on that issue thread when the tests are added.
@merelcht the PR is fully implemented with testing and is ready for final review.
I spent a decent chunk of time today trying to understand what's breaking in the credentials test, and I still don't know what's going on. That test is literally copy-pasted from another dataset's tests, but it's not clear at all what the test is actually doing. The "Failed while loading data from data set NetCDFDataset" message is hiding a quite puzzling error, which can be seen by enabling the logs.

At what point in the stack is the error raised? Other datasets take a different approach and instead create a fake bucket, for example the Spark and Video datasets:

@pytest.fixture
def mocked_s3_bucket():
    """Create a bucket for testing using moto."""
    with mock_s3():
        conn = boto3.client(
            "s3",
            region_name="us-east-1",
            aws_access_key_id=AWS_CREDENTIALS["key"],
            aws_secret_access_key=AWS_CREDENTIALS["secret"],
        )
        conn.create_bucket(Bucket=S3_BUCKET_NAME)
        yield conn

This PR has been around for 4 months and is almost there. Is there any reasonable way we can take over from @riley-brady and get it merged?
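As an illustration only, a rough sketch of how a fixture like this could be exercised with the NetCDF dataset. The constants S3_BUCKET_NAME and AWS_CREDENTIALS are assumed from the test module, and the constructor arguments (filepath, temppath, credentials) follow this PR's design, so the exact signature may differ:

import xarray as xr
from kedro_datasets.netcdf import NetCDFDataset

def test_save_and_load_s3(mocked_s3_bucket):
    """Round-trip a small xarray Dataset through the moto-mocked bucket."""
    ds = xr.Dataset({"var": ("x", [1, 2, 3])})
    netcdf_dataset = NetCDFDataset(
        filepath=f"s3://{S3_BUCKET_NAME}/test.nc",
        temppath="/tmp",  # local staging directory for the synced copy
        credentials=AWS_CREDENTIALS,
    )
    netcdf_dataset.save(ds)
    reloaded = netcdf_dataset.load()
    xr.testing.assert_allclose(ds, reloaded)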
Well, and now I realized that it was already marked as xfail elsewhere. @ankatiyar has done a fair share of debugging as well last week. I'm marking the one on NetCDF as xfail too.
Yeah, the tests are broken. I'm taking this to another branch to do some debugging.
Thanks for all the work on this, everyone.
I'd like to echo that sentiment. As a PyMC user I often produce NetCDF files from xarrays. I really appreciate the efforts that have gone into this so far, and I am looking forward to PyMC integrating better with Kedro. :)
@galenseilis what would you like to see integrated better with Kedro in particular?
I just meant the existence of this NetCDF dataset.
For transparency on why this PR isn't merged yet: there are some issues with the tests; they seem to just hang and it's not clear why. @ankatiyar is currently investigating the issue and finding a way to maybe re-write the tests. Thank you so much for your work and patience @riley-brady, the Kedro team will make sure this gets merged in asap.
I've made some changes and skipped the tests that load the dataset from S3, because those are getting stuck.
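As a generic illustration of the markers being discussed (the test names and reasons here are placeholders, not the actual ones in the PR):

import pytest

@pytest.mark.skip(reason="Loading from mocked S3 hangs in CI; needs further investigation")
def test_load_from_s3():
    ...

@pytest.mark.xfail(reason="Known credentials issue, see discussion above")
def test_load_with_credentials():
    ...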
# `get_filepath_str` drops remote protocol prefix.
save_path = self._protocol + "://" + save_path

save_path = self._filepath
I love that this logic is simpler now 💯
Left only one question @ankatiyar; if it's okay, proceed and merge.
Thanks everyone!!
* initialize template and early additions
* add placeholder for remote file system load
* switch to versioned dataset
* add initial remote -> local get for S3
* further generalize remote retrieval
* add in credentials
* make temppath optional for remote datasets
* add initial idea for multifile glob
* style: Introduce `ruff` for linting in all plugins. (kedro-org#354)
* add suggested style changes
* add temppath to attributes
* more temppath fixes
* more temppath updates
* add better tempfile deletion and work on saving files
* make __del__ flexible
* formatting
* feat(datasets): create custom `DeprecationWarning` (kedro-org#356)
* docs(datasets): add note about DataSet deprecation (kedro-org#357)
* test(datasets): skip `tensorflow` tests on Windows (kedro-org#363)
* ci: Pin `tables` version (kedro-org#370)
* build(datasets): Release `1.7.1` (kedro-org#378)
* docs: Update CONTRIBUTING.md and add one for `kedro-datasets` (kedro-org#379)
* ci(datasets): Run tensorflow tests separately from other dataset tests (kedro-org#377)
* feat: Kedro-Airflow convert all pipelines option (kedro-org#335)
* docs(datasets): blacken code in rst literal blocks (kedro-org#362)
* docs: cloudpickle is an interesting extension of the pickle functionality (kedro-org#361)
* fix(datasets): Fix secret scan entropy error (kedro-org#383)
* style: Rename mentions of `DataSet` to `Dataset` in `kedro-airflow` and `kedro-telemetry` (kedro-org#384)
* feat(datasets): Migrated `PartitionedDataSet` and `IncrementalDataSet` from main repository to kedro-datasets (kedro-org#253)
* fix: backwards compatibility for `kedro-airflow` (kedro-org#381)
* fix(datasets): Don't warn for SparkDataset on Databricks when using s3 (kedro-org#341)
* update docs API and release notes
* add netcdf requirements to setup
* lint
* add initial tests
* update dataset exists for multifile
* Add full test suite for NetCDFDataSet
* Add docstring examples
* change xarray version req
* update dask req
* rename DataSet -> Dataset
* Update xarray reqs for earlier python versions
* fix setup
* update test coverage
* exclude init from test coverage
* Sub in pathlib for os.remove
* add metadata to dataset
* add doctest for the new datasets
* add patch for supporting http/https
* Small fixes post-merge
* Lint
* Fix import
* Un-ignore NetCDF doctest
* Add fixture
* Mark problematic test as xfail
* Skip problematic test instead of making it fail
* Skip problematic tests and fix failing tests
* Remove comment

Co-authored-by: Merel Theisen, Deepyaman Datta, Ankita Katiyar, Simon Brugman, Felix Wittmann, PtrBld, Alistair McKelvie, Nok Lam Chan, Juan Luis Cano Rodríguez
Description
There's a large community in geoscience, astrophysics, and beyond that leverages NetCDF for datasets stored with structured coordinates and metadata. A massive trove of existing climate and weather data exists as NetCDF (.nc) files, for example.
See #165.
Development notes
I'd like to get this implemented with file syncing for load-from-remote, since it's the most straightforward approach; a sketch of the idea follows below. A future PR could work with kerchunk to allow direct loading from remote storage. That is a really nice toolkit, but it requires managing a lot of generated JSON metadata files, which can sometimes be quite slow to produce. It will take a little bit of tweaking to implement this nicely, since the first run would need to generate and cache/store all of the reference JSONs to make future loads much faster.
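For context, a minimal sketch of the sync-then-open approach described above, using only fsspec and xarray; the bucket path and temp directory are hypothetical and this is not the PR's actual implementation:

from pathlib import Path

import fsspec
import xarray as xr

remote_path = "s3://my-bucket/data/file.nc"  # hypothetical remote NetCDF file
temppath = Path("/tmp/netcdf_staging")       # local staging directory
temppath.mkdir(parents=True, exist_ok=True)

# Sync the remote NetCDF file to local storage, then open it with xarray.
fs = fsspec.filesystem("s3")
local_copy = temppath / Path(remote_path).name
fs.get(remote_path, str(local_copy))
ds = xr.open_dataset(local_copy)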
Checklist
RELEASE.md file