Problem with cftime coordinates on sequence_dim #51

Closed · rabernat opened this issue Jan 24, 2021 · 21 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers)

@rabernat commented Jan 24, 2021

In #47, @naomi-henderson reported that cftime-based time coordinates did not work with her recipe. (Details in this notebook.)

The error occurs on prepare_target. Some relevant traceback is:

---------------------------------------------------------------------------
OutOfBoundsDatetime                       Traceback (most recent call last)
/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in decode_cf_datetime(num_dates, units, calendar, use_cftime)
    193         try:
--> 194             dates = _decode_datetime_with_pandas(flat_num_dates, units, calendar)
    195         except (KeyError, OutOfBoundsDatetime, OverflowError):

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in _decode_datetime_with_pandas(flat_num_dates, units, calendar)
    141             "Cannot decode times from a non-standard calendar, {!r}, using "
--> 142             "pandas.".format(calendar)
    143         )

OutOfBoundsDatetime: Cannot decode times from a non-standard calendar, 'noleap', using pandas.

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in _decode_cf_datetime_dtype(data, units, calendar, use_cftime)
    112     try:
--> 113         result = decode_cf_datetime(example_value, units, calendar, use_cftime)
    114     except Exception:

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in decode_cf_datetime(num_dates, units, calendar, use_cftime)
    196             dates = _decode_datetime_with_cftime(
--> 197                 flat_num_dates.astype(float), units, calendar
    198             )

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in _decode_datetime_with_cftime(num_dates, units, calendar)
    133     return np.asarray(
--> 134         cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
    135     )

src/cftime/_cftime.pyx in cftime._cftime.num2date()

TypeError: unsupported operand type(s) for +: 'cftime._cftime.DatetimeNoLeap' and 'NoneType'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-47-9bb7712f434d> in <module>
      1 # put basic info in target directory
----> 2 recipe.prepare_target()

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/pangeo_forge/recipe.py in _prepare_target()
    166 
    167             try:
--> 168                 ds = self.open_target()
    169                 logger.info("Found an existing dataset in target")
    170                 logger.debug(f"{ds}")

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/pangeo_forge/recipe.py in open_target(self)
    271     def open_target(self):
    272         target_mapper = self.target.get_mapper()
--> 273         return xr.open_zarr(target_mapper)
    274 
    275     def initialize_target(self, ds, **expand_dims):

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/backends/zarr.py in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, decode_timedelta, use_cftime, **kwargs)
    686         backend_kwargs=backend_kwargs,
    687         decode_timedelta=decode_timedelta,
--> 688         use_cftime=use_cftime,
    689     )
    690 

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime, decode_timedelta)
    573 
    574     with close_on_error(store):
--> 575         ds = maybe_decode_store(store, chunks)
    576 
    577     # Ensure source filename always stored in dataset object (GH issue #2550)

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/backends/api.py in maybe_decode_store(store, chunks)
    477             drop_variables=drop_variables,
    478             use_cftime=use_cftime,
--> 479             decode_timedelta=decode_timedelta,
    480         )
    481 

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta)
    596         drop_variables=drop_variables,
    597         use_cftime=use_cftime,
--> 598         decode_timedelta=decode_timedelta,
    599     )
    600     ds = Dataset(vars, attrs=attrs)

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/conventions.py in decode_cf_variables(variables, attributes, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta)
    498             stack_char_dim=stack_char_dim,
    499             use_cftime=use_cftime,
--> 500             decode_timedelta=decode_timedelta,
    501         )
    502         if decode_coords:

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/conventions.py in decode_cf_variable(name, var, concat_characters, mask_and_scale, decode_times, decode_endianness, stack_char_dim, use_cftime, decode_timedelta)
    338         var = times.CFTimedeltaCoder().decode(var, name=name)
    339     if decode_times:
--> 340         var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)
    341 
    342     dimensions, data, attributes, encoding = variables.unpack_for_decoding(var)

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in decode(self, variable, name)
    461             units = pop_to(attrs, encoding, "units")
    462             calendar = pop_to(attrs, encoding, "calendar")
--> 463             dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
    464             transform = partial(
    465                 decode_cf_datetime,

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in _decode_cf_datetime_dtype(data, units, calendar, use_cftime)
    121             "if it is not installed."
    122         )
--> 123         raise ValueError(msg)
    124     else:
    125         dtype = getattr(result, "dtype", np.dtype("object"))

ValueError: unable to decode time units 'hours since 0001-01-16 12:00:00.000000' with "calendar 'noleap'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.

Examining this closely, it looks like there is already a dataset in the target, but it can't be opened. The notebook has also been run non-sequentially, which makes this hard to debug. @naomi-henderson, it would be great if you could turn this into a reproducible example we can use to get to the bottom of the cftime issue.
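For reference, a minimal standalone sketch of the kind of reproducer being requested (an assumption-laden illustration: it writes a noleap time coordinate to zarr and re-opens it; the actual failure may require the recipe's region-write path rather than this simple roundtrip):

# Hypothetical reproducer sketch, not from the original thread:
# write a noleap-calendar time coordinate to zarr, then re-open it,
# exercising the decode path shown in the traceback above.
import numpy as np
import xarray as xr

times = xr.cftime_range("0001-01-16 12:00", periods=4, freq="30D", calendar="noleap")
ds = xr.Dataset({"x": ("time", np.arange(4.0))}, coords={"time": times})
ds.to_zarr("/tmp/noleap-test.zarr", mode="w")
reopened = xr.open_zarr("/tmp/noleap-test.zarr")  # the open step that failed above
print(reopened.time.values[:2])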

@rabernat added the bug (Something isn't working) and good first issue (Good for newcomers) labels on Jan 24, 2021
@naomi-henderson commented

Okay, here is my cftime trouble example.

This seems to be a tricky bug that only shows up under certain conditions. For simplicity, I am considering the case of one chunk per netcdf file and only two netcdf files. The simplest case of a single netcdf file being converted into a single chunk of zarr is handled correctly with cftime. I needed to make an example with at least two netcdf files, even though the error shows up when storing the very first chunk/file.

For my own sanity, I found out that I can make this notebook work properly, generating an acceptable zarr store, with a minor edit of .../xarray/coding/times.py, adding 'noleap' to:
_STANDARD_CALENDARS = {"standard", "gregorian", "proleptic_gregorian"}
but this might not be a proper fix, as it might cause problems elsewhere ...
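The same change can be sketched as a runtime monkey-patch instead of an edit to the installed source (an illustration only, relying on the private _STANDARD_CALENDARS set named above, and carrying the same risk of breaking decoding elsewhere):

# Hypothetical sketch: treat 'noleap' as a standard calendar at runtime,
# mirroring the source edit described above. Use with caution.
import xarray.coding.times as xct

xct._STANDARD_CALENDARS = set(xct._STANDARD_CALENDARS) | {"noleap"}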

@rabernat, I have also noticed that, despite an indication to the contrary when printing out the recipe:

NetCDFtoZarrSequentialRecipe(sequence_dim='time', inputs_per_chunk=1, nitems_per_input=600, target=FSSpecTarget(fs=<fsspec.implementations.local.LocalFileSystem object at 0x7f17aa446a50>, root_path='/tmp/tmp37ghyji0'), input_cache=CacheFSSpecTarget(fs=<fsspec.implementations.local.LocalFileSystem object at 0x7f17aa446a50>, root_path='/tmp/tmpvna52uxv'), require_cache=True, consolidate_zarr=True, xarray_open_kwargs={'use_cftime': True}, xarray_concat_kwargs={}, delete_input_encoding=True)

there is no .zmetadata being saved - otherwise the zarr store looks fine.
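One way to check for and regenerate the consolidated metadata by hand (a sketch assuming the recipe's target attribute, whose get_mapper() method appears in the traceback above):

# Sketch: manually consolidate the store's metadata. With consolidate_zarr=True
# this should not be needed, but it confirms whether .zmetadata is simply missing.
import zarr

mapper = recipe.target.get_mapper()  # same call used in open_target()
print(".zmetadata" in mapper)        # False if consolidation never happened
zarr.consolidate_metadata(mapper)    # writes the .zmetadata key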

@davidbrochart commented

I have a similar issue in the GPM IMERG recipe. For me, recipe.prepare_target() works fine; the error is on store_chunk:

ValueError: unable to decode time units 'minutes since 2000-06-01 00:00:00.000000' with "calendar 'julian'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.

I could share a notebook where it can be reproduced, but it needs my username/password and #59.

@rabernat commented

@davidbrochart - do you have cftime installed?

@rabernat commented

Naomi's full traceback is:

---------------------------------------------------------------------------
OutOfBoundsDatetime                       Traceback (most recent call last)
/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in decode_cf_datetime(num_dates, units, calendar, use_cftime)
    193         try:
--> 194             dates = _decode_datetime_with_pandas(flat_num_dates, units, calendar)
    195         except (KeyError, OutOfBoundsDatetime, OverflowError):

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in _decode_datetime_with_pandas(flat_num_dates, units, calendar)
    141             "Cannot decode times from a non-standard calendar, {!r}, using "
--> 142             "pandas.".format(calendar)
    143         )

OutOfBoundsDatetime: Cannot decode times from a non-standard calendar, 'noleap', using pandas.

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in _decode_cf_datetime_dtype(data, units, calendar, use_cftime)
    112     try:
--> 113         result = decode_cf_datetime(example_value, units, calendar, use_cftime)
    114     except Exception:

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in decode_cf_datetime(num_dates, units, calendar, use_cftime)
    196             dates = _decode_datetime_with_cftime(
--> 197                 flat_num_dates.astype(float), units, calendar
    198             )

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in _decode_datetime_with_cftime(num_dates, units, calendar)
    133     return np.asarray(
--> 134         cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
    135     )

src/cftime/_cftime.pyx in cftime._cftime.num2date()

TypeError: unsupported operand type(s) for +: 'cftime._cftime.DatetimeNoLeap' and 'NoneType'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-8-dd7bdaf49ef5> in <module>
      1 #from cftime import DatetimeNoLeap
      2 # store first chunk
----> 3 recipe.store_chunk(0)

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/pangeo_forge/recipe.py in _store_chunk(chunk_key)
    209             write_region = self.region_for_chunk(chunk_key)
    210             logger.info(f"Storing chunk '{chunk_key}' to Zarr region {write_region}")
--> 211             ds_chunk.to_zarr(target_mapper, region=write_region)
    212 
    213         return _store_chunk

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/core/dataset.py in to_zarr(self, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region)
   1754             consolidated=consolidated,
   1755             append_dim=append_dim,
-> 1756             region=region,
   1757         )
   1758 

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/backends/api.py in to_zarr(dataset, store, chunk_store, mode, synchronizer, group, encoding, compute, consolidated, append_dim, region)
   1479     writer = ArrayWriter()
   1480     # TODO: figure out how to properly handle unlimited_dims
-> 1481     dump_to_store(dataset, zstore, writer, encoding=encoding)
   1482     writes = writer.sync(compute=compute)
   1483 

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1156         variables, attrs = encoder(variables, attrs)
   1157 
-> 1158     store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
   1159 
   1160 

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/backends/zarr.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    460             # there are variables to append
    461             # their encoding must be the same as in the store
--> 462             ds = open_zarr(self.ds.store, group=self.ds.path, chunks=None)
    463             variables_with_encoding = {}
    464             for vn in existing_variables:

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/backends/zarr.py in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, decode_timedelta, use_cftime, **kwargs)
    686         backend_kwargs=backend_kwargs,
    687         decode_timedelta=decode_timedelta,
--> 688         use_cftime=use_cftime,
    689     )
    690 

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, group, decode_cf, mask_and_scale, decode_times, autoclose, concat_characters, decode_coords, engine, chunks, lock, cache, drop_variables, backend_kwargs, use_cftime, decode_timedelta)
    573 
    574     with close_on_error(store):
--> 575         ds = maybe_decode_store(store, chunks)
    576 
    577     # Ensure source filename always stored in dataset object (GH issue #2550)

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/backends/api.py in maybe_decode_store(store, chunks)
    477             drop_variables=drop_variables,
    478             use_cftime=use_cftime,
--> 479             decode_timedelta=decode_timedelta,
    480         )
    481 

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/conventions.py in decode_cf(obj, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta)
    596         drop_variables=drop_variables,
    597         use_cftime=use_cftime,
--> 598         decode_timedelta=decode_timedelta,
    599     )
    600     ds = Dataset(vars, attrs=attrs)

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/conventions.py in decode_cf_variables(variables, attributes, concat_characters, mask_and_scale, decode_times, decode_coords, drop_variables, use_cftime, decode_timedelta)
    498             stack_char_dim=stack_char_dim,
    499             use_cftime=use_cftime,
--> 500             decode_timedelta=decode_timedelta,
    501         )
    502         if decode_coords:

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/conventions.py in decode_cf_variable(name, var, concat_characters, mask_and_scale, decode_times, decode_endianness, stack_char_dim, use_cftime, decode_timedelta)
    338         var = times.CFTimedeltaCoder().decode(var, name=name)
    339     if decode_times:
--> 340         var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)
    341 
    342     dimensions, data, attributes, encoding = variables.unpack_for_decoding(var)

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in decode(self, variable, name)
    461             units = pop_to(attrs, encoding, "units")
    462             calendar = pop_to(attrs, encoding, "calendar")
--> 463             dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
    464             transform = partial(
    465                 decode_cf_datetime,

/usr/local/python/anaconda3/envs/pangeo-forge/lib/python3.7/site-packages/xarray/coding/times.py in _decode_cf_datetime_dtype(data, units, calendar, use_cftime)
    121             "if it is not installed."
    122         )
--> 123         raise ValueError(msg)
    124     else:
    125         dtype = getattr(result, "dtype", np.dtype("object"))

ValueError: unable to decode time units 'hours since 1850-01-15 12:00:00.000000' with "calendar 'noleap'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.

@davidbrochart commented

@davidbrochart - do you have cftime installed?

Yes, and when I open_chunk I get cftime objects in the time dimension: cftime.DatetimeJulian(2000, 6, 1, 0, 0, 0, 0)...
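A quick way to confirm that cftime itself can decode the failing units/calendar pair, isolating the problem to xarray's decode path (a sketch; the exact repr varies across cftime versions):

# Sketch: decode the units/calendar pair from the error with cftime directly.
import cftime

print(cftime.num2date([0], "minutes since 2000-06-01 00:00:00", calendar="julian"))
# expect something like [cftime.DatetimeJulian(2000, 6, 1, 0, 0, 0, 0)]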

@cisaacstern commented

@naomi-henderson, the link to your 'cftime trouble example' above (https://github.com/naomi-henderson/cmip6collect2/blob/main/NZS-cftime.ipynb) appears to be broken. (I get a 404 error when clicking it.)

Has this notebook been renamed or moved?

I'm curious to look into this issue, and it would be helpful to have the minimal example as a reference. Thanks!

@naomi-henderson commented

@cisaacstern , that notebook was in reference to a very old version of the NetCDFtoZarrSequentialRecipe. I haven't looked at the problem since then and I don't know if it is still an issue. I will try to get the latest recipe and see if I still have cftime issues and, if so, try to find a better example for you.

@rabernat commented Apr 5, 2021

Naomi, I believe that everything needed to support your recipe is in place in the current master branch. If you specify nitems_per_input=None in the recipe construction (and add a metadata_cache storage target), it should be able to deal with a variable number of timesteps per file. Handling of encoding has been refactored in #86. So I am optimistic that your recipe will now work. Would you mind giving it a try and reporting back with any errors you find?

@naomi-henderson commented

@cisaacstern and @rabernat , I used nitems_per_input=None in the recipe construction and added a metadata_cache storage target, using the current master branch of pangeo-forge. My recipe, using just NetCDFtoZarrSequentialRecipe, now works on my four test cases, grabbing netcdf files from the GFDL AWS netcdf bucket, with no cftime issues! I was also able to pick the chunks for the time dimension and pass join='exact' via xarray_concat_kwargs.

But there are some new issues:

  1. The caching of files in my '/tmp' is still hit-or-miss, especially for one of the datasets - sometimes it gets 1 or 2 of the netcdf files and then stalls out on the next file. I guess it could be my internet connection causing the trouble, but I am getting these netcdf files from S3 - not some podunk server. I don't have trouble with my regular scripts when downloading from S3 to a local directory using python requests.
  2. TypeError: to_zarr() got an unexpected keyword argument 'safe_chunks', so I had to comment out the safe_chunks option in the 'NetCDFtoZarrRecipe' of 'recipe.py'. I also tried with the current master branch of xarray (xarray-0.17.1.dev49+g903278a) - same problem.
  3. The 'https://pangeo-forge.readthedocs.io/en/latest/tutorials/netcdf_zarr_sequential.html' example is outdated in many ways ... in particular recipe.open_chunk returns a contextlib._GeneratorContextManager thing - not a dataset - so I didn't know how to look at a particular chunk to verify all was well. If I knew how to deal with a Generator Context Manager thing, I could try to fix the documentation 8-)

My basic recipe:

recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=input_urls,
    sequence_dim="time", 
    target_chunks=target_chunks,
    nitems_per_input=None, 
    process_chunk=set_bnds_as_coords,
    xarray_open_kwargs={'use_cftime':True},
    xarray_concat_kwargs={'join':'exact'}
)

Storage Targets:

import tempfile
from fsspec.implementations.local import LocalFileSystem
from pangeo_forge.storage import FSSpecTarget, CacheFSSpecTarget

fs_local = LocalFileSystem()

target_dir = tempfile.TemporaryDirectory()
target = FSSpecTarget(fs_local, target_dir.name)

cache_dir = tempfile.TemporaryDirectory()
cache_target = CacheFSSpecTarget(fs_local, cache_dir.name)

meta_dir = tempfile.TemporaryDirectory()
meta_store = FSSpecTarget(fs_local, meta_dir.name)

recipe.target = target
recipe.input_cache = cache_target
recipe.metadata_cache = meta_store

Execute recipe:

for input_name in recipe.iter_inputs():
    recipe.cache_input(input_name)
recipe.prepare_target() 
for chunk in recipe.iter_chunks():
    recipe.store_chunk(chunk)
recipe.finalize_target()

@cisaacstern commented

Glad to hear that the initial cftime issue is resolved.

For the sake of organization, perhaps this issue should be closed, and the 3 new issues @naomi-henderson discovered opened as distinct, new Issues?

@rabernat commented Apr 7, 2021

Thanks so much @naomi-henderson for trying this out!

Issues 1 and 2 are very likely related to your environment. The intermittent hanging in 1 sounds a lot like fsspec/filesystem_spec#565; this was a bug in filesystem spec that was surfaced in part by our work on Pangeo Forge. It has been fixed in the latest fsspec master. It would be great if you could verify this.

2 is because we are now dependent on an as-yet unmerged xarray PR (pydata/xarray#5065) which adds the safe_chunks option. Hopefully that will go in soon.

For development, you're probably best off creating a new environment that matches our CI, which should eliminate both problems:

https://github.com/pangeo-forge/pangeo-forge/blob/ba4dc7430137ae854b358698f0eb84fb4232c032/ci/py3.8.yml#L1-L39

(I have switched from conda to mamba and am never going back.)

3. example is outdated in many ways

Thanks for checking this. You're absolutely right that I have not bothered to update the tutorials after some recent changes. However, I hope that "outdated in many ways" is an exaggeration; I have strived to keep the API the same. The biggest change, as you noted, is the use of context managers for all openers. This allows us to keep better track of open / closed file objects and is in line with Python best practice. So instead of:

ds = recipe.open_input(input_key)

you do

with recipe.open_input(input_key) as ds:
    # do something with ds
    display(ds)
    # If you want it in memory outside of the context manager, do
    ds.load()
# now the file is closed

(Same for open_chunk(); a sketch follows below.) It would be fantastic if you could update the tutorials where needed. Going even further, perhaps you could turn your CMIP recipe into its own tutorial example notebook for the docs?
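For example, to inspect a particular chunk, as asked in item 3 above (a sketch assuming a key obtained from recipe.iter_chunks(), as in the execution loop shown earlier):

# Sketch: open the first chunk via the context-manager API
chunk_key = next(iter(recipe.iter_chunks()))
with recipe.open_chunk(chunk_key) as ds_chunk:
    print(ds_chunk)  # verify coordinates and chunking while the chunk is open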

Thanks again for your helpful comments and real-world testing. Things are moving fast, so it's great to have this input.

@rabernat commented Apr 7, 2021

For the sake of organization, perhaps this issue should be closed, and the 3 new issues @naomi-henderson discovered opened as distinct, new Issues?

Thanks for this suggestion @cisaacstern. I think 1&2 are related (see comment above). 3 is about stale documentation. Having dedicated issues for these would be useful; however, depending on @naomi-henderson's response, they might be resolved very quickly, so possibly not needed...

@naomi-henderson commented

@rabernat, As usual, your comments are very clarifying, thanks! This old lady brain gets easily bogged down!

Yes, "outdated in many ways" was an exaggeration, of course - that first example is very helpful! Now that I know about context managers I will go through it again and make a pull request with my suggestions.

Okay, will use the new pangeo-forge environment - I had made a kernel with the old one and then just updated xarray, fsspec and pangeo-forge. I agree it is best to make a new kernel at this point. I will also give mamba a try because conda is taking way.... too.... long....

As for making a new tutorial example with CMIP6 - yes, I will give it a try. I am concerned about all of the moving parts, GFDL's AWS collection included, but will try to create something robust.

@rabernat commented

@naomi-henderson - just checking whether things worked better with the updated environment?

@naomi-henderson commented Apr 13, 2021

@rabernat - I have been visiting grandkids in Virginia, so I am just getting back to this now. I had updated the environment, but then had trouble with the 'https://aws-cloudnode.esgfed.org/thredds/dodsC/' OPeNDAP server not working last week, so I had been trying to use the s3:// urls directly - which have to be explicitly opened (unlike the gs:// urls).

It was easy to switch to mamba - thanks for the suggestion! - but then I also needed to python3 -m pip install ipykernel to add the py3.8.yml kernel, and then used mamba to add a few more packages - even matplotlib. But the trouble with the esgfed OPeNDAP server not working while I was also changing the kernel was a bit of a hassle to sort out, and I ran out of time. Now that I am back, I will work on the first tutorial and on adding another tutorial for CMIP6.

Have you decided to rework the recipe and pattern codes? If so, perhaps I should wait?

EDIT: So I guess it is not a gs:// vs s3:// issue - it is a difference between the zarr stores in the s3://cmip6-pds bucket and netcdf files in the s3://esgf-world bucket.

@rabernat commented

I'm so happy you're able to visit your grandkids! How exciting! 😊

Sorry about your environment troubles.

so had been trying to use the s3:// urls directly - which have to be explicitly opened (unlike the gs:// urls).

This doesn't make sense to me. s3fs and gcsfs should be interchangeable here and behave identically. Could you give a more verbose example of what you mean?

Have you decided to rework the recipe and pattern codes? If so, perhaps I should wait?

I am working on that, but it's mostly an internal refactor. No reason not to try.

Anyway, nothing urgent here. Just checking in.

@naomi-henderson commented

@rabernat ,
Okay, it was sloppy speak - but here is what I meant.

# Connect to AWS S3 storage
import s3fs
fs_s3 = s3fs.S3FileSystem(anon=True)

and then, for the cmip6-pds bucket, xarray can open the zarr stores directly:

url = "s3://cmip6-pds/CMIP6/CMIP/NASA-GISS/GISS-E2-1-G/historical/r1i1p1f1/Amon/ua/gn/v20180827/"
ds = xr.open_zarr(url, consolidated=True)

but, for the esgf-world bucket, xarray cannot use the s3path directly; we must first open the file:

s3path = "s3://esgf-world/CMIP6/CMIP/NASA-GISS/GISS-E2-1-G/historical/r1i1p1f1/Amon/ua/gn/v20180827/ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_185001-190012.nc"
url    = fs_s3.open(s3path, mode='rb')
ds = xr.open_dataset(url)

otherwise, just using xr.open_dataset(s3path), we get the error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    198             try:
--> 199                 file = self._cache[self._key]
    200             except KeyError:

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/lru_cache.py in __getitem__(self, key)
     52         with self._lock:
---> 53             value = self._cache[key]
     54             self._cache.move_to_end(key)

KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('s3://esgf-world/CMIP6/CMIP/NASA-GISS/GISS-E2-1-G/historical/r1i1p1f1/Amon/ua/gn/v20180827/ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_185001-190012.nc',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False))]

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-64-0d8d5c8ceff2> in <module>
      1 s3path = "s3://esgf-world/CMIP6/CMIP/NASA-GISS/GISS-E2-1-G/historical/r1i1p1f1/Amon/ua/gn/v20180827/ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_185001-190012.nc"
      2 #url    = fs_s3.open(s3path, mode='rb')
----> 3 ds = xr.open_dataset(s3path)

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
    507 
    508     overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 509     backend_ds = backend.open_dataset(
    510         filename_or_obj,
    511         drop_variables=drop_variables,

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/netCDF4_.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, format, clobber, diskless, persist, lock, autoclose)
    543     ):
    544 
--> 545         store = NetCDF4DataStore.open(
    546             filename_or_obj,
    547             mode=mode,

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/netCDF4_.py in open(cls, filename, mode, format, group, clobber, diskless, persist, lock, lock_maker, autoclose)
    376             netCDF4.Dataset, filename, mode=mode, kwargs=kwargs
    377         )
--> 378         return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
    379 
    380     def _acquire(self, needs_lock=True):

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/netCDF4_.py in __init__(self, manager, group, mode, lock, autoclose)
    324         self._group = group
    325         self._mode = mode
--> 326         self.format = self.ds.data_model
    327         self._filename = self.ds.filepath()
    328         self.is_remote = is_remote_uri(self._filename)

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/netCDF4_.py in ds(self)
    385     @property
    386     def ds(self):
--> 387         return self._acquire()
    388 
    389     def open_store_variable(self, name, var):

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/netCDF4_.py in _acquire(self, needs_lock)
    379 
    380     def _acquire(self, needs_lock=True):
--> 381         with self._manager.acquire_context(needs_lock) as root:
    382             ds = _nc4_require_group(root, self._group, self._mode)
    383         return ds

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/contextlib.py in __enter__(self)
    111         del self.args, self.kwds, self.func
    112         try:
--> 113             return next(self.gen)
    114         except StopIteration:
    115             raise RuntimeError("generator didn't yield") from None

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/file_manager.py in acquire_context(self, needs_lock)
    185     def acquire_context(self, needs_lock=True):
    186         """Context manager for acquiring a file."""
--> 187         file, cached = self._acquire_with_cache_info(needs_lock)
    188         try:
    189             yield file

/usr/local/python/anaconda3/envs/pangeo-forge3.8/lib/python3.8/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
    203                     kwargs = kwargs.copy()
    204                     kwargs["mode"] = self._mode
--> 205                 file = self._opener(*self._args, **kwargs)
    206                 if self._mode == "w":
    207                     # ensure file doesn't get overriden when opened again

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

OSError: [Errno -36] NetCDF: Invalid argument: b's3://esgf-world/CMIP6/CMIP/NASA-GISS/GISS-E2-1-G/historical/r1i1p1f1/Amon/ua/gn/v20180827/ua_Amon_GISS-E2-1-G_historical_r1i1p1f1_gn_185001-190012.nc'

@naomi-henderson commented

so in my test cases, I was using the OPeNDAP URLs to avoid this extra step ...

@martindurant commented

The difference is between CDF and zarr, not s3fs and gcsfs. While I finally got my PR into xarray to use fsspec for the zarr backend, and interpret the path as necessary, this was not done for netCDF (et al) because:

  • the ambiguity around DAP paths, which look like fsspec remote paths, but should not be interpreted by fsspec
  • the mechanism for opening things is quite different for zarr (directories and no globs) versus netCDF

Since the method of passing file-like objects works OK, making everything work has not been a high priority.

@naomi-henderson commented

thanks, @martindurant , yes - that makes sense. So I am passing file-like objects for the CDF files (or using their OPeNDAP server) in my normal workflow - I just thought it would be cleaner for the tutorial to not have the extra step. I agree it is not high priority.

@naomi-henderson commented

So far so good! All of my CMIP6 netcdf -> zarr test cases went through with no issues (safe_chunks option still not available, but apparently that was not causing my other issues). It looks like we could start using it for constructing zarr datasets from the esgf-world netcdf files and putting them in GC! It might turn up some other interesting special cases. I have just one quick question and then this issue could be closed.

The time-independent CMIP6 datasets are usually in a single netcdf file and are not very large, so we don't need to worry too much about the chunk sizes. The NetCDFtoZarrSequentialRecipe could probably be adapted to this degenerate case - is that what you suggest I use?

I will open pull requests for my suggested changes to the documentation and tutorials once I have had time to rework the first tutorial example and contribute a new CMIP6 specific example.
