Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use zarr to validate attrs when writing to zarr #6636

Merged
merged 6 commits into from
Jun 3, 2022
Merged

Use zarr to validate attrs when writing to zarr #6636

merged 6 commits into from
Jun 3, 2022

Conversation

malmans2
Copy link
Contributor

@malmans2 malmans2 commented May 25, 2022

I think we can just use zarr to validate attributes, so we can support all types allowed by zarr.

Note that I removed the checks on the keys, as I believe we can rely on zarr for that as well. However, there is an issue with mixed types (e.g., attrs={"a": "foo", 1: "foo"}), but I think that needs to be addressed in zarr. See: zarr-developers/zarr-python#1037

cc: @wankoelias @rabernat

@TomNicholas TomNicholas added the topic-zarr Related to zarr storage library label May 25, 2022
@dcherian dcherian requested a review from rabernat May 28, 2022 02:12
Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this PR because it simplifies things and just lets Zarr handle the attributes. One suggestion for simplifying even further.

Comment on lines 324 to 329
for key, value in attrs.items():
try:
zarr_obj.attrs[key] = value
except TypeError as e:
raise TypeError(f"Invalid attr {key!r}: {value!r}. {e!s}") from e
return zarr_obj
Copy link
Contributor

@rabernat rabernat Jun 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably avoid looping over individual attrs. Imagine we have 100s of attributes (not unheard of for some NetCDF datasets) and are using a high-latency store like HTTP; then we would be doing 100s of sequential PUT operations. This could take a really long time. Much better to leverage the .update method on zarr_obj.attrs.

At that point, I would be curious what happens when Zarr gets invalid attrs. Do we even need to catch the error in Xarray? Or could we just let Zarr raise it an be done. In which case, this whole method would become.

Suggested change
for key, value in attrs.items():
try:
zarr_obj.attrs[key] = value
except TypeError as e:
raise TypeError(f"Invalid attr {key!r}: {value!r}. {e!s}") from e
return zarr_obj
zarr_obj.attrs.update(attrs)

...and at that point do we even need a function for it?

Copy link
Contributor Author

@malmans2 malmans2 Jun 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I started with an error as close as possible to the one that was already implemented, but I also think that it's better to avoid the loop.

For example, if I assign a DataArray to an attribute, I get the following error from zarr:

TypeError: Object of type DataArray is not JSON serializable

I think it's OK if xarray doesn't explicitly say which attributes generate the problem (i.e., remove the loop), but maybe xarray should say that the issue is in the attributes (i.e., keep the try/except to raise a more informative error).

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the most careful and helpful thing for us to do would be to catch the error, and reraise it with a little more context, which is exactly what your latest commits now do. 😊

xarray/backends/zarr.py Outdated Show resolved Hide resolved
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
@malmans2
Copy link
Contributor Author

malmans2 commented Jun 1, 2022

All set. Thanks everyone!

@dcherian dcherian added the plan to merge Final call for comments label Jun 1, 2022
@dcherian dcherian merged commit b080349 into pydata:main Jun 3, 2022
@dcherian
Copy link
Contributor

dcherian commented Jun 3, 2022

Thanks @malmans2

dcherian added a commit to dcherian/xarray that referenced this pull request Jun 12, 2022
* main: (95 commits)
  Use `zarr` to validate attrs when writing to zarr (pydata#6636)
  Add pre-commit hook to check CITATION.cff (pydata#6658)
  Fix kwargs used for extrapolation in docs (pydata#6639)
  Fix notebooks' HTML links (pydata#6655)
  Doc index update (pydata#6530)
  CFTime support for polyval (pydata#6624)
  Support dask arrays in datetime_to_numeric (pydata#6556)
  [pre-commit.ci] pre-commit autoupdate (pydata#6654)
  0-padded month. (pydata#6653)
  [test-upstream] import `cleanup` fixture from `distributed` (pydata#6650)
  Allow all interp methods in typing (pydata#6647)
  Typing support for custom backends (pydata#6651)
  Improved DataArray typing (pydata#6637)
  Adjust code comments & types from pydata#6638 (pydata#6642)
  Typing of `str` and `dt` accessors (pydata#6641)
  Feature/to dict encoding (pydata#6635)
  fix {full,zeros,ones}_like overloads (pydata#6630)
  Mypy badge (pydata#6626)
  concatenate docs style (pydata#6621)
  Typing for open_dataset/array/mfdataset and to_netcdf/zarr (pydata#6612)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Writing GDAL ZARR _CRS attribute not possible
4 participants