Use `zarr` to validate attrs when writing to zarr #6636

malmans2 · 2022-05-25T16:46:03Z

Closes Writing GDAL ZARR _CRS attribute not possible #6448
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
~~New functions/methods are listed in api.rst~~

I think we can just use zarr to validate attributes, so we can support all types allowed by zarr.

Note that I removed the checks on the keys, as I believe we can rely on zarr for that as well. However, there is an issue with mixed types (e.g., attrs={"a": "foo", 1: "foo"}), but I think that needs to be addressed in zarr. See: zarr-developers/zarr-python#1037

cc: @wankoelias @rabernat

rabernat

I like this PR because it simplifies things and just lets Zarr handle the attributes. One suggestion for simplifying even further.

rabernat · 2022-06-01T16:59:55Z

xarray/backends/zarr.py

+    for key, value in attrs.items():
+        try:
+            zarr_obj.attrs[key] = value
+        except TypeError as e:
+            raise TypeError(f"Invalid attr {key!r}: {value!r}. {e!s}") from e
+    return zarr_obj


We should probably avoid looping over individual attrs. Imagine we have 100s of attributes (not unheard of for some NetCDF datasets) and are using a high-latency store like HTTP; then we would be doing 100s of sequential PUT operations. This could take a really long time. Much better to leverage the .update method on zarr_obj.attrs.

At that point, I would be curious what happens when Zarr gets invalid attrs. Do we even need to catch the error in Xarray? Or could we just let Zarr raise it an be done. In which case, this whole method would become.

Suggested change

for key, value in attrs.items():

try:

zarr_obj.attrs[key] = value

except TypeError as e:

raise TypeError(f"Invalid attr {key!r}: {value!r}. {e!s}") from e

return zarr_obj

zarr_obj.attrs.update(attrs)

...and at that point do we even need a function for it?

Yes - I started with an error as close as possible to the one that was already implemented, but I also think that it's better to avoid the loop.

For example, if I assign a DataArray to an attribute, I get the following error from zarr:

TypeError: Object of type DataArray is not JSON serializable

I think it's OK if xarray doesn't explicitly say which attributes generate the problem (i.e., remove the loop), but maybe xarray should say that the issue is in the attributes (i.e., keep the try/except to raise a more informative error).

What do you think?

I think the most careful and helpful thing for us to do would be to catch the error, and reraise it with a little more context, which is exactly what your latest commits now do. 😊

xarray/backends/zarr.py

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

malmans2 · 2022-06-01T21:40:42Z

All set. Thanks everyone!

dcherian · 2022-06-03T18:48:54Z

Thanks @malmans2

* main: (95 commits) Use `zarr` to validate attrs when writing to zarr (pydata#6636) Add pre-commit hook to check CITATION.cff (pydata#6658) Fix kwargs used for extrapolation in docs (pydata#6639) Fix notebooks' HTML links (pydata#6655) Doc index update (pydata#6530) CFTime support for polyval (pydata#6624) Support dask arrays in datetime_to_numeric (pydata#6556) [pre-commit.ci] pre-commit autoupdate (pydata#6654) 0-padded month. (pydata#6653) [test-upstream] import `cleanup` fixture from `distributed` (pydata#6650) Allow all interp methods in typing (pydata#6647) Typing support for custom backends (pydata#6651) Improved DataArray typing (pydata#6637) Adjust code comments & types from pydata#6638 (pydata#6642) Typing of `str` and `dt` accessors (pydata#6641) Feature/to dict encoding (pydata#6635) fix {full,zeros,ones}_like overloads (pydata#6630) Mypy badge (pydata#6626) concatenate docs style (pydata#6621) Typing for open_dataset/array/mfdataset and to_netcdf/zarr (pydata#6612) ...

use zarr to validate attrs

ddfcc78

TomNicholas added the topic-zarr Related to zarr storage library label May 25, 2022

dcherian requested a review from rabernat May 28, 2022 02:12

rabernat reviewed Jun 1, 2022

View reviewed changes

malmans2 added 2 commits June 1, 2022 22:49

remove for loop

c386308

update docs

a351685

dcherian reviewed Jun 1, 2022

View reviewed changes

xarray/backends/zarr.py Outdated Show resolved Hide resolved

Update xarray/backends/zarr.py

e2605e5

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

rabernat approved these changes Jun 1, 2022

View reviewed changes

malmans2 added 2 commits June 1, 2022 23:33

fix flake8

c8918df

fix tests

cc6786e

dcherian added the plan to merge Final call for comments label Jun 1, 2022

dcherian merged commit b080349 into pydata:main Jun 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use `zarr` to validate attrs when writing to zarr #6636

Use `zarr` to validate attrs when writing to zarr #6636

malmans2 commented May 25, 2022 •

edited

Loading

rabernat left a comment

rabernat Jun 1, 2022 •

edited

Loading

malmans2 Jun 1, 2022 •

edited

Loading

rabernat Jun 1, 2022

malmans2 commented Jun 1, 2022

dcherian commented Jun 3, 2022

Use zarr to validate attrs when writing to zarr #6636

Use zarr to validate attrs when writing to zarr #6636

Conversation

malmans2 commented May 25, 2022 • edited Loading

rabernat left a comment

Choose a reason for hiding this comment

rabernat Jun 1, 2022 • edited Loading

Choose a reason for hiding this comment

malmans2 Jun 1, 2022 • edited Loading

Choose a reason for hiding this comment

rabernat Jun 1, 2022

Choose a reason for hiding this comment

malmans2 commented Jun 1, 2022

dcherian commented Jun 3, 2022

Use `zarr` to validate attrs when writing to zarr #6636

Use `zarr` to validate attrs when writing to zarr #6636

malmans2 commented May 25, 2022 •

edited

Loading

rabernat Jun 1, 2022 •

edited

Loading

malmans2 Jun 1, 2022 •

edited

Loading