-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use zarr
to validate attrs when writing to zarr
#6636
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this PR because it simplifies things and just lets Zarr handle the attributes. One suggestion for simplifying even further.
xarray/backends/zarr.py
Outdated
for key, value in attrs.items(): | ||
try: | ||
zarr_obj.attrs[key] = value | ||
except TypeError as e: | ||
raise TypeError(f"Invalid attr {key!r}: {value!r}. {e!s}") from e | ||
return zarr_obj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably avoid looping over individual attrs. Imagine we have 100s of attributes (not unheard of for some NetCDF datasets) and are using a high-latency store like HTTP; then we would be doing 100s of sequential PUT operations. This could take a really long time. Much better to leverage the .update
method on zarr_obj.attrs
.
At that point, I would be curious what happens when Zarr gets invalid attrs. Do we even need to catch the error in Xarray? Or could we just let Zarr raise it an be done. In which case, this whole method would become.
for key, value in attrs.items(): | |
try: | |
zarr_obj.attrs[key] = value | |
except TypeError as e: | |
raise TypeError(f"Invalid attr {key!r}: {value!r}. {e!s}") from e | |
return zarr_obj | |
zarr_obj.attrs.update(attrs) |
...and at that point do we even need a function for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - I started with an error as close as possible to the one that was already implemented, but I also think that it's better to avoid the loop.
For example, if I assign a DataArray to an attribute, I get the following error from zarr
:
TypeError: Object of type DataArray is not JSON serializable
I think it's OK if xarray
doesn't explicitly say which attributes generate the problem (i.e., remove the loop), but maybe xarray
should say that the issue is in the attributes (i.e., keep the try/except to raise a more informative error).
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the most careful and helpful thing for us to do would be to catch the error, and reraise it with a little more context, which is exactly what your latest commits now do. 😊
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
All set. Thanks everyone! |
Thanks @malmans2 |
* main: (95 commits) Use `zarr` to validate attrs when writing to zarr (pydata#6636) Add pre-commit hook to check CITATION.cff (pydata#6658) Fix kwargs used for extrapolation in docs (pydata#6639) Fix notebooks' HTML links (pydata#6655) Doc index update (pydata#6530) CFTime support for polyval (pydata#6624) Support dask arrays in datetime_to_numeric (pydata#6556) [pre-commit.ci] pre-commit autoupdate (pydata#6654) 0-padded month. (pydata#6653) [test-upstream] import `cleanup` fixture from `distributed` (pydata#6650) Allow all interp methods in typing (pydata#6647) Typing support for custom backends (pydata#6651) Improved DataArray typing (pydata#6637) Adjust code comments & types from pydata#6638 (pydata#6642) Typing of `str` and `dt` accessors (pydata#6641) Feature/to dict encoding (pydata#6635) fix {full,zeros,ones}_like overloads (pydata#6630) Mypy badge (pydata#6626) concatenate docs style (pydata#6621) Typing for open_dataset/array/mfdataset and to_netcdf/zarr (pydata#6612) ...
whats-new.rst
New functions/methods are listed inapi.rst
I think we can just use zarr to validate attributes, so we can support all types allowed by zarr.
Note that I removed the checks on the keys, as I believe we can rely on zarr for that as well. However, there is an issue with mixed types (e.g.,
attrs={"a": "foo", 1: "foo"}
), but I think that needs to be addressed in zarr. See: zarr-developers/zarr-python#1037cc: @wankoelias @rabernat