`numpy` 2 compatibility in the `netcdf4` and `h5netcdf` backends #9136
@kmuehlbauer, would you have time to look into the two remaining failing tests? From what I can tell, this has been emitting a (in case you know what the issue is, feel free to push a fix to this PR) |
@keewis Yes, I'll have a closer look tomorrow. |
Adding error log for verbosity:
FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype0-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8
FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype1-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8 |
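For readers unfamiliar with this error, the following is a minimal sketch of the underlying NumPy behaviour, not the code path xarray actually takes. It assumes only `numpy`; the helper name `cast_raises_overflow` is made up for illustration. Under NumPy 2 an out-of-range Python integer is rejected when converted to an unsigned dtype, whereas reinterpreting the bits of an already-signed value does not involve a range check.

```python
import warnings

import numpy as np


def cast_raises_overflow(value, dtype):
    """Return True if NumPy refuses to convert the Python int to `dtype`.

    Hypothetical helper for illustration only.
    """
    try:
        with warnings.catch_warnings():
            # Some NumPy 1.x releases only emit a DeprecationWarning here.
            warnings.simplefilter("ignore")
            np.array(value, dtype=dtype)
    except OverflowError:
        return True
    return False


# Reinterpret the bit pattern of int8(-1) as uint8 instead of converting
# the Python integer -1 directly.
as_unsigned = np.array(-1, dtype=np.int8).view(np.uint8).item()

print(np.__version__, cast_raises_overflow(-1, np.uint8), as_unsigned)
```

Whether the direct conversion raises depends on the installed NumPy version, so the sketch reports it rather than assuming it.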
OK, that looks like some error using a combination of packed data and Xref: https://docs.unidata.ucar.edu/nug/current/best_practices.html#bp_Unsigned-Data plus the following section on packed data values. |
@keewis I have a local reproducer now, looking into this later today. |
Citing from above link: NetCDF-3 does not have unsigned integer primitive types. The The implementation is somewhat hacky, as we have to consider also scale/offset. My current proposal includes using the correct |
…o prevent OverflowError.
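As a rough illustration of the convention cited from the NetCDF best practices, the sketch below decodes signed bytes that carry `_Unsigned = "true"` together with `scale_factor` and `add_offset`. The function `decode_packed` and the sample values are assumptions made for this example; it is not the implementation in this PR.

```python
import numpy as np


def decode_packed(stored, attrs):
    """Decode packed integers following the `_Unsigned` convention.

    Hypothetical helper: reinterprets signed integers as unsigned when
    `_Unsigned` is "true", then applies scale and offset.
    """
    data = stored
    if attrs.get("_Unsigned") == "true" and data.dtype.kind == "i":
        # Same bytes, unsigned interpretation (for example int8 -> uint8).
        data = data.view(f"u{data.dtype.itemsize}")
    scale = attrs.get("scale_factor", 1.0)
    offset = attrs.get("add_offset", 0.0)
    return data.astype(np.float64) * scale + offset


stored = np.array([0, 127, -128, -1], dtype=np.int8)
attrs = {"_Unsigned": "true", "scale_factor": 0.5, "add_offset": 10.0}
decoded = decode_packed(stored, attrs)
print(decoded)
```

The unsigned reinterpretation has to happen before scaling, otherwise `-128` and `-1` would be scaled as negative numbers instead of `128` and `255`.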
thanks for looking into this and fixing it, @kmuehlbauer (and so quickly, too)! Just so I understand the implications of using Edit: the failing upstream-dev CI is unrelated, that's a |
Now that you ask that, I'm a bit unsure.
|
revisiting this, I believe this is fine, though this does fail if for example |
@keewis I'm not sure if my fix is the ultimate solution. I'm working on a better-fitting solution today. |
well, feel free to improve/modify as you see fit! |
The mypy environment uses the standard environment, which means that unpinning here surfaces the issues in that CI as well. Merging sounds good to me (I just didn't want to merge my own PR). |
I think we should all get a little more comfortable doing this. It's immeasurably worse, to our user base, to leave finished PRs hanging around and unreleased. |
* main:
  exclude the bots from the release notes (pydata#9235)
  switch the documentation to run with `numpy>=2` (pydata#9177)
  `numpy` 2 compatibility in the iris code paths (pydata#9156)
  `numpy` 2 compatibility in the `netcdf4` and `h5netcdf` backends (pydata#9136)
  Fix time indexing regression in `convert_calendar` (pydata#9192)
  Use duckarray assertions in test_coding_times (pydata#9226)
  Use reshape and ravel from duck_array_ops in coding/times.py (pydata#9225)
  Cleanup test_coding_times.py (pydata#9223)
  Only use necessary dims when creating temporary dataarray (pydata#9206)
  Fix two bugs in DataTree.update() (pydata#9214)
  Use numpy 2.0-compat `np.complex64` dtype in test (pydata#9217)
FYI this PR seems to break the case where
|
Here's a reproducer:
import xarray as xr
import numpy as np
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": -1}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")
# ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged
Basically I'm passing 32-bit float data so Edit: If I pass fill value as |
This works for me in latest main branch.
import xarray as xr
import numpy as np
fillvalue = np.int16(-1)
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": fillvalue}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")
@djhoese Are you writing to NETCDF3 or NETCDF4_CLASSIC? You might also just skip the "_Unsigned" attribute and use |
@kmuehlbauer The netcdf files being produced are NetCDF4 and are being ingested by a third-party Java application so using unsigned types directly isn't allowed (Java doesn't have unsigned types). |
It seems I can do fill value as Edit: As far as I can find the |
Ok I've made a PR for Satpy that I think fixes my use case. I think the main misunderstanding here is that someone used to writing NetCDF files outside of xarray and possibly familiar with CF standards will think they need In my opinion the fix in this PR is incorrect, but I feel like I'm missing some use case for why it needed to be changed in the first place so I'm not sure I can suggest something better. You mentioned above that there were overflow cases being errored on in numpy 2 with the old code, but if the user is specifying |
Providing the signed int16 equivalent (-1) as |
This seems to work perfectly fine:
import xarray as xr
import numpy as np
fillvalue = np.uint16(65535)
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": fillvalue}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")
<xarray.Variable (y: 10, x: 10)> Size: 200B
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int16)
Attributes:
    _FillValue: -1
    _Unsigned: true
You can specify |
Yes, you're right. This works. I missed that that's what you were doing in your code. However, I'd say this isn't expected from a user point of view. I don't think I should have to specify a numpy scalar for |
@kmuehlbauer Further up you had said that the change to the new |
@kmuehlbauer I have another case I just discovered after working around the
In [6]: v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.uint8), attrs={"_FillValue": 1}, encoding={"_Unsigned": "true", "dtype": "int8", "zlib": True})
In [7]: uic.encode(v, "test")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[7], line 1
----> 1 uic.encode(v, "test")
File ~/miniforge3/envs/satpy_py312/lib/python3.12/site-packages/xarray/coding/variables.py:524, in UnsignedIntegerCoder.encode(self, variable, name)
522 new_fill = np.array(attrs["_FillValue"])
523 # use view here to prevent OverflowError
--> 524 attrs["_FillValue"] = new_fill.view(signed_dtype).item()
525 # new_fill = signed_dtype.type(attrs["_FillValue"])
526 # attrs["_FillValue"] = new_fill
527 data = duck_array_ops.astype(duck_array_ops.around(data), signed_dtype)
ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged
So the above fails because Note: I copied the code from this PR into my stable xarray environment so I could swap between the two behaviors quickly. |
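The traceback above can be reproduced with plain NumPy: `np.array(1)` gets NumPy's default integer dtype, whose item size differs from `int8`, so a 0-d `.view(...)` is rejected. The sketch below shows that failure and one possible way around it. The helper `to_signed_fill` is hypothetical and is not the fix that was eventually proposed; it assumes the fill value is an integer that fits either the signed dtype or its unsigned counterpart.

```python
import numpy as np

# Reproduce the failure: the default integer dtype is wider than int8.
try:
    np.array(1).view(np.int8)
    view_failed = False
except ValueError:
    view_failed = True


def to_signed_fill(fill, signed_dtype):
    """Express an integer fill value in the signed on-disk dtype.

    Hypothetical helper: negative values are taken as already signed,
    non-negative values are placed in the same-width unsigned dtype
    first so that the view keeps the item size unchanged.
    """
    signed_dtype = np.dtype(signed_dtype)
    unsigned_dtype = np.dtype(f"u{signed_dtype.itemsize}")
    if fill < 0:
        return np.array(fill, dtype=signed_dtype).item()
    return np.array(fill, dtype=unsigned_dtype).view(signed_dtype).item()


print(view_failed, to_signed_fill(65535, "int16"), to_signed_fill(1, "int8"))
```

With this approach `65535` and `-1` both map to the same `int16` fill value, which matches the equivalence discussed earlier in the thread.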
@djhoese Obviously I did not take all possibilities into account, as it unfortunately looks like 😢. I have to admit that the CF coding issues are getting increasingly annoying 🙄. I'm currently traveling and may not have time to dedicate to this for the next 2 weeks. If you or others see immediate fixes for this, please go ahead. |
I'll see what I can do. I think this issue I've brought up can be summarized as: |
@kmuehlbauer and others, question for you as I work on "fixing" this: Is the roundtrip the only thing that matters? Or can xarray accept input and assume it knows best about what the user wanted? Especially in this In the original Case 1: I have 32-bit float data in-memory. I want to write it to disk to fit in an 8-bit signed NetCDF variable with Maybe I should get more sleep before dealing with CF. |
Seems like the dtype of
|
Yes, but then what is the point of the |
I've worked on the CF coding part in the past and found it not easy (huh). The idea was to have different coders for the different parts of the CF coding pipeline. The problem is that a priori knowledge is needed for some Coders depending on other Coders. Here in this example in the decoding path the unsigned representation of _FillValue is needed in subsequent CFMaskCoder. We can't just run CFMaskCoder before UnsignedIntegerCoder as we would already transform to float there. Maybe we should handle the complete pipeline for _Unsigned in one Coder? |
This would be fine. |
@kmuehlbauer But isn't the CFMaskCoder run first?
for coder in [
    times.CFDatetimeCoder(),
    times.CFTimedeltaCoder(),
    variables.CFScaleOffsetCoder(),
    variables.CFMaskCoder(),
    variables.UnsignedIntegerCoder(),
    variables.NativeEnumCoder(),
    variables.NonStringCoder(),
    variables.DefaultFillvalueCoder(),
    variables.BooleanCoder(),
]:
    var = coder.encode(var, name=name)
I think I agree with the current behavior that the in-memory |
@djhoese Yes, for encoding. For decoding the other way round. |
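To make the ordering point concrete, here is a toy sketch of a two-step pipeline in which encoding runs the coders in list order and decoding runs them in reverse. The classes `ToyMaskCoder` and `ToyUnsignedCoder` and the constant `FILL_UNSIGNED` are invented for this example and are much simpler than xarray's coders. It assumes the fill value is given in the unsigned, in-memory domain.

```python
import numpy as np

FILL_UNSIGNED = 255  # fill value in the unsigned (in-memory) domain


class ToyMaskCoder:
    def encode(self, data):
        # Replace missing values with the fill value.
        return np.where(np.isnan(data), FILL_UNSIGNED, data)

    def decode(self, data):
        # Needs the unsigned view of the data to match the fill value.
        out = data.astype(np.float32)
        out[data == FILL_UNSIGNED] = np.nan
        return out


class ToyUnsignedCoder:
    def encode(self, data):
        # Round, store as uint8, then reinterpret the bytes as int8.
        return np.around(data).astype(np.uint8).view(np.int8)

    def decode(self, data):
        return data.view(np.uint8)


CODERS = [ToyMaskCoder(), ToyUnsignedCoder()]


def encode(data):
    for coder in CODERS:
        data = coder.encode(data)
    return data


def decode(data):
    for coder in reversed(CODERS):
        data = coder.decode(data)
    return data


original = np.array([0.0, 1.0, 200.0, np.nan], dtype=np.float32)
stored = encode(original)
restored = decode(stored)
print(stored, restored)
```

In the decode direction the mask step only sees `255` because the unsigned step has already run, which is the dependency between coders described above.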
First try: #9258 |
`netcdf4` didn't have `numpy` 2 compatible builds before the last release, so we couldn't test that yet. This may reveal a couple of issues.
whats-new.rst