`numpy` 2 compatibility in the `netcdf4` and `h5netcdf` backends #9136

keewis · 2024-06-19T17:02:47Z

netcdf4 didn't have numpy 2 compatible builds before the last release, so we couldn't test that yet. This may reveal a couple of issues.

User visible changes (including notable bug fixes) are documented in whats-new.rst

keewis · 2024-06-19T19:30:06Z

@kmuehlbauer, would you have time to look into the two remaining failing tests? From what I can tell, this has been emitting a DeprecationWarning for quite a while and raises an OverflowError in numpy>=2. I fear I don't know enough about our encoding operations to debug this.

(in case you know what the issue is, feel free to push a fix to this PR)

kmuehlbauer · 2024-06-19T19:33:39Z

@keewis Yes, I'll have a closer look tomorrow.

kmuehlbauer · 2024-06-20T10:24:37Z

Adding error log for verbosity:

FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype0-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8
FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype1-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8

kmuehlbauer · 2024-06-20T10:55:30Z

OK, that looks like some error using a combination of packed data and _Unsigned-attribute.
I'm currently lacking a upstream-dev environment on my laptop and WiFi is bad while traveling.

Xref: https://docs.unidata.ucar.edu/nug/current/best_practices.html#bp_Unsigned-Data plus the following section on packed data values.

kmuehlbauer · 2024-06-21T06:43:47Z

@keewis I have a local reproducer now, looking into this later today.

kmuehlbauer · 2024-06-21T11:32:58Z

Citing from above link: NetCDF-3 does not have unsigned integer primitive types.

The _Unsigned attribute was introduced, to notify any reading application, that the signed data on disk should be treated as unsigned in memory. Nevertheless the _FillValue need to be of on-disk type (signed).

The implementation is somewhat hacky, as we have to consider also scale/offset. My current proposal includes using the correct _FillValue within the test and use view instead of cast for transformation from uint -> int.

…o prevent OverflowError.

keewis · 2024-06-22T21:52:30Z

thanks for looking into this and fixing it, @kmuehlbauer (and so quickly, too)!

Just so I understand the implications of using view: we will only ever encounter numpy (or cupy, which aims to be a drop-in replacement) arrays here, right?

Edit: the failing upstream-dev CI is unrelated, that's a scipy deprecation.

kmuehlbauer · 2024-06-23T06:55:11Z

Now that you ask that, I'm a bit unsure.

_FillValue is an attribute here, a scalar, so no arrays involved. To transform from uint to int, I've wrapped it into numpy scalar to use view and assigned it back to the attribute. Should work, but there might be another (better) solution I'm missing.

keewis · 2024-06-27T17:24:51Z

revisiting this, I believe this is fine, though this does fail if for example data is backed by a cupy array. I couldn't get that to work even with main and numpy<2, so maybe this doesn't matter. @dcherian, any opinions? Otherwise this should be ready to merge (edit: and really, this is blocking a bunch of PRs, so merging sooner than later would be great).

kmuehlbauer · 2024-06-28T08:36:28Z

@keewis I'm not sure, if my fix is the ultimate solution. I'm on it with a better fitting solution today.

keewis · 2024-06-28T09:26:18Z

well, feel free to improve/modify as you see fit!

keewis · 2024-07-11T08:48:52Z

the mypy environment uses the standard environment, which means that unpinning here surfaces the issues in that CI as well. Merging sounds good to me (I just didn't want to merge my own PR), I just didn't want to merge my own PR.

dcherian · 2024-07-11T08:53:58Z

I just didn't want to merge my own PR

I think we should all get a little more comfortable doing this. It's immeasurably worse, to our user base, to leave finished PRs hanging around and unreleased.

* main: exclude the bots from the release notes (pydata#9235) switch the documentation to run with `numpy>=2` (pydata#9177) `numpy` 2 compatibility in the iris code paths (pydata#9156) `numpy` 2 compatibility in the `netcdf4` and `h5netcdf` backends (pydata#9136) Fix time indexing regression in `convert_calendar` (pydata#9192) Use duckarray assertions in test_coding_times (pydata#9226) Use reshape and ravel from duck_array_ops in coding/times.py (pydata#9225) Cleanup test_coding_times.py (pydata#9223) Only use necessary dims when creating temporary dataarray (pydata#9206) Fix two bugs in DataTree.update() (pydata#9214) Use numpy 2.0-compat `np.complex64` dtype in test (pydata#9217)

djhoese · 2024-07-16T18:31:29Z

FYI this PR seems to break the case where _FillValue is not the same type as the array. I haven't completely tracked it down, but I'm getting failures in my unstable CI with:

            if "_FillValue" in attrs:
                new_fill = np.array(attrs["_FillValue"])
                # use view here to prevent OverflowError
>               attrs["_FillValue"] = new_fill.view(signed_dtype).item()
E               ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

djhoese · 2024-07-16T18:44:38Z

Here's a reproducer:

import xarray as xr
import numpy as np

v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": -1}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})

uic = xr.coding.variables.UnsignedIntegerCoder()

uic.encode(v, "test")
# ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

Basically I'm passing 32-bit float data so signed_dtype becomes int32 in the encode method. So np.array(-1).view(np.int32) is not allowed as np.array(-1) is a 64-bit integer. Even if I pass fill value as a int16 (matching my output encoding) it still fails because it is trying to use a signed_dtype of int32 to match the 32-bit float I'm passing in.

Edit: If I pass fill value as np.array(-1, dtype=np.int32) then it seems to work.

kmuehlbauer · 2024-07-17T13:29:51Z

This works for me in latest main branch.

import xarray as xr
import numpy as np
fillvalue = np.int16(-1)
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": fillvalue}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")

@djhoese Are you writing to NETCDF3 or NETCDF4_CLASSIC? You might also just skip the "_Unsigned" attribute and use uint16 as is when writing to NETCDF4.

djhoese · 2024-07-17T14:35:07Z

@kmuehlbauer The netcdf files being produced are NetCDF4 and are being ingested by a third-party Java application so using unsigned types directly isn't allowed (Java doesn't have unsigned types).

djhoese · 2024-07-17T14:38:22Z

It seems I can do fill value as np.array(65535, dtype=np.uint16) but then it does show up in the resulting Variable's attrs as 65535. Oh I can do plain 65535 too. I haven't tested what this shows up as in a final NetCDF file.

Edit: As far as I can find the _FillValue is always supposed to match the dtype of the packed on-disk variable. Specifying 65535 makes this not true.

djhoese · 2024-07-17T15:24:55Z

Ok I've made a PR for Satpy that I think fixes my use case. I think the main misunderstanding here is that someone used to writing NetCDF files outside of xarray and possibly familiar with CF standards will think they need _FillValue to match the dtype of the on-disk/in-file variable. Xarray is assuming here that the _FillValue is in the type of the in-memory data or at least that's what I'm telling myself. My workaround is to pass the -1 as the unsigned equivalent (65535) and then xarray does the conversion to the on-disk type (signed int) "for me".

In my opinion the fix in this PR is incorrect, but I feel like I'm missing some use case for why it needed to be changed in the first place so I'm not sure I can suggest something better.

You mentioned above that there were overflow cases being errored on in numpy 2 with the old code, but if the user is specifying _FillValue as an on-disk-dtype compatible value then that never should have happened, right?

kmuehlbauer · 2024-07-17T15:38:01Z

Providing the signed int16 equivalent (-1) as _FillValue should work, as in my example above. Does this not work?

kmuehlbauer · 2024-07-17T15:49:21Z

This seems to work perfectly fine:

import xarray as xr
import numpy as np
fillvalue = np.uint16(65535)
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": fillvalue}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")

<xarray.Variable (y: 10, x: 10)> Size: 200B
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int16)
Attributes:
    _FillValue:  -1
    _Unsigned:   true

You can specify _FillValue either as uint or int, it will be converted to packed type dtype (int) using .view (this was introduced in this PR, since numpy2 errors now when just trying to cast).

djhoese · 2024-07-17T15:49:49Z

Yes, you're right. This works. I missed that that's what you were doing in your code. However, I'd say this isn't expected from a user point of view. I don't think I should have to specify a numpy scalar for _FillValue. Regular python integers should work too.

djhoese · 2024-07-17T16:26:49Z

@kmuehlbauer Further up you had said that the change to the new view logic was to avoid overflow complaints from numpy 2. In those cases, what was being provided as the _FillValue?

djhoese · 2024-07-17T17:29:55Z

@kmuehlbauer I have another case I just discovered after working around the -1 case as discussed. I have a dataset coming from somewhere else that is uint8 and a value of 1 is the fill value. The variable needs to be stored as a signed int8 with _Unsigned set to make the third-party application happy.

In [6]: v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.uint8), attrs={"_FillValue": 1}, encoding={"_Unsigned": "true", "dtype": "int8", "zlib": True})

In [7]: uic.encode(v, "test")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 uic.encode(v, "test")

File ~/miniforge3/envs/satpy_py312/lib/python3.12/site-packages/xarray/coding/variables.py:524, in UnsignedIntegerCoder.encode(self, variable, name)
    522     new_fill = np.array(attrs["_FillValue"])
    523     # use view here to prevent OverflowError
--> 524     attrs["_FillValue"] = new_fill.view(signed_dtype).item()
    525     # new_fill = signed_dtype.type(attrs["_FillValue"])
    526     # attrs["_FillValue"] = new_fill
    527 data = duck_array_ops.astype(duck_array_ops.around(data), signed_dtype)

ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

So the above fails because new_fill = np.array(attrs["_FillValue"]) is taking a python native integer and making it an np.int64. Now the .view(signed_dtype) logic fails because it is trying to cast a 64-bit integer to a 8-bit integer.

Note: I copied the code from this PR into my stable xarray environment so I could swap between the two behaviors quickly.

kmuehlbauer · 2024-07-17T19:00:27Z

@djhoese Obviously I did not take all possibilities into account, as it unfortunately looks like 😢. I have to admit, that the CF coding issues are getting increasingly annoying 🙄.

I'm currently traveling and may not have time to dedicate for the next 2 weeks. If you or others see immediate fixes for this, please go ahead.

djhoese · 2024-07-17T19:09:57Z

I'll see what I can do. I think this issue I've brought up can be summarized as: _FillValue should be able to be a python native integer or numpy scalar. I think the main reason it isn't working is the np.array(attrs["_FillValue"]) because it takes a python native integer and turns it into a int64 most of the time.

djhoese · 2024-07-18T02:36:26Z

@kmuehlbauer and others, question for you as I work on "fixing" this: Is the roundtrip the only thing that matters? Or can xarray accept input and assume it knows best about what the user wanted? Especially in this _Unsigned case, I'm not sure it is a clear answer. Does the user always provide a _FillValue of the output on-disk type? Or is xarray allowed to normalize any fill value to the on-disk type? If it does that normalization, wrote it to a NetCDF file, and read it back in then (I think) it wouldn't have the original _FillValue the user started with.

In the original create_unsigned_masked_scaled_data test helper function the on-disk type was int8 (i1) and the _FillValue was 255 in the encoding dict. @kmuehlbauer changed that fill value to np.int8(-1) which makes sense if the encoding of a Dataset represents what will live on-disk when the data is encoded...but what lives in .attrs["_FillValue"] when it is decoded/loaded from disk? This is confusing.

Case 1: I have 32-bit float data in-memory. I want to write it to disk to fit in an 8-bit signed NetCDF variable with _Unsigned set to true. That signed integer data should use -1 to mark invalid data. So in any .encoding dictionaries .encoding["_FillValue"] should be -1 to match the on-disk type. In .attrs["_FillValue"] should also always be -1, right? Unless the user is reading poorly formatted data, the _FillValue should never be 255. Right?

Maybe I should get more sleep before dealing with CF.

dcherian · 2024-07-18T03:21:13Z

Seems like the dtype of _FillValue should match the dtype of the array on disk (http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#missing-data)

The scalar attribute with the name _FillValue and of the same type as its variable is recognized by the netCDF library as the value used to pre-fill disk space allocated to the variable.
...
The missing values of a variable with scale_factor and/or add_offset attributes (see Section 8.1, "Packed Data") are interpreted relative to the variable’s external values (a.k.a. the packed values, the raw values, the values stored in the netCDF file), not the values that result after the scale and offset are applied.

djhoese · 2024-07-18T13:24:37Z

Yes, but then what is the point of the decode logic currently? The existing logic is to convert _FillValue to the unsigned type that the data has been unpacked as (so the in-memory version). Besides backwards compatibility, if _FillValue does not match the dtype of the in-memory unpacked data then users can't do data == fill_value masking if they turned auto masking off.

kmuehlbauer · 2024-07-18T14:52:21Z

I've worked on the CF coding part in the past and found it not easy (huh). The idea was to have different coders for the different parts of the CF coding pipeline.

The problem is that a priori knowledge is needed for some Coders depending on other Coders. Here in this example in the decoding path the unsigned representation of _FillValue is needed in subsequent CFMaskCoder.

We can't just run CFMaskCoder before UnsignedIntegerCoder as we would already transform to float there.

Maybe we should handle the complete pipeline for _Unsigned in one Coder?

dcherian · 2024-07-18T15:08:26Z

we should handle the complete pipeline for _Unsigned in one Coder?

This would be fine.

djhoese · 2024-07-18T16:43:15Z

@kmuehlbauer But isn't the CFMaskCoder run first?

    for coder in [
        times.CFDatetimeCoder(),
        times.CFTimedeltaCoder(),
        variables.CFScaleOffsetCoder(),
        variables.CFMaskCoder(),
        variables.UnsignedIntegerCoder(),
        variables.NativeEnumCoder(),
        variables.NonStringCoder(),
        variables.DefaultFillvalueCoder(),
        variables.BooleanCoder(),
    ]:
        var = coder.encode(var, name=name)

I think I agree with the current behavior that the in-memory .attrs["_FillValue"] should match the in-memory array's dtype (uint in our examples so far). Once the Variable/DataArray is encoded then _FillValue in the .attrs will match the on-disk dtype. It looks like _FillValue is never meant to appear in .encoding so maybe I just forget about that extra complication.

kmuehlbauer · 2024-07-18T16:58:05Z

@djhoese Yes, for encoding. For decoding the other way round.

djhoese · 2024-07-18T21:10:07Z

First try: #9258

don't remove netcdf4 from the upstream-dev environment

c54bb54

keewis added the run-upstream Run upstream CI label Jun 19, 2024

also stop removing h5py and hdf5

379d408

keewis changed the title ~~numpy 2 compatibility in the netcdf4 backend~~ numpy 2 compatibility in the netcdf4 and h5netcdf backends Jun 19, 2024

keewis added 2 commits June 19, 2024 20:54

hard-code the precision (I believe this was missed in pydata#9081)

429f87d

don't remove h5py either

5863384

Merge branch 'main' into numpy2-netcdf4

d8c7aa9

use on-diks _FillValue as standrd expects, use view instead of cast t…

d463bc0

…o prevent OverflowError.

kmuehlbauer mentioned this pull request Jun 22, 2024

Can not save timedelta data arrays with small integer dtypes and _FillValue #9134

Open

5 tasks

Merge branch 'main' into numpy2-netcdf4

eed1e2e

whats-new

7e856ba

This was referenced Jun 23, 2024

numpy 2 compatibility in the iris code paths #9156

Merged

Change np.core.defchararray to np.char (#9165) #9166

Merged

keewis added 2 commits June 27, 2024 14:27

Merge branch 'main' into numpy2-netcdf4

9a0e5cc

Merge branch 'main' into numpy2-netcdf4

f6fe3ef

keewis added 2 commits June 27, 2024 21:55

Merge branch 'main' into numpy2-netcdf4

6a5a93c

unpin numpy

f63208e

keewis added the plan to merge Final call for comments label Jun 27, 2024

Merge branch 'main' into numpy2-netcdf4

5bc7728

dcherian mentioned this pull request Jul 11, 2024

release 2024.07.0 #9230

Closed

dcherian merged commit 7087ca4 into pydata:main Jul 11, 2024
22 of 28 checks passed

keewis deleted the numpy2-netcdf4 branch July 11, 2024 08:54

djhoese mentioned this pull request Jul 17, 2024

Fix default AWIPS tiled _FillValue of -1 for newer versions of xarray pytroll/satpy#2858

Merged

4 tasks

djhoese mentioned this pull request Jul 18, 2024

Restore ability to specify _FillValue as Python native integers #9258

Merged

4 tasks

djhoese mentioned this pull request Jul 22, 2024

Combine UnsignedIntegerCoder and CFMaskCoder #9266

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`numpy` 2 compatibility in the `netcdf4` and `h5netcdf` backends #9136

`numpy` 2 compatibility in the `netcdf4` and `h5netcdf` backends #9136

keewis commented Jun 19, 2024 •

edited

Loading

keewis commented Jun 19, 2024 •

edited

Loading

kmuehlbauer commented Jun 19, 2024

kmuehlbauer commented Jun 20, 2024

kmuehlbauer commented Jun 20, 2024

kmuehlbauer commented Jun 21, 2024

kmuehlbauer commented Jun 21, 2024

keewis commented Jun 22, 2024 •

edited

Loading

kmuehlbauer commented Jun 23, 2024

keewis commented Jun 27, 2024 •

edited

Loading

kmuehlbauer commented Jun 28, 2024

keewis commented Jun 28, 2024 •

edited

Loading

keewis commented Jul 11, 2024

dcherian commented Jul 11, 2024

djhoese commented Jul 16, 2024

djhoese commented Jul 16, 2024 •

edited

Loading

kmuehlbauer commented Jul 17, 2024

djhoese commented Jul 17, 2024

djhoese commented Jul 17, 2024 •

edited

Loading

djhoese commented Jul 17, 2024

kmuehlbauer commented Jul 17, 2024

kmuehlbauer commented Jul 17, 2024

djhoese commented Jul 17, 2024

djhoese commented Jul 17, 2024

djhoese commented Jul 17, 2024

kmuehlbauer commented Jul 17, 2024

djhoese commented Jul 17, 2024

djhoese commented Jul 18, 2024

dcherian commented Jul 18, 2024

djhoese commented Jul 18, 2024 •

edited

Loading

kmuehlbauer commented Jul 18, 2024

dcherian commented Jul 18, 2024

djhoese commented Jul 18, 2024

kmuehlbauer commented Jul 18, 2024

djhoese commented Jul 18, 2024

numpy 2 compatibility in the netcdf4 and h5netcdf backends #9136

numpy 2 compatibility in the netcdf4 and h5netcdf backends #9136

Conversation

keewis commented Jun 19, 2024 • edited Loading

keewis commented Jun 19, 2024 • edited Loading

kmuehlbauer commented Jun 19, 2024

kmuehlbauer commented Jun 20, 2024

kmuehlbauer commented Jun 20, 2024

kmuehlbauer commented Jun 21, 2024

kmuehlbauer commented Jun 21, 2024

keewis commented Jun 22, 2024 • edited Loading

kmuehlbauer commented Jun 23, 2024

keewis commented Jun 27, 2024 • edited Loading

kmuehlbauer commented Jun 28, 2024

keewis commented Jun 28, 2024 • edited Loading

keewis commented Jul 11, 2024

dcherian commented Jul 11, 2024

djhoese commented Jul 16, 2024

djhoese commented Jul 16, 2024 • edited Loading

kmuehlbauer commented Jul 17, 2024

djhoese commented Jul 17, 2024

djhoese commented Jul 17, 2024 • edited Loading

djhoese commented Jul 17, 2024

kmuehlbauer commented Jul 17, 2024

kmuehlbauer commented Jul 17, 2024

djhoese commented Jul 17, 2024

djhoese commented Jul 17, 2024

djhoese commented Jul 17, 2024

kmuehlbauer commented Jul 17, 2024

djhoese commented Jul 17, 2024

djhoese commented Jul 18, 2024

dcherian commented Jul 18, 2024

djhoese commented Jul 18, 2024 • edited Loading

kmuehlbauer commented Jul 18, 2024

dcherian commented Jul 18, 2024

djhoese commented Jul 18, 2024

kmuehlbauer commented Jul 18, 2024

djhoese commented Jul 18, 2024

`numpy` 2 compatibility in the `netcdf4` and `h5netcdf` backends #9136

`numpy` 2 compatibility in the `netcdf4` and `h5netcdf` backends #9136

keewis commented Jun 19, 2024 •

edited

Loading

keewis commented Jun 19, 2024 •

edited

Loading

keewis commented Jun 22, 2024 •

edited

Loading

keewis commented Jun 27, 2024 •

edited

Loading

keewis commented Jun 28, 2024 •

edited

Loading

djhoese commented Jul 16, 2024 •

edited

Loading

djhoese commented Jul 17, 2024 •

edited

Loading

djhoese commented Jul 18, 2024 •

edited

Loading