Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

numpy 2 compatibility in the netcdf4 and h5netcdf backends #9136

Merged
merged 20 commits into from
Jul 11, 2024

Conversation

keewis
Copy link
Collaborator

@keewis keewis commented Jun 19, 2024

netcdf4 didn't have numpy 2 compatible builds before the last release, so we couldn't test that yet. This may reveal a couple of issues.

  • User visible changes (including notable bug fixes) are documented in whats-new.rst

@keewis keewis added the run-upstream Run upstream CI label Jun 19, 2024
@keewis keewis changed the title numpy 2 compatibility in the netcdf4 backend numpy 2 compatibility in the netcdf4 and h5netcdf backends Jun 19, 2024
@keewis
Copy link
Collaborator Author

keewis commented Jun 19, 2024

@kmuehlbauer, would you have time to look into the two remaining failing tests? From what I can tell, this has been emitting a DeprecationWarning for quite a while and raises an OverflowError in numpy>=2. I fear I don't know enough about our encoding operations to debug this.

(in case you know what the issue is, feel free to push a fix to this PR)

@kmuehlbauer
Copy link
Contributor

@keewis Yes, I'll have a closer look tomorrow.

@kmuehlbauer
Copy link
Contributor

Adding error log for verbosity:

FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype0-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8
FAILED xarray/tests/test_conventions.py::TestCFEncodedDataStore::test_roundtrip_mask_and_scale[dtype1-create_unsigned_masked_scaled_data-create_encoded_unsigned_masked_scaled_data] - OverflowError: Failed to decode variable 'x': Python integer -1 out of bounds for uint8

@kmuehlbauer
Copy link
Contributor

OK, that looks like some error using a combination of packed data and _Unsigned-attribute.
I'm currently lacking a upstream-dev environment on my laptop and WiFi is bad while traveling.

Xref: https://docs.unidata.ucar.edu/nug/current/best_practices.html#bp_Unsigned-Data plus the following section on packed data values.

@kmuehlbauer
Copy link
Contributor

@keewis I have a local reproducer now, looking into this later today.

@kmuehlbauer
Copy link
Contributor

Citing from above link: NetCDF-3 does not have unsigned integer primitive types.

The _Unsigned attribute was introduced, to notify any reading application, that the signed data on disk should be treated as unsigned in memory. Nevertheless the _FillValue need to be of on-disk type (signed).

The implementation is somewhat hacky, as we have to consider also scale/offset. My current proposal includes using the correct _FillValue within the test and use view instead of cast for transformation from uint -> int.

@keewis
Copy link
Collaborator Author

keewis commented Jun 22, 2024

thanks for looking into this and fixing it, @kmuehlbauer (and so quickly, too)!

Just so I understand the implications of using view: we will only ever encounter numpy (or cupy, which aims to be a drop-in replacement) arrays here, right?

Edit: the failing upstream-dev CI is unrelated, that's a scipy deprecation.

@kmuehlbauer
Copy link
Contributor

Now that you ask that, I'm a bit unsure.

_FillValue is an attribute here, a scalar, so no arrays involved. To transform from uint to int, I've wrapped it into numpy scalar to use view and assigned it back to the attribute. Should work, but there might be another (better) solution I'm missing.

@keewis
Copy link
Collaborator Author

keewis commented Jun 27, 2024

revisiting this, I believe this is fine, though this does fail if for example data is backed by a cupy array. I couldn't get that to work even with main and numpy<2, so maybe this doesn't matter. @dcherian, any opinions? Otherwise this should be ready to merge (edit: and really, this is blocking a bunch of PRs, so merging sooner than later would be great).

@keewis keewis added the plan to merge Final call for comments label Jun 27, 2024
@kmuehlbauer
Copy link
Contributor

@keewis I'm not sure, if my fix is the ultimate solution. I'm on it with a better fitting solution today.

@keewis
Copy link
Collaborator Author

keewis commented Jun 28, 2024

well, feel free to improve/modify as you see fit!

@dcherian dcherian mentioned this pull request Jul 11, 2024
@keewis
Copy link
Collaborator Author

keewis commented Jul 11, 2024

the mypy environment uses the standard environment, which means that unpinning here surfaces the issues in that CI as well. Merging sounds good to me (I just didn't want to merge my own PR), I just didn't want to merge my own PR.

@dcherian
Copy link
Contributor

I just didn't want to merge my own PR

I think we should all get a little more comfortable doing this. It's immeasurably worse, to our user base, to leave finished PRs hanging around and unreleased.

@dcherian dcherian merged commit 7087ca4 into pydata:main Jul 11, 2024
22 of 28 checks passed
@keewis keewis deleted the numpy2-netcdf4 branch July 11, 2024 08:54
dcherian added a commit to dcherian/xarray that referenced this pull request Jul 11, 2024
* main:
  exclude the bots from the release notes (pydata#9235)
  switch the documentation to run with `numpy>=2` (pydata#9177)
  `numpy` 2 compatibility in the iris code paths (pydata#9156)
  `numpy` 2 compatibility in the `netcdf4` and `h5netcdf` backends (pydata#9136)
  Fix time indexing regression in `convert_calendar` (pydata#9192)
  Use duckarray assertions in test_coding_times (pydata#9226)
  Use reshape and ravel from duck_array_ops in coding/times.py (pydata#9225)
  Cleanup test_coding_times.py (pydata#9223)
  Only use necessary dims when creating temporary dataarray (pydata#9206)
  Fix two bugs in DataTree.update() (pydata#9214)
  Use numpy 2.0-compat `np.complex64` dtype in test (pydata#9217)
@djhoese
Copy link
Contributor

djhoese commented Jul 16, 2024

FYI this PR seems to break the case where _FillValue is not the same type as the array. I haven't completely tracked it down, but I'm getting failures in my unstable CI with:

            if "_FillValue" in attrs:
                new_fill = np.array(attrs["_FillValue"])
                # use view here to prevent OverflowError
>               attrs["_FillValue"] = new_fill.view(signed_dtype).item()
E               ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

@djhoese
Copy link
Contributor

djhoese commented Jul 16, 2024

Here's a reproducer:

import xarray as xr
import numpy as np

v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": -1}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})

uic = xr.coding.variables.UnsignedIntegerCoder()

uic.encode(v, "test")
# ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

Basically I'm passing 32-bit float data so signed_dtype becomes int32 in the encode method. So np.array(-1).view(np.int32) is not allowed as np.array(-1) is a 64-bit integer. Even if I pass fill value as a int16 (matching my output encoding) it still fails because it is trying to use a signed_dtype of int32 to match the 32-bit float I'm passing in.

Edit: If I pass fill value as np.array(-1, dtype=np.int32) then it seems to work.

@kmuehlbauer
Copy link
Contributor

This works for me in latest main branch.

import xarray as xr
import numpy as np
fillvalue = np.int16(-1)
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": fillvalue}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")

@djhoese Are you writing to NETCDF3 or NETCDF4_CLASSIC? You might also just skip the "_Unsigned" attribute and use uint16 as is when writing to NETCDF4.

@djhoese
Copy link
Contributor

djhoese commented Jul 17, 2024

@kmuehlbauer The netcdf files being produced are NetCDF4 and are being ingested by a third-party Java application so using unsigned types directly isn't allowed (Java doesn't have unsigned types).

@djhoese
Copy link
Contributor

djhoese commented Jul 17, 2024

It seems I can do fill value as np.array(65535, dtype=np.uint16) but then it does show up in the resulting Variable's attrs as 65535. Oh I can do plain 65535 too. I haven't tested what this shows up as in a final NetCDF file.

Edit: As far as I can find the _FillValue is always supposed to match the dtype of the packed on-disk variable. Specifying 65535 makes this not true.

@djhoese
Copy link
Contributor

djhoese commented Jul 17, 2024

Ok I've made a PR for Satpy that I think fixes my use case. I think the main misunderstanding here is that someone used to writing NetCDF files outside of xarray and possibly familiar with CF standards will think they need _FillValue to match the dtype of the on-disk/in-file variable. Xarray is assuming here that the _FillValue is in the type of the in-memory data or at least that's what I'm telling myself. My workaround is to pass the -1 as the unsigned equivalent (65535) and then xarray does the conversion to the on-disk type (signed int) "for me".

In my opinion the fix in this PR is incorrect, but I feel like I'm missing some use case for why it needed to be changed in the first place so I'm not sure I can suggest something better.

You mentioned above that there were overflow cases being errored on in numpy 2 with the old code, but if the user is specifying _FillValue as an on-disk-dtype compatible value then that never should have happened, right?

@kmuehlbauer
Copy link
Contributor

Providing the signed int16 equivalent (-1) as _FillValue should work, as in my example above. Does this not work?

@kmuehlbauer
Copy link
Contributor

This seems to work perfectly fine:

import xarray as xr
import numpy as np
fillvalue = np.uint16(65535)
v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.float32), attrs={"_FillValue": fillvalue}, encoding={"_Unsigned": "true", "dtype": "int16", "zlib": True})
uic = xr.coding.variables.UnsignedIntegerCoder()
uic.encode(v, "test")
<xarray.Variable (y: 10, x: 10)> Size: 200B
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int16)
Attributes:
    _FillValue:  -1
    _Unsigned:   true

You can specify _FillValue either as uint or int, it will be converted to packed type dtype (int) using .view (this was introduced in this PR, since numpy2 errors now when just trying to cast).

@djhoese
Copy link
Contributor

djhoese commented Jul 17, 2024

Yes, you're right. This works. I missed that that's what you were doing in your code. However, I'd say this isn't expected from a user point of view. I don't think I should have to specify a numpy scalar for _FillValue. Regular python integers should work too.

@djhoese
Copy link
Contributor

djhoese commented Jul 17, 2024

@kmuehlbauer Further up you had said that the change to the new view logic was to avoid overflow complaints from numpy 2. In those cases, what was being provided as the _FillValue?

@djhoese
Copy link
Contributor

djhoese commented Jul 17, 2024

@kmuehlbauer I have another case I just discovered after working around the -1 case as discussed. I have a dataset coming from somewhere else that is uint8 and a value of 1 is the fill value. The variable needs to be stored as a signed int8 with _Unsigned set to make the third-party application happy.

In [6]: v = xr.Variable(("y", "x"), np.zeros((10, 10), dtype=np.uint8), attrs={"_FillValue": 1}, encoding={"_Unsigned": "true", "dtype": "int8", "zlib": True})

In [7]: uic.encode(v, "test")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 uic.encode(v, "test")

File ~/miniforge3/envs/satpy_py312/lib/python3.12/site-packages/xarray/coding/variables.py:524, in UnsignedIntegerCoder.encode(self, variable, name)
    522     new_fill = np.array(attrs["_FillValue"])
    523     # use view here to prevent OverflowError
--> 524     attrs["_FillValue"] = new_fill.view(signed_dtype).item()
    525     # new_fill = signed_dtype.type(attrs["_FillValue"])
    526     # attrs["_FillValue"] = new_fill
    527 data = duck_array_ops.astype(duck_array_ops.around(data), signed_dtype)

ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

So the above fails because new_fill = np.array(attrs["_FillValue"]) is taking a python native integer and making it an np.int64. Now the .view(signed_dtype) logic fails because it is trying to cast a 64-bit integer to a 8-bit integer.

Note: I copied the code from this PR into my stable xarray environment so I could swap between the two behaviors quickly.

@kmuehlbauer
Copy link
Contributor

@djhoese Obviously I did not take all possibilities into account, as it unfortunately looks like 😢. I have to admit, that the CF coding issues are getting increasingly annoying 🙄.

I'm currently traveling and may not have time to dedicate for the next 2 weeks. If you or others see immediate fixes for this, please go ahead.

@djhoese
Copy link
Contributor

djhoese commented Jul 17, 2024

I'll see what I can do. I think this issue I've brought up can be summarized as: _FillValue should be able to be a python native integer or numpy scalar. I think the main reason it isn't working is the np.array(attrs["_FillValue"]) because it takes a python native integer and turns it into a int64 most of the time.

@djhoese
Copy link
Contributor

djhoese commented Jul 18, 2024

@kmuehlbauer and others, question for you as I work on "fixing" this: Is the roundtrip the only thing that matters? Or can xarray accept input and assume it knows best about what the user wanted? Especially in this _Unsigned case, I'm not sure it is a clear answer. Does the user always provide a _FillValue of the output on-disk type? Or is xarray allowed to normalize any fill value to the on-disk type? If it does that normalization, wrote it to a NetCDF file, and read it back in then (I think) it wouldn't have the original _FillValue the user started with.

In the original create_unsigned_masked_scaled_data test helper function the on-disk type was int8 (i1) and the _FillValue was 255 in the encoding dict. @kmuehlbauer changed that fill value to np.int8(-1) which makes sense if the encoding of a Dataset represents what will live on-disk when the data is encoded...but what lives in .attrs["_FillValue"] when it is decoded/loaded from disk? This is confusing.

Case 1: I have 32-bit float data in-memory. I want to write it to disk to fit in an 8-bit signed NetCDF variable with _Unsigned set to true. That signed integer data should use -1 to mark invalid data. So in any .encoding dictionaries .encoding["_FillValue"] should be -1 to match the on-disk type. In .attrs["_FillValue"] should also always be -1, right? Unless the user is reading poorly formatted data, the _FillValue should never be 255. Right?

Maybe I should get more sleep before dealing with CF.

@dcherian
Copy link
Contributor

Seems like the dtype of _FillValue should match the dtype of the array on disk (http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#missing-data)

The scalar attribute with the name _FillValue and of the same type as its variable is recognized by the netCDF library as the value used to pre-fill disk space allocated to the variable.
...
The missing values of a variable with scale_factor and/or add_offset attributes (see Section 8.1, "Packed Data") are interpreted relative to the variable’s external values (a.k.a. the packed values, the raw values, the values stored in the netCDF file), not the values that result after the scale and offset are applied.

@djhoese
Copy link
Contributor

djhoese commented Jul 18, 2024

Yes, but then what is the point of the decode logic currently? The existing logic is to convert _FillValue to the unsigned type that the data has been unpacked as (so the in-memory version). Besides backwards compatibility, if _FillValue does not match the dtype of the in-memory unpacked data then users can't do data == fill_value masking if they turned auto masking off.

@kmuehlbauer
Copy link
Contributor

I've worked on the CF coding part in the past and found it not easy (huh). The idea was to have different coders for the different parts of the CF coding pipeline.

The problem is that a priori knowledge is needed for some Coders depending on other Coders. Here in this example in the decoding path the unsigned representation of _FillValue is needed in subsequent CFMaskCoder.

We can't just run CFMaskCoder before UnsignedIntegerCoder as we would already transform to float there.

Maybe we should handle the complete pipeline for _Unsigned in one Coder?

@dcherian
Copy link
Contributor

we should handle the complete pipeline for _Unsigned in one Coder?

This would be fine.

@djhoese
Copy link
Contributor

djhoese commented Jul 18, 2024

@kmuehlbauer But isn't the CFMaskCoder run first?

    for coder in [
        times.CFDatetimeCoder(),
        times.CFTimedeltaCoder(),
        variables.CFScaleOffsetCoder(),
        variables.CFMaskCoder(),
        variables.UnsignedIntegerCoder(),
        variables.NativeEnumCoder(),
        variables.NonStringCoder(),
        variables.DefaultFillvalueCoder(),
        variables.BooleanCoder(),
    ]:
        var = coder.encode(var, name=name)

I think I agree with the current behavior that the in-memory .attrs["_FillValue"] should match the in-memory array's dtype (uint in our examples so far). Once the Variable/DataArray is encoded then _FillValue in the .attrs will match the on-disk dtype. It looks like _FillValue is never meant to appear in .encoding so maybe I just forget about that extra complication.

@kmuehlbauer
Copy link
Contributor

@djhoese Yes, for encoding. For decoding the other way round.

@djhoese
Copy link
Contributor

djhoese commented Jul 18, 2024

First try: #9258

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments run-upstream Run upstream CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants