cf-coding #7654

kmuehlbauer · 2023-03-21T10:34:44Z

xref Saving and loading an array of strings changes datatype to object #7652
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst
This PR includes:
- implement coders
- preserve boolean-dtype within encoding, adapt test
- determine cf packed data type
- ~~transform numpy object-dtype strings (vlen) to numpy unicode strings~~

This should also fix:

Closes nan values appearing when saving and loading from netCDF due to encoding #7691
Closes float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray #2304
and possibly other issues that involve the float32/float64 problem
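To illustrate the float32/float64 problem referenced in those issues, here is a minimal sketch in plain NumPy. It is not xarray's decoding code; the packed values, `scale_factor` and `add_offset` are invented for illustration. It only shows that unpacking int16 data in single precision gives slightly different values than unpacking in double precision.

```python
import numpy as np

# Hypothetical packed data; values and attributes are invented for illustration.
packed = np.array([1234, -5678, 32000], dtype=np.int16)
scale_factor = 0.01
add_offset = 273.15

# Unpack once in single precision and once in double precision.
decoded32 = packed.astype(np.float32) * np.float32(scale_factor) + np.float32(add_offset)
decoded64 = packed.astype(np.float64) * scale_factor + add_offset

# float32 carries only about 7 significant digits, so the results differ slightly.
max_abs_diff = np.abs(decoded32.astype(np.float64) - decoded64).max()
print(decoded32.dtype, decoded64.dtype, max_abs_diff)
```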

kmuehlbauer · 2023-03-21T10:35:38Z

I'll add to whats-new.rst if an updated version is merged.

kmuehlbauer · 2023-03-21T13:47:45Z

The one failed run might be spurious. Maybe a re-run of this would work.

basnijholt · 2023-03-22T04:41:05Z

Thanks a lot for the quick PR!

I can confirm that this fixes

Saving and loading an array of strings changes datatype to object #7652 (comment) (bool -> int8)

But not int64 -> int32, and <U1 -> O

Saving and loading an array of strings changes datatype to object #7652 (comment)
Saving and loading an array of strings changes datatype to object #7652 (comment)

kmuehlbauer · 2023-03-22T06:58:04Z

@basnijholt Thanks for testing. I can't reproduce this. Here everything works as expected. But I've a slightly different environment (full netcdf4-stack, latest versions of everything).

The tests which check for dtype equality (vlen case) do not raise in this PR for any backend or array container, so I'm assuming this should work as expected (at least for bool -> int8 and <U1 -> O).

As stated over at #7652, I can't reproduce the int64 -> int32 conversion in any of the environments I've tested so far. I'll have another look at your environment.

kmuehlbauer

I've added some explanations and concerns. I'd very much appreciate comments and guidance here.

Pinging @shoyer and @max-sixty as authors of relevant code.

kmuehlbauer · 2023-03-22T08:24:50Z

But not int64 -> int32, and <U1 -> O

@basnijholt I've explained where these issues come from over at your issue #7652 (comment).

kmuehlbauer · 2023-03-23T12:27:45Z

XRef: #2040 (comment)

Citing @shoyer from above comment:

The main reason why we don't do any special handling for object arrays currently in xarray is that our conventions coding/decoding system has no way of marking variable length string arrays. We should probably handle this by making a custom dtype like h5py that marks variable length strings using dtype metadata: http://docs.h5py.org/en/latest/special.html#variable-length-strings
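As a rough sketch of the idea in that quote, NumPy allows attaching metadata to a dtype, so an object dtype can be tagged as "variable-length str". The `"vlen"` key and the `is_vlen_string` helper are assumptions made for illustration; this is neither xarray's nor h5py's actual implementation.

```python
import numpy as np

# An object dtype tagged as "variable-length str" via dtype metadata.
# The "vlen" key is an assumption chosen for this sketch.
vlen_str = np.dtype(object, metadata={"vlen": str})


def is_vlen_string(dtype: np.dtype) -> bool:
    """Hypothetical helper: True for an object dtype tagged as variable-length str."""
    meta = dtype.metadata or {}
    return dtype.kind == "O" and meta.get("vlen") is str


data = np.array(["a", "bcd", "efghi"], dtype=vlen_str)
print(is_vlen_string(data.dtype))        # dtype carrying the tag
print(is_vlen_string(np.dtype(object)))  # plain object dtype, no tag
```

Note that NumPy does not guarantee that metadata survives every operation, which is one reason such a marker needs care.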

Another interesting issue by @shoyer:
#2059

I'm really uncertain if using .astype here is the right way to go. Any comments appreciated.

kmuehlbauer · 2023-03-24T11:28:14Z

And yet another related issue: #1647

Currently both netcdf/h5netcdf are able to set a _FillValue for VLEN strings, but on the numpy side "NaN" can only be handled with dtype=object. Maybe it's time to consolidate string handling in xarray, but that should be taken care of in a separate issue / feature branch.
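The numpy-side limitation can be seen with a small sketch in plain NumPy (not xarray code): a fixed-width unicode array cannot hold a real NaN, while an object array can.

```python
import numpy as np

# With a fixed-width unicode dtype, NaN is silently converted to the text "nan".
fixed = np.array(["a", np.nan])

# With object dtype, the float NaN survives as a distinct missing-value marker.
flexible = np.array(["a", np.nan], dtype=object)

print(fixed.dtype.kind, repr(fixed[1]))
print(flexible.dtype.kind, repr(flexible[1]))
```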

dcherian · 2023-03-24T15:35:58Z

Any comments appreciated.

Let's discuss at next week's meeting. @kmuehlbauer can you make it? 9.30am MT Wed 29 Mar 2023

kmuehlbauer · 2023-03-24T16:21:06Z

make

That's 15:30 UTC, or 17:30 CEST (my local time). Should work for me. I'll try to collect all available information on that topic and the current status.

basnijholt · 2023-03-24T16:27:01Z

Just leaving a note here. I would expect the datatype that is loaded to be the datatype that was saved. So if I save a string array of, e.g., type <U5, I expect it to still be <U5 when loaded, not suddenly an HDF5 VLEN type.

Thanks again @kmuehlbauer for digging into this problem and all your work! 😄

kmuehlbauer · 2023-03-31T11:13:49Z

@dcherian @basnijholt

After the dev-meeting I've taken a step back and first implemented the coders as mentioned in @shoyer's ToDo.

I've fixed the one bool->int issue, and the dtype for ScaleOffset coding is now derived from scale_factor/add_offset.

I've improved some tests with regard to the scale/offset issue.
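One way to picture deriving the dtype from the packing attributes is the sketch below. It assumes that the floating-point type of `scale_factor`/`add_offset` should decide the unpacked dtype; `packed_float_dtype` is a hypothetical helper, not the function in this PR.

```python
import numpy as np


def packed_float_dtype(scale_factor=None, add_offset=None, default=np.float64):
    """Hypothetical helper: pick the unpacked float dtype from the packing attributes."""
    # Test against None explicitly, because 0 is a legitimate add_offset.
    given = [np.asarray(v).dtype for v in (scale_factor, add_offset) if v is not None]
    floats = [dt for dt in given if dt.kind == "f"]
    if not floats:
        # No floating-point attribute available: fall back to the default.
        return np.dtype(default)
    # If both attributes are floats, use the wider of the two types.
    return np.result_type(*floats)


print(packed_float_dtype(np.float32(0.01), np.float32(0)))
print(packed_float_dtype(np.float32(0.01), np.float64(10.0)))
print(packed_float_dtype())
```

Plain Python floats count as float64 here, so attributes that should select float32 have to be given as `np.float32` values.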

I'll concentrate on the string fillvalue issues in a follow up PR.

kmuehlbauer · 2023-04-01T08:46:49Z

@dcherian @Illviljan Thanks for the first round of review. I've rebased everything on latest main. Now the code move from conventions.py to coding/variables.py is correct. I've also removed the functions which have been converted to VariableCoders and adapted the tests.

To sum up this PR, it does:

convert functions to VariableCoders along @shoyer's TODO:

xarray/xarray/conventions.py

Lines 298 to 302 in 1c81162

    
    # TODO(shoyer): convert all of these to use coders, too:
    var = maybe_encode_nonstring_dtype(var, name=name)
    var = maybe_default_fill_value(var)
    var = maybe_encode_bools(var)
    var = ensure_dtype_not_object(var, name=name)

xarray/xarray/conventions.py

Lines 393 to 405 in 1c81162

    
    # TODO(shoyer): convert everything below to use coders

    if decode_endianness and not data.dtype.isnative:
        # do this last, so it's only done if we didn't already unmask/scale
        data = NativeEndiannessArray(data)
        original_dtype = data.dtype

    encoding.setdefault("dtype", original_dtype)

    if "dtype" in attributes and attributes["dtype"] == "bool":
        del attributes["dtype"]
        data = BoolTypeArray(data)

preserve boolean dtype within encoding: Saving and loading an array of strings changes datatype to object #7652 (comment)

determine cf packed dtype from scale_factor/add_offset: nan values appearing when saving and loading from netCDF due to encoding #7691, float32 instead of float64 when decoding int16 with scale_factor netcdf var using xarray #2304
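The boolean round trip can be sketched in plain NumPy as follows. The `"dtype": "bool"` marker and the int8 storage type mirror what is mentioned in this thread, but `encode_bool` and `decode_bool` are hypothetical stand-ins, not the VariableCoder added in this PR.

```python
import numpy as np


def encode_bool(data, attrs):
    """Store booleans as int8 and record the original dtype in the attributes."""
    if data.dtype == np.bool_:
        attrs = {**attrs, "dtype": "bool"}
        data = data.astype("i1")
    return data, attrs


def decode_bool(data, attrs):
    """Restore booleans if the attributes carry the "bool" marker."""
    if attrs.get("dtype") == "bool":
        attrs = {k: v for k, v in attrs.items() if k != "dtype"}
        data = data.astype(bool)
    return data, attrs


original = np.array([True, False, True])
stored, stored_attrs = encode_bool(original, {})
restored, restored_attrs = decode_bool(stored, stored_attrs)
print(stored.dtype, stored_attrs, restored.dtype, restored_attrs)
```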

kmuehlbauer · 2023-04-01T09:48:57Z

@Illviljan I'm not able to figure out the typing when I use data types as functions to convert Python numbers to array scalars. If you have any suggestions on how to fix this, please let me know.
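For context, the runtime pattern in question looks roughly like the sketch below. `as_array_scalar` is a hypothetical helper, and the sketch only shows the conversion itself; it does not resolve the static typing question.

```python
import numpy as np


def as_array_scalar(value, dtype):
    """Hypothetical helper: convert a Python number to a NumPy array scalar."""
    # np.dtype(...).type is the scalar class (e.g. np.float32), which is callable.
    scalar_type = np.dtype(dtype).type
    return scalar_type(value)


x = as_array_scalar(0.1, "float32")
y = as_array_scalar(3, np.int16)
print(type(x).__name__, type(y).__name__)
```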

dcherian · 2023-04-01T13:27:37Z

xarray/coding/variables.py

-def _choose_float_dtype(dtype: np.dtype, has_offset: bool) -> type[np.floating[Any]]:
-    """Return a float dtype that can losslessly represent `dtype` values."""
-    # Keep float32 as-is.  Upcast half-precision to single-precision,
+def _choose_float_dtype(


@mankoff do you have time to take a look here please?

mankoff · 2023-04-01

Hi @dcherian. I'm not sure what to look for. I will link to my open-but-stale PR that tried to start addressing this/similar issues: #6812

Perhaps also relevant are my last few comments on #2304 (see #2304 (comment)). The problem for me is that (1) the CF standards are ambiguously defined and (2) xarray needs to address the many use cases where the CF standards are not followed (usually this means different data types).

dcherian · 2023-04-01

Thanks! Does this PR fix your original issue?

kmuehlbauer · 2023-04-01

Thanks @mankoff, I'll have a look at your extensive notes over there and try to come up with something.

kmuehlbauer · 2023-04-04T14:10:33Z

Still hunting for corner cases and issues inside encode_cf_variable/decode_cf_variable.

I can see some light at the end of the tunnel. I'm not sure this is the last iteration, but the test suite is still green with the added and enhanced tests, which is encouraging.

Unfortunately #2304 is still an issue for now. I'll clarify that later with an added test.

kmuehlbauer · 2023-04-05T05:46:15Z

@dcherian Just a heads-up: I find this PR getting more and more involved at different parts of the machinery and hard to follow for reviewers. I'll split this up and start with the more or less undisputed changes.

kmuehlbauer · 2023-04-05T06:15:58Z

As explained, I've created two PRs (#7719 and #7720) for the "easy" changes from this PR. It would be great if those could go in quickly. Thanks!

kmuehlbauer · 2023-05-08T18:09:59Z

I've converted to draft for now, as I'm still evaluating solutions for the CF encoding/decoding.

kmuehlbauer · 2023-11-17T20:57:30Z

I'll close this one. Most things have been addressed in other PRs.

github-actions bot added the topic-CF conventions label Mar 21, 2023

kmuehlbauer mentioned this pull request Mar 21, 2023

Saving and loading an array of strings changes datatype to object #7652

Closed

kmuehlbauer changed the title ~~preserve boolean-dtype within encoding~~ preserve dtypes (bool, vlen string) within encoding Mar 21, 2023

kmuehlbauer commented Mar 22, 2023

xarray/conventions.py Outdated

xarray/tests/test_backends.py

xarray/conventions.py Outdated

xarray/tests/test_backends.py Outdated

kmuehlbauer force-pushed the preserve-bool-dtype branch from 4b65ffd to 5901354 Compare March 23, 2023 12:02

dcherian added the needs discussion label Mar 24, 2023

kmuehlbauer force-pushed the preserve-bool-dtype branch 2 times, most recently from 64ef0b9 to b5dad00 Compare March 31, 2023 11:03

kmuehlbauer changed the title ~~preserve dtypes (bool, vlen string) within encoding~~ cf-coding Mar 31, 2023

kmuehlbauer mentioned this pull request Mar 31, 2023

nan values appearing when saving and loading from netCDF due to encoding #7691

Closed

4 tasks

kmuehlbauer closed this Mar 31, 2023

kmuehlbauer reopened this Mar 31, 2023

Illviljan reviewed Mar 31, 2023

xarray/tests/test_backends.py Outdated

xarray/coding/variables.py Outdated

xarray/coding/variables.py Outdated

xarray/coding/variables.py Outdated

kmuehlbauer force-pushed the preserve-bool-dtype branch from 2e4a81d to fc29432 Compare April 1, 2023 08:35

kmuehlbauer mentioned this pull request Apr 1, 2023

Decoding netCDF is giving incorrect values for a large file #5597

Closed

dcherian reviewed Apr 1, 2023

xarray/tests/test_backends.py Outdated

kmuehlbauer and others added 15 commits April 4, 2023 15:43

use scale_factor/add_offset in tests as specified by cf conventions

b6310e4

add test which checks add_offset being not conforming to cf standards

06ec16a

[pre-commit.ci] auto fixes from pre-commit.com hooks

19ef234

for more information, see https://pre-commit.ci

convert to float32 to keep pydata#1840 in sync

e877aa7

[pre-commit.ci] auto fixes from pre-commit.com hooks

8cc7319

add more comments, add more typing

6a73653

[pre-commit.ci] auto fixes from pre-commit.com hooks

d6887cd

add additional test, make _choose_float_dtype more explicit

c8c3f14

Apply suggestions from code review

2f5709a

typing by @Illviljan Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

f215d01

Apply suggestions from code review

c5cd53d

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>

Test against None, as 0 (False) is a valid input for add_offset. Check for float32/64 only.

986ffc6

apply astype in the right order to prevent loss of precision, more fillvalue fixes

51f7515

add mask/scale roundtrip test for variable cf encoding/decoding

8c074f0

separate CFMaskCoder test for multiple missing_values/__FillValues conflict, use _FillValue for missing_value if available

031cac5

kmuehlbauer force-pushed the preserve-bool-dtype branch from aeed54f to 031cac5 Compare April 4, 2023 13:44

This was referenced Apr 5, 2023

Implement more Variable Coders #7719

Merged

preserve boolean dtype in encoding #7720

Merged

kmuehlbauer marked this pull request as draft May 8, 2023 18:08

tomwhite mentioned this pull request May 22, 2023

vcf_to_zarr creates zero-sized first chunk which results in incorrect dtype. sgkit-dev/sgkit#1090

Open

ghiggi mentioned this pull request May 23, 2023

open_dataset with chunks="auto" fails when a netCDF4 variables/coordinates is encoded as NC_STRING #7868

Closed

bdestombe mentioned this pull request Aug 25, 2023

getting an extra layer when reading geotop from cache gwmod/nlmod#218

Closed

kmuehlbauer closed this Nov 17, 2023

kmuehlbauer mentioned this pull request Feb 6, 2024

correctly encode/decode _FillValues/missing_values/dtypes for packed data #8713

Merged

5 tasks

Thomas-Z mentioned this pull request Apr 18, 2024

netCDF encoding and decoding issues. #8957

Open

5 tasks
