
More robust None handling #3195

Merged (38 commits) on Dec 9, 2021
Conversation

@mariosasko (Collaborator) commented Nov 2, 2021

PyArrow has explicit support for null values, so it makes sense to support Nones on our side as well.

Colab Notebook with examples

Changes:

  • allow None for the feature types with special encoding (ClassLabel, TranslationVariableLanguages, Value, _ArrayXD)
  • handle None in class_encode_column (also there is an option to stringify Nones and treat them as a class)
  • support None sorting in sort (use pandas for that)
  • handle None in align_labels_with_mapping
  • support for None in ArrayXD (converts None to np.nan to align the behavior with PyArrow)
  • support for None in the Audio/Image feature
  • allow promotion when concatenating tables (pa.concat_tables(table_list, promote=True)) and null row/column broadcasting similar to pandas
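The "null row/column broadcasting similar to pandas" in the last point can be sketched with pandas itself; this shows the intended semantics, not the PR's implementation:

```python
import pandas as pd

# Concatenating frames with mismatched columns: pandas fills the missing
# cells with nulls instead of raising, which is the broadcasting behavior
# that promote=True mimics when concatenating Arrow tables
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": ["x", "y"]})

merged = pd.concat([df1, df2], ignore_index=True)
print(merged.shape)                   # (4, 2)
print(merged["b"].isna().tolist())    # [True, True, False, False]
```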

Additional notes:

  • use null instead of none for function arguments for consistency with existing disable_nullable
  • fixes a bug with the update_metadata_with_features call in Dataset.rename_columns
  • had to update some tests, let me know if that's ok

TODO:

  • check how the Audio feature behaves with Nones
  • Better None handling in concatenate_datasets/add_item
  • Fix formatting with Nones
  • Add Colab with examples
  • Tests

TODOs for subsequent PRs:

  • Mention None handling in the docs
  • Add drop_null/fill_null to Dataset/DatasetDict

Fix #3181 #3253

@lhoestq (Member) commented Nov 4, 2021

I also created a PR regarding disable_nullable, which must always be False by default in order to always allow None values: #3211

@lhoestq (Member) left a comment

Looking good !
I just have a question about padding when concatenating datasets or adding a column, and some nitpicks:

Review threads on src/datasets/arrow_dataset.py (×2) and src/datasets/features/features.py (all resolved).
@mariosasko (Collaborator, Author)

@lhoestq I addressed your comments, added tests, did some refactoring to make the implementation cleaner and added support for None values in map transforms when the feature type is ArrayXD (previously, I only implemented None decoding).

My only concern is that during decoding ArrayXD arrays with None values will be auto-casted to float64 to allow np.nan insertion and this might be unexpected if dtype is not float, so one option would be to allow None values only if the storage type is float32 or float64. Let me know WDYT would be the most consistent behavior here.
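The cast being discussed can be reproduced with plain NumPy: an integer array has no representation for np.nan, so inserting one requires casting the whole array to a float dtype first (illustrative, not code from the PR):

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int32)

# Cast to float64 first, then nan insertion works
floats = ints.astype(np.float64)
floats[1] = np.nan
print(floats.dtype)  # float64

# Assigning nan into the int32 array directly raises a ValueError
try:
    ints[1] = np.nan
except ValueError:
    print("cannot insert nan into an integer array")
```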

@mariosasko mariosasko marked this pull request as ready for review November 22, 2021 17:27
@mariosasko mariosasko linked an issue Nov 23, 2021 that may be closed by this pull request
@lhoestq (Member) commented Nov 24, 2021

Cool ! :D

My only concern is that during decoding ArrayXD arrays with None values will be auto-casted to float64 to allow np.nan insertion and this might be unexpected if dtype is not float, so one option would be to allow None values only if the storage type is float32 or float64. Let me know WDYT would be the most consistent behavior here.

Yes that makes sense to only fill with nan if the type is compatible

@mariosasko (Collaborator, Author)

After some more experimenting, I think we can keep auto-cast to float because PyArrow also does it:

import numpy as np
import pyarrow as pa

# None present: the int32 array is converted to float64 so the null becomes np.nan
arr = pa.array([1, 2, 3, 4, None], type=pa.int32()).to_numpy(zero_copy_only=False)
assert arr.dtype == np.float64

Additional changes:

  • fixes a bug in the _is_zero_copy_only implementation for the ArrayXD types. Previously, _is_zero_copy_only would always return False for these types. Still have to check whether it's possible to optimize copying of the non-extension types (Sequence, ...), but I plan to work on that in a separate PR.
  • Allow dynamic first dimension for ArrayXD #2891 introduced a bug where the dtype of ArrayXD wouldn't be preserved due to the to_pylist call in the NumPy formatter (np.array(np.array(..).tolist()) doesn't necessarily preserve the dtype of the initial array), so I'm also fixing that.
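The dtype loss from that tolist round-trip is easy to reproduce with NumPy alone (illustrative, not the formatter's code): converting to a Python list discards the dtype, and re-wrapping picks the platform default integer type.

```python
import numpy as np

original = np.array([1, 2, 3], dtype=np.int8)
round_tripped = np.array(original.tolist())

print(original.dtype)       # int8
print(round_tripped.dtype)  # platform default int, not int8
```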

@lhoestq (Member) left a comment

Very nice ! My final comments:

Review threads on src/datasets/arrow_dataset.py (×2) and src/datasets/formatting/formatting.py (resolved).
@lhoestq (Member) left a comment

LGTM ! Thank you :)

@lhoestq (Member) commented Dec 9, 2021

The CI failure on Windows is unrelated to this PR, merging.

Successfully merging this pull request may close these issues.

  • GeneratorBasedBuilder does not support None values
  • None converted to "None" when loading a dataset