
More robust None handling #3195

Merged (38 commits) on Dec 9, 2021
Conversation

@mariosasko (Collaborator) commented Nov 2, 2021

PyArrow has explicit support for null values, so it makes sense to support Nones on our side as well.

Colab Notebook with examples

Changes:

  • allow None for the feature types with special encoding (ClassLabel, TranslationVariableLanguages, Value, _ArrayXD)
  • handle None in class_encode_column (also there is an option to stringify Nones and treat them as a class)
  • support None sorting in sort (use pandas for that)
  • handle None in align_labels_with_mapping
  • support for None in ArrayXD (converts None to np.nan to align the behavior with PyArrow)
  • support for None in the Audio/Image feature
  • allow promotion when concatenating tables (pa.concat_tables(table_list, promote=True)) and null row/column broadcasting similar to pandas
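The "null row/column broadcasting similar to pandas" in the last point can be sketched with pandas itself; this shows the intended semantics, not the PR's implementation:

```python
import pandas as pd

# Concatenating frames with mismatched columns: pandas fills the missing
# cells with nulls instead of raising, which is the broadcasting behavior
# that promote=True mimics when concatenating Arrow tables
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": ["x", "y"]})

merged = pd.concat([df1, df2], ignore_index=True)
print(merged.shape)                   # (4, 2)
print(merged["b"].isna().tolist())    # [True, True, False, False]
```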

Additional notes:

  • use null instead of none for function arguments for consistency with existing disable_nullable
  • fixes a bug with the update_metadata_with_features call in Dataset.rename_columns
  • had to update some tests, let me know if that's ok

TODO:

  • check how the Audio feature behaves with Nones
  • Better None handling in concatenate_datasets/add_item
  • Fix formatting with Nones
  • Add Colab with examples
  • Tests

TODOs for subsequent PRs:

  • Mention None handling in the docs
  • Add drop_null/fill_null to Dataset/DatasetDict

Fix #3181 #3253

@lhoestq (Member) commented Nov 4, 2021

I also created a PR regarding disable_nullable, which must always be False by default in order to always allow None values: #3211

@lhoestq (Member) left a comment

Looking good !
I just have a question about padding when concatenating datasets or adding a column, and some nitpicks:

Review threads on src/datasets/arrow_dataset.py (×2) and src/datasets/features/features.py (all resolved).
@mariosasko (Collaborator, Author)

@lhoestq I addressed your comments, added tests, did some refactoring to make the implementation cleaner and added support for None values in map transforms when the feature type is ArrayXD (previously, I only implemented None decoding).

My only concern is that during decoding ArrayXD arrays with None values will be auto-casted to float64 to allow np.nan insertion and this might be unexpected if dtype is not float, so one option would be to allow None values only if the storage type is float32 or float64. Let me know WDYT would be the most consistent behavior here.
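The cast being discussed can be reproduced with plain NumPy: an integer array has no representation for np.nan, so inserting one requires casting the whole array to a float dtype first (illustrative, not code from the PR):

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int32)

# Cast to float64 first, then nan insertion works
floats = ints.astype(np.float64)
floats[1] = np.nan
print(floats.dtype)  # float64

# Assigning nan into the int32 array directly raises a ValueError
try:
    ints[1] = np.nan
except ValueError:
    print("cannot insert nan into an integer array")
```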

@mariosasko mariosasko marked this pull request as ready for review November 22, 2021 17:27
@mariosasko mariosasko linked an issue Nov 23, 2021 that may be closed by this pull request
@lhoestq (Member) commented Nov 24, 2021

Cool ! :D

My only concern is that during decoding ArrayXD arrays with None values will be auto-casted to float64 to allow np.nan insertion and this might be unexpected if dtype is not float, so one option would be to allow None values only if the storage type is float32 or float64. Let me know WDYT would be the most consistent behavior here.

Yes that makes sense to only fill with nan if the type is compatible

@mariosasko (Collaborator, Author)

After some more experimenting, I think we can keep auto-cast to float because PyArrow also does it:

import numpy as np
import pyarrow as pa

# None present: the int32 array is converted to float64 so the null becomes np.nan
arr = pa.array([1, 2, 3, 4, None], type=pa.int32()).to_numpy(zero_copy_only=False)
assert arr.dtype == np.float64

Additional changes:

  • fixes a bug in the _is_zero_copy_only implementation for the ArrayXD types. Previously, _is_zero_copy_only would always return False for these types. Still have to check whether it's possible to optimize copying of the non-extension types (Sequence, ...), but I plan to work on that in a separate PR.
  • Allow dynamic first dimension for ArrayXD #2891 introduced a bug where the dtype of ArrayXD wouldn't be preserved due to the to_pylist call in the NumPy formatter (np.array(np.array(..).tolist()) doesn't necessarily preserve the dtype of the initial array), so I'm also fixing that.
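The dtype loss from that tolist round-trip is easy to reproduce with NumPy alone (illustrative, not the formatter's code): converting to a Python list discards the dtype, and re-wrapping picks the platform default integer type.

```python
import numpy as np

original = np.array([1, 2, 3], dtype=np.int8)
round_tripped = np.array(original.tolist())

print(original.dtype)       # int8
print(round_tripped.dtype)  # platform default int, not int8
```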

@lhoestq (Member) left a comment

Very nice ! My final comments:

Review threads on src/datasets/arrow_dataset.py (×2) and src/datasets/formatting/formatting.py (resolved).
@lhoestq (Member) left a comment

LGTM ! Thank you :)

@lhoestq (Member) commented Dec 9, 2021

The CI failure on Windows is unrelated to this PR, merging.

Successfully merging this pull request may close these issues.

  • GeneratorBasedBuilder does not support None values
  • None converted to "None" when loading a dataset