Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Origin/fix missing features error #5318

Merged

Conversation

eunseojo
Copy link
Contributor

@eunseojo eunseojo commented Dec 1, 2022

This fixes the problem of when the dataset_load function reads a function with "features" provided but some read batches don't have columns that later show up. For instance, the provided "features" requires columns A,B,C but only columns B,C show. This fixes this by adding the column A with nulls.

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Dec 1, 2022

The documentation is not available anymore as the PR was closed or merged.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thanks !

Could you also add a test in tests/packaged_modules/test_json.py ?

You can copy test_json_generate_tables and rename it something like test_json_generate_tables_with_features, and pass explicit features to Json(...)

src/datasets/packaged_modules/json/json.py Outdated Show resolved Hide resolved
@eunseojo
Copy link
Contributor Author

eunseojo commented Dec 2, 2022

please review :) @lhoestq @ola13 thankoo

@lhoestq
Copy link
Member

lhoestq commented Dec 2, 2022

Thanks :) I just updated the test to make sure it works even when there's a column missing, and did a minor change to json.py to add the missing columns for the other kinds of JSON files as well (I moved the code toself._cast_table)

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this ! Also cc @mariosasko does it sound good to you as well ?

@ola13
Copy link

ola13 commented Dec 2, 2022

Thanks Unso! If @lhoestq is happy then I'm also happy :D

@eunseojo eunseojo merged commit 001d8b9 into huggingface:main Dec 4, 2022
@mariosasko
Copy link
Collaborator

When I noticed the ping, this PR had already been merged...

Luckily, PyArrow's read_json behaves the same when explicit_schema is given via ParseOptions, so I'm okay with this change (our JSON loader doesn't use read_json for decoding JSON in some scenarios, so this manual approach is the right one).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants