Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OSCAR dataset card #1833

Merged
merged 10 commits into from
Feb 12, 2021
Merged

Conversation

pjox
Copy link
Contributor

@pjox pjox commented Feb 8, 2021

I added more information and completed the dataset card for OSCAR which was started by @lhoestq in his previous PR.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thank you :)

I added a few suggestions.

I'm also generating the tags that go at the top of the dataset card

datasets/oscar/README.md Outdated Show resolved Hide resolved
datasets/oscar/README.md Show resolved Hide resolved
datasets/oscar/README.md Show resolved Hide resolved
datasets/oscar/README.md Outdated Show resolved Hide resolved
datasets/oscar/README.md Outdated Show resolved Hide resolved
@lhoestq lhoestq mentioned this pull request Feb 8, 2021
@pjox
Copy link
Contributor Author

pjox commented Feb 8, 2021

@lhoestq Thanks for the suggestions! I agree with all of them. Should I accept them one by one or can I accept them all at once? When I try to load the whole diff GitHub is complaining and it does no render them well (probably my browser?) 😅

pjox and others added 6 commits February 8, 2021 14:49
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@pjox
Copy link
Contributor Author

pjox commented Feb 8, 2021

I just merged the tables as suggested 😄 . However I noticed something weird, the train sizes are identical for both the original and deduplicated files ... This is not normal, in general the original files are almost twice as big as the deduplicated ones 🤔

@lhoestq
Copy link
Member

lhoestq commented Feb 9, 2021

Good catch @pjox ! I just checked and this is because the scripts doesn't handle having several blank lines in a row.
Blank lines introduced by deduplication are currently not ignored so we end up with the same number of examples in the dataset as the original version (but with empty examples...)
I fixed that in this commit. I'm re-running the metadata generation for deduplicated configs.

@lhoestq
Copy link
Member

lhoestq commented Feb 11, 2021

I got the new sizes today, will update the dataset_infos.json and the dataset card tomorrow

@cahya-wirawan
Copy link
Contributor

I got the new sizes today, will update the dataset_infos.json and the dataset card tomorrow

great, I just wanted to report that I got error message "NonMatchingSplitsSizesError" when I tried to load one of the oscar dataset.

@lhoestq
Copy link
Member

lhoestq commented Feb 12, 2021

Hi @cahya-wirawan, which configuration of oscar do you have this issue with ?

@lhoestq
Copy link
Member

lhoestq commented Feb 12, 2021

Ok I see you're having this issue because I haven't updated the sizes yet ! I'm opening a PR

I just checked and indeed there's an issue with the deduplicated configurations since the commit I mentioned above.
I'm fixing this by using the new sizes I got yesterday :)

@lhoestq
Copy link
Member

lhoestq commented Feb 12, 2021

I just updated the size in the table @pjox it should be good now :)
I also updated the sizes in the dataset_infos.json in #1868 (merged)

@cahya-wirawan
Copy link
Contributor

cahya-wirawan commented Feb 12, 2021

Thanks @lhoestq for fixing the issue, it works now

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright this is all good then ! Thanks a lot @pjox

@lhoestq lhoestq merged commit f9df773 into huggingface:master Feb 12, 2021
@pjox
Copy link
Contributor Author

pjox commented Feb 12, 2021

Thank you so much @lhoestq !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants