Fix script of Arabic Billion Words dataset to return all data #3136

Merged: 5 commits, Oct 22, 2021

Member

@albertvillanova albertvillanova commented Oct 22, 2021

The script has a bug and only parses and generates a portion of the entire dataset.

This PR fixes the loading script so that it properly parses the entire dataset.

The current implementation generates the same number of examples as reported in the original paper for every configuration except one:

  • For "Youm7" we generate more examples (1172136) than the ones reported by the paper (1025027)

| Configuration | Number of examples | Number of examples according to the source |
|---|---|---|
| Alittihad | 349342 | 349342 |
| Almasryalyoum | 291723 | 291723 |
| Almustaqbal | 446873 | 446873 |
| Alqabas | 817274 | 817274 |
| Echoroukonline | 139732 | 139732 |
| Ryiadh | 858188 | 858188 |
| Sabanews | 92149 | 92149 |
| SaudiYoum | 888068 | 888068 |
| Techreen | 314597 | 314597 |
| Youm7 | 1172136 | 1025027 |
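
For anyone who wants to re-check the comparison, here is a minimal sketch that hard-codes the counts listed above and reports which configurations differ from the paper. The names `GENERATED`, `REPORTED`, and `find_mismatches` are hypothetical helpers for illustration only; they are not part of the loading script. In practice the generated counts would presumably come from the length of each loaded split rather than from a hard-coded dictionary.

```python
# Counts produced by the fixed loading script (copied from the PR description).
GENERATED = {
    "Alittihad": 349342,
    "Almasryalyoum": 291723,
    "Almustaqbal": 446873,
    "Alqabas": 817274,
    "Echoroukonline": 139732,
    "Ryiadh": 858188,
    "Sabanews": 92149,
    "SaudiYoum": 888068,
    "Techreen": 314597,
    "Youm7": 1172136,
}

# Counts reported by the source paper: identical except for Youm7.
REPORTED = dict(GENERATED, Youm7=1025027)


def find_mismatches(generated, reported):
    """Return {config: generated - reported} for every config whose counts differ."""
    return {
        name: generated[name] - reported[name]
        for name in generated
        if generated[name] != reported[name]
    }


if __name__ == "__main__":
    for name, diff in find_mismatches(GENERATED, REPORTED).items():
        print(f"{name}: {diff:+d} examples relative to the paper")
```

With the numbers above, only "Youm7" is flagged, with a surplus of 1172136 - 1025027 = 147109 examples.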

Fix #3126.

@albertvillanova albertvillanova merged commit ae181e2 into master Oct 22, 2021
@albertvillanova albertvillanova deleted the fix-3126 branch October 22, 2021 13:28
Development

Successfully merging this pull request may close these issues.

"arabic_billion_words" dataset does not create the full dataset