Fix script of Arabic Billion Words dataset to return all data #3136

albertvillanova · 2021-10-22T09:14:24Z

The script has a bug and only parses and generates a portion of the entire dataset.

This PR fixes the loading script so that it properly parses the entire dataset.
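As a rough illustration of how a parsing pattern can silently drop articles, here is a minimal sketch. It is not the actual loading script: the sample markup, the tag names, and the `extract` helper are all hypothetical, and the tolerant pattern is only one possible way to cope with a misspelled closing tag.

```python
import re

# Hypothetical sample data: the second article closes <Text> with a misspelled tag.
SAMPLE = (
    "<Article><Headline>A</Headline><Text>first</Text></Article>\n"
    "<Article><Headline>B</Headline><Text>second</Tex></Article>"
)


def extract(article, tag, tolerant):
    """Return the content of `tag` in `article`, or None if the pattern fails."""
    if tolerant:
        # Accept any closing tag, so a misspelled one still ends the field.
        pattern = rf"<{tag}>(.*?)</[^>]*>"
    else:
        # Require the closing tag to match the opening tag exactly.
        pattern = rf"<{tag}>(.*?)</{tag}>"
    match = re.search(pattern, article, flags=re.DOTALL)
    return match.group(1) if match else None


articles = re.findall(r"<Article>(.*?)</Article>", SAMPLE, flags=re.DOTALL)

strict = [extract(a, "Text", tolerant=False) for a in articles]
lenient = [extract(a, "Text", tolerant=True) for a in articles]

# Drop the articles whose text could not be extracted.
strict_texts = [t for t in strict if t is not None]
lenient_texts = [t for t in lenient if t is not None]

print(len(strict_texts), len(lenient_texts))
```

In this sketch the strict pattern yields one text and the tolerant pattern yields two, which is the kind of gap that makes a script return only part of a dataset.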

With this fix, the script generates the same number of examples as reported in the original paper for all configurations except one:

For "Youm7", we generate more examples (1172136) than the paper reports (1025027).

Configuration	Number of examples	Number of examples according to the source
Alittihad	349342	349342
Almasryalyoum	291723	291723
Almustaqbal	446873	446873
Alqabas	817274	817274
Echoroukonline	139732	139732
Ryiadh	858188	858188
Sabanews	92149	92149
SaudiYoum	888068	888068
Techreen	314597	314597
Youm7	1172136	1025027
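The comparison in the table can be re-checked with a few lines of Python. This is only a sketch: the counts are copied by hand from the table above, and the `COUNTS` and `mismatches` names are illustrative rather than part of the loading script.

```python
# (generated, reported by the source) per configuration, copied from the table.
COUNTS = {
    "Alittihad": (349342, 349342),
    "Almasryalyoum": (291723, 291723),
    "Almustaqbal": (446873, 446873),
    "Alqabas": (817274, 817274),
    "Echoroukonline": (139732, 139732),
    "Ryiadh": (858188, 858188),
    "Sabanews": (92149, 92149),
    "SaudiYoum": (888068, 888068),
    "Techreen": (314597, 314597),
    "Youm7": (1172136, 1025027),
}

# Keep only configurations where the two counts differ, with the surplus.
mismatches = {
    name: generated - reported
    for name, (generated, reported) in COUNTS.items()
    if generated != reported
}

print(mismatches)  # {'Youm7': 147109}
```

Only "Youm7" differs, with 147109 more examples generated than the source reports.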

albertvillanova added 5 commits October 22, 2021 08:33

Fix arabic_billion_words to extract all examples

2046842

Fix pattern for misspelled tags and return all articles

27979d6

Update metadata JSON

8e884ee

Update dataset card

89aadc6

Update dataset card

c875735

albertvillanova merged commit ae181e2 into master Oct 22, 2021

albertvillanova deleted the fix-3126 branch October 22, 2021 13:28
