Commit

Add language_bcp47 tag (#4753)
* fix ms_terms

* fix ted_talks_iwslt

* move tags to language_bcp47

* fix bad codes

* add language_details and language_bcp47 in DatasetMetadata

* more fixes

* fix wiki_dpr
lhoestq authored Jul 27, 2022
1 parent f5847a3 commit aa48a29
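
The "move tags to language_bcp47" step follows a pattern that is visible in the hunks below: a tag with subtags (for example `fy-NL` or `zh-CN`) is listed under `language_bcp47`, and its primary subtag (`fy`, `zh`) is listed under `language`. The following is a minimal sketch of that split, not the script used for this commit. The function name `split_language_tags` is hypothetical, and treating everything before the first hyphen as the plain language code is an assumption.

```python
def split_language_tags(tags):
    """Split mixed tags into (language, language_bcp47) lists.

    Assumption: a tag without "-" is a plain language code; a tag with
    subtags is kept whole under language_bcp47, and its primary subtag
    is also listed under language.
    """
    language = set()
    language_bcp47 = set()
    for tag in tags:
        primary = tag.split("-", 1)[0]  # "fy-NL" -> "fy", "ta" -> "ta"
        language.add(primary)
        if primary != tag:
            language_bcp47.add(tag)
    # The README lists in this commit are alphabetically sorted.
    return sorted(language), sorted(language_bcp47)


language, language_bcp47 = split_language_tags(
    ["fy-NL", "sv-SE", "zh-CN", "zh-HK", "ta"]
)
print(language)
print(language_bcp47)
```

Sets are used so that a primary subtag shared by several tags (such as `zh` for `zh-CN` and `zh-HK`) appears only once under `language`.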
Showing 60 changed files with 641 additions and 575 deletions.
2 changes: 2 additions & 0 deletions datasets/ai2_arc/README.md
@@ -4,6 +4,8 @@ annotations_creators:
language_creators:
- found
language:
- en
language_bcp47:
- en-US
license:
- cc-by-sa-4.0
2 changes: 2 additions & 0 deletions datasets/arcd/README.md
@@ -4,6 +4,8 @@ annotations_creators:
language_creators:
- crowdsourced
language:
- ar
language_bcp47:
- ar-SA
license:
- mit
2 changes: 1 addition & 1 deletion datasets/bbaw_egyptian/README.md
@@ -5,8 +5,8 @@ language_creators:
- found
language:
- de
- en
- egy
- en
license:
- cc-by-4.0
multilinguality:
4 changes: 2 additions & 2 deletions datasets/blbooks/README.md
@@ -4,10 +4,10 @@ annotations_creators:
language_creators:
- machine-generated
language:
- en
- fr
- de
- en
- es
- fr
- it
- nl
license:
2 changes: 1 addition & 1 deletion datasets/blbooksgenre/README.md
@@ -5,8 +5,8 @@ language_creators:
- crowdsourced
- expert-generated
language:
- en
- de
- en
- fr
- nl
license:
1 change: 0 additions & 1 deletion datasets/casino/README.md
@@ -21,7 +21,6 @@ task_ids:
- dialogue-modeling
pretty_name: Campsite Negotiation Dialogues
paperswithcode_id: casino

---


18 changes: 10 additions & 8 deletions datasets/cc100/README.md
@@ -12,7 +12,6 @@ language:
- be
- bg
- bn
- bn-Latn
- br
- bs
- ca
@@ -39,7 +38,6 @@ language:
- ha
- he
- hi
- hi-Latn
- hr
- ht
- hu
@@ -71,7 +69,6 @@ language:
- mr
- ms
- my
- my-x-zawgyi
- ne
- nl
- 'no'
@@ -87,9 +84,9 @@ language:
- ro
- ru
- sa
- si
- sc
- sd
- si
- sk
- sl
- so
@@ -100,26 +97,31 @@ language:
- sv
- sw
- ta
- ta-Latn
- te
- te-Latn
- th
- tl
- tn
- tr
- ug
- uk
- ur
- ur-Latn
- uz
- vi
- wo
- xh
- yi
- yo
- zh
- zu
language_bcp47:
- bn-Latn
- hi-Latn
- my-x-zawgyi
- ta-Latn
- te-Latn
- ur-Latn
- zh-Hans
- zh-Hant
- zu
license:
- unknown
multilinguality:
12 changes: 5 additions & 7 deletions datasets/ccaligned_multilingual/README.md
@@ -18,9 +18,9 @@ language:
- br
- bs
- ca
- ceb
- ckb
- cs
- ceb
- cy
- de
- dv
@@ -48,6 +48,7 @@ language:
- iu
- ja
- ka
- kac
- kg
- kk
- km
@@ -71,7 +72,6 @@ language:
- ms
- mt
- my
- my
- ne
- nl
- 'no'
@@ -83,15 +83,14 @@ language:
- pl
- ps
- pt
- shn
- kac
- rm
- ro
- ru
- rw
- sc
- sd
- se
- shn
- si
- sk
- sl
@@ -116,19 +115,18 @@ language:
- tr
- ts
- tt
- zgh
- ug
- uk
- ur
- uz
- ve
- vi
- wo
- war
- wo
- xh
- yi
- yo
- zh
- zgh
- zh
- zu
- zza
2 changes: 1 addition & 1 deletion datasets/code_x_glue_tc_text_to_code/README.md
@@ -4,8 +4,8 @@ annotations_creators:
language_creators:
- found
language:
- en
- code
- en
license:
- c-uda
multilinguality:
4 changes: 2 additions & 2 deletions datasets/code_x_glue_tt_text_to_text/README.md
@@ -5,10 +5,10 @@ language_creators:
- found
language:
- da
- nb
- en
- lv
- nb
- zh
- en
license:
- c-uda
multilinguality:
11 changes: 8 additions & 3 deletions datasets/common_language/README.md
@@ -22,7 +22,7 @@ language:
- eu
- fa
- fr
- fy-NL
- fy
- ia
- id
- it
@@ -36,17 +36,22 @@ language:
- nl
- pl
- pt
- rm-sursilv
- rm
- ro
- ru
- rw
- sah
- sl
- sv-SE
- sv
- ta
- tr
- tt
- uk
- zh
language_bcp47:
- fy-NL
- rm-sursilv
- sv-SE
- zh-CN
- zh-HK
- zh-TW
19 changes: 13 additions & 6 deletions datasets/common_voice/README.md
@@ -25,8 +25,8 @@ language:
- fa
- fi
- fr
- fy-NL
- ga-IE
- fy
- ga
- hi
- hsb
- hu
@@ -44,24 +44,31 @@ language:
- mt
- nl
- or
- pa-IN
- pa
- pl
- pt
- rm-sursilv
- rm-vallader
- rm
- ro
- ru
- rw
- sah
- sl
- sv-SE
- sv
- ta
- th
- tr
- tt
- uk
- vi
- vot
- zh
language_bcp47:
- fy-NL
- ga-IE
- pa-IN
- rm-sursilv
- rm-vallader
- sv-SE
- zh-CN
- zh-HK
- zh-TW
2 changes: 2 additions & 0 deletions datasets/conv_questions/README.md
@@ -4,6 +4,8 @@ annotations_creators:
language_creators:
- crowdsourced
language:
- en
language_bcp47:
- en-US
license:
- cc-by-4.0
31 changes: 17 additions & 14 deletions datasets/covost2/README.md
@@ -5,27 +5,30 @@ language_creators:
- crowdsourced
- expert-generated
language:
- fr
- ar
- ca
- cy
- de
- es
- ca
- it
- ru
- zh-CN
- pt
- fa
- et
- fa
- fr
- id
- it
- ja
- lv
- mn
- nl
- tr
- ar
- sv-SE
- lv
- pt
- ru
- sl
- sv
- ta
- ja
- id
- cy
- tr
- zh
language_bcp47:
- sv-SE
- zh-CN
license:
- cc-by-nc-4.0
multilinguality:
2 changes: 2 additions & 0 deletions datasets/hendrycks_test/README.md
@@ -4,6 +4,8 @@ annotations_creators:
language_creators:
- expert-generated
language:
- en
language_bcp47:
- en-US
license:
- mit
4 changes: 2 additions & 2 deletions datasets/ilist/README.md
@@ -6,11 +6,11 @@ multilinguality:
task_ids:
- text-classification-other-language-identification
language:
- hi
- awa
- bho
- mag
- bra
- hi
- mag
annotations_creators:
- unknown
source_datasets:
3 changes: 3 additions & 0 deletions datasets/kan_hope/README.md
@@ -4,6 +4,9 @@ annotations_creators:
language_creators:
- crowdsourced
language:
- en
- kn
language_bcp47:
- en-IN
- kn-IN
license:

1 comment on commit aa48a29

@github-actions

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.008119 / 0.011353 (-0.003234) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004066 / 0.011008 (-0.006943) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.030542 / 0.038508 (-0.007966) |
| read_batch_unformated after write_array2d | 0.035470 / 0.023109 (0.012361) |
| read_batch_unformated after write_flattened_sequence | 0.297364 / 0.275898 (0.021466) |
| read_batch_unformated after write_nested_sequence | 0.326582 / 0.323480 (0.003102) |
| read_col_formatted_as_numpy after write_array2d | 0.006168 / 0.007986 (-0.001818) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003696 / 0.004328 (-0.000632) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007172 / 0.004250 (0.002922) |
| read_col_unformated after write_array2d | 0.043915 / 0.037052 (0.006863) |
| read_col_unformated after write_flattened_sequence | 0.321910 / 0.258489 (0.063421) |
| read_col_unformated after write_nested_sequence | 0.389756 / 0.293841 (0.095915) |
| read_formatted_as_numpy after write_array2d | 0.031108 / 0.128546 (-0.097438) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009670 / 0.075646 (-0.065976) |
| read_formatted_as_numpy after write_nested_sequence | 0.265403 / 0.419271 (-0.153869) |
| read_unformated after write_array2d | 0.052153 / 0.043533 (0.008620) |
| read_unformated after write_flattened_sequence | 0.297171 / 0.255139 (0.042032) |
| read_unformated after write_nested_sequence | 0.320095 / 0.283200 (0.036895) |
| write_array2d | 0.109113 / 0.141683 (-0.032570) |
| write_flattened_sequence | 1.435210 / 1.452155 (-0.016944) |
| write_nested_sequence | 1.483771 / 1.492716 (-0.008946) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.254683 / 0.018006 (0.236677) |
| get_batch_of_1024_rows | 0.554527 / 0.000490 (0.554037) |
| get_first_row | 0.007244 / 0.000200 (0.007044) |
| get_last_row | 0.000334 / 0.000054 (0.000279) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.024675 / 0.037411 (-0.012736) |
| shard | 0.103073 / 0.014526 (0.088548) |
| shuffle | 0.117164 / 0.176557 (-0.059392) |
| sort | 0.159845 / 0.737135 (-0.577290) |
| train_test_split | 0.120357 / 0.296338 (-0.175982) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.393527 / 0.215209 (0.178318) |
| read 50000 | 3.935866 / 2.077655 (1.858211) |
| read_batch 50000 10 | 1.848285 / 1.504120 (0.344165) |
| read_batch 50000 100 | 1.653657 / 1.541195 (0.112462) |
| read_batch 50000 1000 | 1.710331 / 1.468490 (0.241841) |
| read_formatted numpy 5000 | 0.420358 / 4.584777 (-4.164419) |
| read_formatted pandas 5000 | 3.701377 / 3.745712 (-0.044335) |
| read_formatted tensorflow 5000 | 1.981285 / 5.269862 (-3.288577) |
| read_formatted torch 5000 | 1.188367 / 4.565676 (-3.377309) |
| read_formatted_batch numpy 5000 10 | 0.051038 / 0.424275 (-0.373237) |
| read_formatted_batch numpy 5000 1000 | 0.011118 / 0.007607 (0.003511) |
| shuffled read 5000 | 0.501591 / 0.226044 (0.275547) |
| shuffled read 50000 | 4.996949 / 2.268929 (2.728020) |
| shuffled read_batch 50000 10 | 2.220168 / 55.444624 (-53.224456) |
| shuffled read_batch 50000 100 | 1.858290 / 6.876477 (-5.018187) |
| shuffled read_batch 50000 1000 | 2.059695 / 2.142072 (-0.082378) |
| shuffled read_formatted numpy 5000 | 0.532262 / 4.805227 (-4.272966) |
| shuffled read_formatted_batch numpy 5000 10 | 0.116526 / 6.500664 (-6.384138) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.059737 / 0.075469 (-0.015732) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.438312 / 1.841788 (-0.403476) |
| map fast-tokenizer batched | 13.815117 / 8.074308 (5.740809) |
| map identity | 25.064409 / 10.191392 (14.873017) |
| map identity batched | 0.863155 / 0.680424 (0.182732) |
| map no-op batched | 0.530299 / 0.534201 (-0.003902) |
| map no-op batched numpy | 0.386543 / 0.579283 (-0.192740) |
| map no-op batched pandas | 0.429548 / 0.434364 (-0.004816) |
| map no-op batched pytorch | 0.272202 / 0.540337 (-0.268135) |
| map no-op batched tensorflow | 0.269444 / 1.386936 (-1.117492) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.005825 / 0.011353 (-0.005527) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004016 / 0.011008 (-0.006993) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.029195 / 0.038508 (-0.009313) |
| read_batch_unformated after write_array2d | 0.033049 / 0.023109 (0.009940) |
| read_batch_unformated after write_flattened_sequence | 0.310527 / 0.275898 (0.034629) |
| read_batch_unformated after write_nested_sequence | 0.363334 / 0.323480 (0.039854) |
| read_col_formatted_as_numpy after write_array2d | 0.004023 / 0.007986 (-0.003962) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004899 / 0.004328 (0.000571) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.005008 / 0.004250 (0.000758) |
| read_col_unformated after write_array2d | 0.046072 / 0.037052 (0.009020) |
| read_col_unformated after write_flattened_sequence | 0.310426 / 0.258489 (0.051937) |
| read_col_unformated after write_nested_sequence | 0.345059 / 0.293841 (0.051218) |
| read_formatted_as_numpy after write_array2d | 0.029872 / 0.128546 (-0.098675) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009917 / 0.075646 (-0.065729) |
| read_formatted_as_numpy after write_nested_sequence | 0.262600 / 0.419271 (-0.156672) |
| read_unformated after write_array2d | 0.053979 / 0.043533 (0.010446) |
| read_unformated after write_flattened_sequence | 0.305929 / 0.255139 (0.050790) |
| read_unformated after write_nested_sequence | 0.316926 / 0.283200 (0.033727) |
| write_array2d | 0.107305 / 0.141683 (-0.034378) |
| write_flattened_sequence | 1.484026 / 1.452155 (0.031871) |
| write_nested_sequence | 1.549523 / 1.492716 (0.056806) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.246984 / 0.018006 (0.228978) |
| get_batch_of_1024_rows | 0.551244 / 0.000490 (0.550754) |
| get_first_row | 0.000418 / 0.000200 (0.000218) |
| get_last_row | 0.000057 / 0.000054 (0.000003) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.025853 / 0.037411 (-0.011558) |
| shard | 0.103791 / 0.014526 (0.089265) |
| shuffle | 0.116076 / 0.176557 (-0.060481) |
| sort | 0.162653 / 0.737135 (-0.574482) |
| train_test_split | 0.122584 / 0.296338 (-0.173754) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.404309 / 0.215209 (0.189100) |
| read 50000 | 4.041413 / 2.077655 (1.963759) |
| read_batch 50000 10 | 1.849328 / 1.504120 (0.345208) |
| read_batch 50000 100 | 1.661874 / 1.541195 (0.120679) |
| read_batch 50000 1000 | 1.705473 / 1.468490 (0.236983) |
| read_formatted numpy 5000 | 0.429610 / 4.584777 (-4.155167) |
| read_formatted pandas 5000 | 3.771480 / 3.745712 (0.025768) |
| read_formatted tensorflow 5000 | 3.284011 / 5.269862 (-1.985850) |
| read_formatted torch 5000 | 1.833408 / 4.565676 (-2.732268) |
| read_formatted_batch numpy 5000 10 | 0.051444 / 0.424275 (-0.372831) |
| read_formatted_batch numpy 5000 1000 | 0.010918 / 0.007607 (0.003311) |
| shuffled read 5000 | 0.502993 / 0.226044 (0.276948) |
| shuffled read 50000 | 5.103145 / 2.268929 (2.834216) |
| shuffled read_batch 50000 10 | 2.372679 / 55.444624 (-53.071945) |
| shuffled read_batch 50000 100 | 2.044437 / 6.876477 (-4.832040) |
| shuffled read_batch 50000 1000 | 2.147277 / 2.142072 (0.005205) |
| shuffled read_formatted numpy 5000 | 0.537999 / 4.805227 (-4.267228) |
| shuffled read_formatted_batch numpy 5000 10 | 0.123958 / 6.500664 (-6.376706) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.064969 / 0.075469 (-0.010500) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.461062 / 1.841788 (-0.380726) |
| map fast-tokenizer batched | 13.768847 / 8.074308 (5.694539) |
| map identity | 25.333782 / 10.191392 (15.142390) |
| map identity batched | 0.860687 / 0.680424 (0.180263) |
| map no-op batched | 0.534203 / 0.534201 (0.000002) |
| map no-op batched numpy | 0.386703 / 0.579283 (-0.192580) |
| map no-op batched pandas | 0.433694 / 0.434364 (-0.000670) |
| map no-op batched pytorch | 0.276734 / 0.540337 (-0.263604) |
| map no-op batched tensorflow | 0.286846 / 1.386936 (-1.100090) |
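
Each benchmark value above has the form `new / old (diff)`, where the figure in parentheses appears to be `new - old`. The sketch below parses one such value and checks that relationship. The helper names `parse_cell` and `is_consistent` are hypothetical, and the small tolerance is an assumption to allow for the diff having been computed before rounding to six decimals.

```python
import re

# One benchmark value, e.g. "0.008119 / 0.011353 (-0.003234)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell):
    """Return (new, old, diff) as floats from a 'new / old (diff)' string."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"not a 'new / old (diff)' value: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def is_consistent(cell, tol=2e-6):
    """Check that the reported diff equals new - old within a rounding tolerance."""
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol


print(parse_cell("0.008119 / 0.011353 (-0.003234)"))
print(is_consistent("0.265403 / 0.419271 (-0.153869)"))
```

A negative diff means the new timing is lower than the old one for that metric.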
