Release 1.16.0 · huggingface/datasets

Datasets Changes

New: riddle_sense by @ziyiwu9494 in #3161
New: Multi-Lingual LibriSpeech by @patrickvonplaten in #3198
New: XCSR by @yangxqiao in #3074
New: CMU Hinglish DoG by @Ishan-Kumar2 in #3149
New: Multidoc2dial by @sivasankalpp in #3205
New: IndoNLI by @afaji in #3307
Update: DaNE - updated URL for download by @MalteHB in #3203
Update: xcopa - (fix checksum issues + add translated data) by @mariosasko in #3254
Update: tatoeba - update to v2021-07-22 by @KoichiYasuoka in #3225
Update: KILT - update metadata JSON by @albertvillanova in #3276
Update: Covost 2 - update download instructions by @patrickvonplaten in #3281
Update: Common Voice, OpenSLR, LibriSpeech ASR, Vivos - make several audio datasets streamable by @lhoestq in #3290
Fix: tuple_ie - fix download url by @mariosasko in #3213
Fix: id_newspapers_2018 - fix streaming by @lhoestq in #3249
Fix: bookcorpusopen - fix RAM usage by @lhoestq in #3280
Fix: Scielo - fix ConnectionError by @mariosasko in #3260
Fix: tatoeba - fix URLs for a subset of xtreme by @mariosasko in #3321

Datasets Features

Push to hub capabilities for Dataset and DatasetDict by @LysandreJik in #3098:
- Upload your dataset to the Hugging Face Hub with the push_to_hub() method!
- See documentation here
200+ datasets now support streaming:
- Stream TAR-based dataset using iter_archive by @lhoestq in #3110
- Stream from Google Drive and other hosts by @lhoestq in #3248
- Support Audio feature in streaming mode by @albertvillanova in #3133
- Support Audio feature for TAR archives in sequential access by @albertvillanova in #3129
Resolve data_files by split name automatically by @lhoestq in #3221
- File names are used to decide which file goes into which split
- See documentation here
Filter method for batched=True by @thomasw21 in #3244
Adding with_rank arg to pass process rank to map by @TevenLeScao in #3314
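
The last two features change the shape of the function you pass in. As a minimal sketch (not the library's own code): a batched filter function receives a dict of column lists and returns one boolean per row, and a map function used with with_rank receives the process rank as an extra argument. The function names, the "text" column and the "worker" field below are hypothetical, and a plain dict stands in for a real batch so the example runs without the library.

```python
from typing import Dict, List


def keep_long_texts(batch: Dict[str, List[str]]) -> List[bool]:
    """Batched predicate: return one boolean per row in the batch."""
    return [len(text.split()) >= 3 for text in batch["text"]]


def tag_with_rank(example: Dict[str, str], rank: int) -> Dict[str, str]:
    """Per-example function that also receives the worker's process rank."""
    return {"text": example["text"], "worker": f"worker-{rank}"}


# Plain-dict stand-in for a batch of rows (column name -> list of values).
batch = {"text": ["too short", "this one is long enough", "also a long sentence"]}
mask = keep_long_texts(batch)
kept = [text for text, keep in zip(batch["text"], mask) if keep]

# Simulate what a worker with rank 1 would produce for the first kept row.
tagged = tag_with_rank({"text": kept[0]}, rank=1)

# With the library installed, the same callables would be passed as:
#   ds = ds.filter(keep_long_texts, batched=True)
#   ds = ds.map(tag_with_rank, with_rank=True, num_proc=2)
```

A typical use of the rank is to pick a device per worker process, for example one GPU per rank.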

Dataset Cards

Add full tagset to conll2003 README by @BramVanroy in #3230
Fix some contact information formats by @lhoestq in #3274
Add wikipedia tags by @lhoestq in #3301
Updating details of IRC disentanglement data by @jkkummerfeld in #3259

Metrics Changes

New: OpenAI's pass@k code evaluation metric by @lvwerra in #2916
Update: BLEURT - options to use updated bleurt checkpoints by @jaehlee in #3235
Update: CER - update to support latest release by @mariosasko in #3252
Update: WER - update to the documentation by @wooters in #3278
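
For readers unfamiliar with pass@k, the quantity can be sketched as follows. This is the unbiased estimator as commonly stated for code generation evaluation, 1 - C(n-c, k) / C(n, k), written here as an illustration and not taken from the metric's source in #2916. The function name pass_at_k and its argument names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples of which c pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples per problem, 3 of them pass: estimate for k = 1 and k = 5.
estimate_k1 = pass_at_k(n=10, c=3, k=1)
estimate_k5 = pass_at_k(n=10, c=3, k=5)
```

With k = 1 the estimate reduces to the plain pass rate c / n; larger k values give the chance that at least one of k drawn samples passes.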

Documentation

Add docs for to_tf_dataset by @stevhliu in #3175
Small updates to to_tf_dataset documentation by @Rocketknight1 in #3215
Update link to Datasets Tagging app in Spaces by @albertvillanova in #3194
Improve repository structure docs by @lhoestq in #3233
Swap descriptions of v1 and raw-v1 configs of WikiText dataset and fix metadata by @albertvillanova in #3241
Add docs for audio processing by @stevhliu in #3222
Add push_to_hub docs by @lhoestq in #3319

Additional improvements and bug fixes

Catch token invalid error in CI by @lhoestq in #3200
Pin keras version until TF fixes its release by @albertvillanova in #3208
Fix disable_nullable default value to False by @lhoestq in #3211
Fix code quality in riddle_sense dataset by @albertvillanova in #3218
Better error msg if len(predictions) doesn't match len(references) in metrics by @mariosasko in #3160
Use huggingface_hub.HfApi to list datasets/metrics by @mariosasko in #3121
Pin version exclusion for tensorflow incompatible with keras by @albertvillanova in #3216
Group tests in multiprocessing workers by test file by @albertvillanova in #3231
Fix load_from_disk temporary directory by @lhoestq in #3245
[tiny] fix typo in stream docs by @nollied in #3246
Avoid PyArrow type optimization if it fails by @mariosasko in #3234
Remove redundant isort module placement by @mariosasko in #3243
asserts replaced by exception for text classification task with test. by @manisnesan in #3256
Add os.listdir for streaming by @lhoestq in #3270
asserts replaced with exception for image classification task, csv, json by @manisnesan in #3262
Force data files extraction if download_mode='force_redownload' by @mariosasko in #3275
Minor Typo Fix - Precision to Recall by @SebastinSanty in #3279
Decode audio from remote by @lhoestq in #3271
Fix build_docs CI by @lhoestq in #3286
Allow datasets with indices table when concatenating along axis=1 by @mariosasko in #3288
f-string formatting by @Mehdi2402 in #3277
Unpin markdown for build_docs now that it's fixed by @lhoestq in #3289
Pin version exclusion for Markdown by @albertvillanova in #3293
Use f-strings in the dataset scripts by @Carlosbogo in #3291
fix old_val typo in f-string by @Mehdi2402 in #3302
asserts replaced with exception for fingerprint.py, search.py, arrow_writer.py and metric.py by @Ishan-Kumar2 in #3305
fix: files counted twice in inferred structure by @borisdayma in #3309
Finish transition to PyArrow 3.0.0 by @mariosasko in #3318
Removing query params for dynamic URL caching by @anton-l in #3315

Citation

Update BibTeX entry by @albertvillanova in #3223
Fix paper BibTeX citation with proceedings reference by @albertvillanova in #3226
Add CITATION file by @albertvillanova in #3228
Fix URL in CITATION file by @albertvillanova in #3229

Deprecations

Deprecate prepare_module by @albertvillanova in #3166

Full Changelog: 1.15.1...1.16.0
