1.16.0
Datasets Changes
- New: riddle_sense by @ziyiwu9494 in #3161
- New: Multi-Lingual LibriSpeech by @patrickvonplaten in #3198
- New: XCSR by @yangxqiao in #3074
- New: CMU Hinglish DoG by @Ishan-Kumar2 in #3149
- New: Multidoc2dial by @sivasankalpp in #3205
- New: IndoNLI by @afaji in #3307
- Update: DaNE - updated URL for download by @MalteHB in #3203
- Update: xcopa - (fix checksum issues + add translated data) by @mariosasko in #3254
- Update: tatoeba - update to v2021-07-22 by @KoichiYasuoka in #3225
- Update: KILT - update metadata JSON by @albertvillanova in #3276
- Update: Covost 2 - update download instructions by @patrickvonplaten in #3281
- Update: Common Voice, OpenSLR, LibriSpeech ASR, Vivos - make several audio datasets streamable by @lhoestq in #3290
- Fix: tuple_ie - fix download url by @mariosasko in #3213
- Fix: id_newspapers_2018 - fix streaming by @lhoestq in #3249
- Fix: bookcorpusopen - fix RAM usage by @lhoestq in #3280
- Fix: Scielo - fix ConnectionError by @mariosasko in #3260
- Fix: tatoeba - fix URLs for a subset of xtreme by @mariosasko in #3321
Datasets Features
- Push to hub capabilities for Dataset and DatasetDict by @LysandreJik in #3098:
  - upload your dataset to the Hugging Face Hub with the push_to_hub() method!
  - See documentation here
- 200+ datasets now support streaming:
  - Stream TAR-based dataset using iter_archive by @lhoestq in #3110
  - Stream from Google Drive and other hosts by @lhoestq in #3248
  - Support Audio feature in streaming mode by @albertvillanova in #3133
  - Support Audio feature for TAR archives in sequential access by @albertvillanova in #3129
- Resolve data_files by split name automatically by @lhoestq in #3221
  - It takes into account the file names to know which file goes into which split
  - See documentation here
- Filter method for batched=True by @thomasw21 in #3244
- Adding with_rank arg to pass process rank to map by @TevenLeScao in #3314
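As a rough illustration of what "resolve data_files by split name" means, the sketch below assigns files to splits by looking for split keywords in their names. The keyword table, the helper names infer_split and group_by_split, and the fallback to "train" are assumptions made for this example. It is not the pattern matching implemented in #3221.

```python
import re
from collections import defaultdict

# Hypothetical keyword table for this sketch; the library's real patterns may differ.
SPLIT_KEYWORDS = {
    "train": ("train", "training"),
    "validation": ("validation", "valid", "val", "dev"),
    "test": ("test", "testing", "eval"),
}


def infer_split(file_name):
    """Guess the split of a file from the tokens in its name."""
    # Break the lowercased name into alphanumeric tokens: "data/dev-00.csv" -> data, dev, 00, csv
    tokens = re.split(r"[^a-z0-9]+", file_name.lower())
    for split, keywords in SPLIT_KEYWORDS.items():
        if any(token in keywords for token in tokens):
            return split
    # Assumption of this sketch: files without a split keyword go to "train".
    return "train"


def group_by_split(file_names):
    """Group file names into a {split: [files]} mapping."""
    groups = defaultdict(list)
    for name in file_names:
        groups[infer_split(name)].append(name)
    return dict(groups)


files = ["train.csv", "data/dev-00.csv", "my_test_file.csv", "notes.csv"]
print(group_by_split(files))
```

Matching on whole tokens rather than substrings avoids false hits such as "latest.csv" being read as a test file.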
Dataset Cards
- Add full tagset to conll2003 README by @BramVanroy in #3230
- Fix some contact information formats by @lhoestq in #3274
- Add wikipedia tags by @lhoestq in #3301
- Updating details of IRC disentanglement data by @jkkummerfeld in #3259
Metrics Changes
- New: OpenAI's pass@k code evaluation metric by @lvwerra in #2916
- Update: BLEURT - options to use updated bleurt checkpoints by @jaehlee in #3235
- Update: CER - update to support latest release by @mariosasko in #3252
- Update: WER - update to the documentation by @wooters in #3278
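For readers unfamiliar with pass@k, here is a minimal sketch of the unbiased estimator from the paper that introduced the metric (Chen et al., "Evaluating Large Language Models Trained on Code"): with n generated samples for a problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function name pass_at_k and the input checks are choices made for this example. This is not the code of the metric added in #2916, which also has to execute the generated programs.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem.

    n: number of generated samples
    c: number of samples that pass the tests
    k: number of samples the metric is allowed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # With fewer than k failing samples, every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # Probability that a random draw of k samples contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The score for a benchmark is the mean of this value over all problems.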
Documentation
- Add docs for to_tf_dataset by @stevhliu in #3175
- Small updates to to_tf_dataset documentation by @Rocketknight1 in #3215
- Update link to Datasets Tagging app in Spaces by @albertvillanova in #3194
- Improve repository structure docs by @lhoestq in #3233
- Swap descriptions of v1 and raw-v1 configs of WikiText dataset and fix metadata by @albertvillanova in #3241
- Add docs for audio processing by @stevhliu in #3222
- Add push_to_hub docs by @lhoestq in #3319
Additional improvements and bug fixes
- Catch token invalid error in CI by @lhoestq in #3200
- Pin keras version until TF fixes its release by @albertvillanova in #3208
- Fix disable_nullable default value to False by @lhoestq in #3211
- Fix code quality in riddle_sense dataset by @albertvillanova in #3218
- Better error msg if len(predictions) doesn't match len(references) in metrics by @mariosasko in #3160
- Use huggingface_hub.HfApi to list datasets/metrics by @mariosasko in #3121
- Pin version exclusion for tensorflow incompatible with keras by @albertvillanova in #3216
- Group tests in multiprocessing workers by test file by @albertvillanova in #3231
- Fix load_from_disk temporary directory by @lhoestq in #3245
- [tiny] fix typo in stream docs by @nollied in #3246
- Avoid PyArrow type optimization if it fails by @mariosasko in #3234
- Remove redundant isort module placement by @mariosasko in #3243
- asserts replaced by exception for text classification task with test. by @manisnesan in #3256
- Add os.listdir for streaming by @lhoestq in #3270
- asserts replaced with exception for image classification task, csv, json by @manisnesan in #3262
- Force data files extraction if download_mode='force_redownload' by @mariosasko in #3275
- Minor Typo Fix - Precision to Recall by @SebastinSanty in #3279
- Decode audio from remote by @lhoestq in #3271
- Fix build_docs CI by @lhoestq in #3286
- Allow datasets with indices table when concatenating along axis=1 by @mariosasko in #3288
- f-string formatting by @Mehdi2402 in #3277
- Unpin markdown for build_docs now that it's fixed by @lhoestq in #3289
- Pin version exclusion for Markdown by @albertvillanova in #3293
- Use f-strings in the dataset scripts by @Carlosbogo in #3291
- fix old_val typo in f-string by @Mehdi2402 in #3302
- asserts replaced with exception for fingerprint.py, search.py, arrow_writer.py and metric.py by @Ishan-Kumar2 in #3305
- fix: files counted twice in inferred structure by @borisdayma in #3309
- Finish transition to PyArrow 3.0.0 by @mariosasko in #3318
- Removing query params for dynamic URL caching by @anton-l in #3315
Citation
- Update BibTeX entry by @albertvillanova in #3223
- Fix paper BibTeX citation with proceedings reference by @albertvillanova in #3226
- Add CITATION file by @albertvillanova in #3228
- Fix URL in CITATION file by @albertvillanova in #3229
Deprecations
- Deprecate prepare_module by @albertvillanova in #3166
Full Changelog: 1.15.1...1.16.0