Releases: huggingface/datasets
1.18.3
Bug fixes
- Fix MP3 resampling when a dataset's audio files have different sampling rates by @lhoestq in #3665
- Extend dataset builder for streaming in `get_dataset_split_names` by @mariosasko in #3657
Dataset changes
- New: Turkic X-WMT evaluation set for machine translation by @mirzakhalov in #3605
- New: British Library books dataset by @davanstrien in #3603
- Fix: wiki_bio - Update link by @jxmorris12 in #3651
Other improvements
- sp. Columbia => Colombia by @serapio in #3652
- Run pyupgrade for Python 3.6+ by @bryant1410 in #3560
New Contributors
- @serapio made their first contribution in #3652
- @mirzakhalov made their first contribution in #3605
Full Changelog: 1.18.2...1.18.3
1.18.2
Bug fixes
- Fix streaming datasets that are not reset correctly by @lhoestq in #3646
- Fix numpy rngs when shuffling with seed=None by @mariosasko in #3641
- Fix dataset slicing with negative bounds when indices mapping is not `None` by @mariosasko in #3642
- Fix `add_column` on datasets with indices mapping by @mariosasko in #3647
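To illustrate the slicing fix above: when a dataset carries an indices mapping, a slice has to be resolved against the length of the mapping before it is translated into positions in the underlying table. A minimal pure-Python sketch of that idea, assuming a list stands in for the table; `select_rows` is a hypothetical helper, not the library's code:

```python
def select_rows(table, indices, key):
    """Resolve a slice against an optional indices mapping (illustrative only)."""
    # The visible length is that of the mapping when one exists.
    n = len(indices) if indices is not None else len(table)
    # slice.indices() normalizes negative and missing bounds for length n.
    positions = range(*key.indices(n))
    if indices is None:
        return [table[i] for i in positions]
    # Translate visible positions into rows of the underlying table.
    return [table[indices[i]] for i in positions]


table = list("abcdef")
print(select_rows(table, [5, 3, 1], slice(-2, None)))
```

The last two visible rows are taken from the mapping `[5, 3, 1]`, not from the end of the table.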
Other improvements
- Update index.rst by @VioletteLepercq in #3636
- Fix Windows CI: bump python to 3.7 by @lhoestq in #3648
New Contributors
- @VioletteLepercq made their first contribution in #3636
Full Changelog: 1.18.1...1.18.2
1.18.1
Improvements
- Make decoding of Audio and Image feature optional by @mariosasko in #3430
Bug fixes
- Fix `prepare_for_task()` by @mariosasko in #3614
- Fix: Multilingual Librispeech - fix bad url formatting by @polinaeterna in #3619
Full Changelog: 1.18.0...1.18.1
1.18.0
Datasets Changes
- New: VCTK
- New: CPPE-5 dataset by @mariosasko in #3517
- New: RedCaps dataset by @mariosasko in #3424
- New: WIDER FACE dataset by @mariosasko in #3413
- New: SVHN dataset by @mariosasko in #3535
- New: BNL newspapers by @davanstrien in #3397
- New: PASS dataset by @mariosasko in #3576
- New: Text2log Dataset by @apergo-ai in #3579
- Update: beans, cats_vs_dogs - Use `iter_files` instead of `str(Path(...))` in image dataset by @mariosasko in #3477
- Update: PIB - update version and make it streamable by @albertvillanova in #3496
- Update: code_x_glue_tt_text_to_text, compguesswhat - Remove print statements in datasets by @mariosasko in #3546
- Update: MuchoCine - add missing tasks by @mariosasko in #3571
- Fix: Tashkeela - fix to yield stripped text by @albertvillanova in #3471
- Fix: asset - change to raw.githubusercontent.com URLs by @VictorSanh in #3516
- Fix: CC100 - use HTTPS for the data source URL by @aajanki in #3519
- Fix: vision datasets - Fix bug in `ImageClassification` task template by @mariosasko in #3557
- Fix: tweet_qa - fix `DuplicatedKeysError` and improve card by @mariosasko in #3559
- Fix: mC4 - fix multiple language downloading by @polinaeterna in #3594
- Fix: CoNLL2003:
Datasets Features
- [Time series] Add support for time, date, duration, and decimal dtypes by @mariosasko in #3591
- [Image][Audio] Add flexible casting for Image and Audio + Support nested casting by @lhoestq in #3575
- Allows DatasetDict.filter to have batching option by @thomasw21 in #3506
- Add desc parameter to filter by @mariosasko in #3513
- Add `gzip` for `to_json` by @bhavitvyamalik in #3492
- Allow multiple task templates of the same type by @mariosasko in #3562
- Add parameter `preserve_index` to `from_pandas` by @Sorrow321 in #3565
- Dataset Streaming:
- Fix `str(Path(...))` conversion in streaming on Linux by @mariosasko in #3472
- Extend support for streaming datasets that use ET.parse by @albertvillanova in #3476
- Extend support for streaming datasets that use os.walk by @albertvillanova in #3478
Metrics Changes
- Add Mauve metric by @jthickstun in #3573
Dataset cards
- update `pretty_name` for first 200 datasets by @bhavitvyamalik in #3498
- update `pretty_name` for all the other datasets by @bhavitvyamalik in #3536
- pib: Update pib dataset card by @albertvillanova in #3501
- arabic_speech_corpus: Adding link to license. by @meg-huggingface in #3524
- Covost2: Update README.md by @meg-huggingface in #3528
- librispeech_asr: Update README.md by @meg-huggingface in #3529
- vivos: Update README.md by @meg-huggingface in #3530
- audio datasets: Audio datacard update - first pass by @meg-huggingface in #3520
- common_language: Update README.md by @meg-huggingface in #3527
- wiki_dpr: Update wiki_dpr README.md by @lhoestq in #3534
- qa4mre: Fix qa4mre tags by @lhoestq in #3574
- HellaSwag: Update HellaSwag README.md by @borgr in #3588
- ANLI: Update ANLI README.md by @borgr in #3590
- tweet_eval: Update README.md by @borgr in #3593
Documentation
- Fix rendering of docs by @albertvillanova in #3470
- Fix to_tf_dataset references in docs by @mariosasko in #3514
- added PII statements and license links to data cards by @mcmillanmajora in #3537
- Readme usage update by @meg-huggingface in #3538
- Update the CC-100 dataset card by @aajanki in #3542
- Research wording for nc licenses by @meg-huggingface in #3539
- Added links to licensing and PII message in vctk dataset by @mcmillanmajora in #3523
- Give clearer instructions to add the YAML tags by @albertvillanova in #3532
General improvements and bug fixes
- Fix overriding of filesystem info by @albertvillanova in #3481
- Update ADD_NEW_DATASET.md by @apergo-ai in #3487
- Fix weird spacing in ManualDownloadError message by @bryant1410 in #3486
- Clone full repo to detect new tags when mirroring datasets on the Hub by @lhoestq in #3494
- Remove unused phony rule from Makefile by @bryant1410 in #3483
- fix: 🐛 pass token when retrieving the split names by @severo in #3545
- Pin torchmetrics to fix the COMET test by @lhoestq in #3589
- Preserve encoding/decoding with features in `Iterable.map` call by @mariosasko in #3556
New Contributors
- @apergo-ai made their first contribution in #3487
- @bryant1410 made their first contribution in #3486
- @meg-huggingface made their first contribution in #3527
- @aajanki made their first contribution in #3519
- @Sorrow321 made their first contribution in #3565
- @jthickstun made their first contribution in #3573
- @borgr made their first contribution in #3588
Full Changelog: 1.17.0...1.18.0
1.17.0
Dataset Changes
- New: The Pile
- Add The Pile dataset and PubMed Central subset by @albertvillanova in #3287
- Add The Pile Free Law subset by @albertvillanova in #3359
- Add The Pile USPTO subset by @albertvillanova in #3360
- Add The Pile subsets by @albertvillanova in #3378
- Add The Pile Enron Emails subset by @albertvillanova in #3427
- New: British Library Books Genre by @davanstrien in #3312
- New: Americas NLI by @fdschmidt93 in #3371
- New: Speech commands by @polinaeterna in #3335
- New: eli5_category by @jingshenSN2 in #3420
- New: OneStopQa by @scaperex in #3436
- Update: LABR - make the dataset streamable by @albertvillanova in #3352
- Update: CLUE benchmark - update cluewsc2020, chid, c3 and tnews by @mariosasko in #3376
- Update: beans, cats_vs_dogs, cifar10, cifar100, fashion_mnist, mnist, head_qa: use the new Image feature type + streaming support by @mariosasko in #3362
- Update: CC100 - add Georgian data by @AnzorGozalishvili in #3383
- Update: disaster_response_messages - update download urls (+ add validation split) by @mariosasko in #3426
- Update: swahili_news - update to new version by @albertvillanova in #3463
- Fix: WikiAuto, Jeopardy, definite_pronoun_resolution - fix URLs by @LashaO in #3266
- Fix: QED - fix type of bridge field by @mariosasko in #3417
- Fix: ASSET - fix dataset data URLs by @tianjianjiang in #3342
Dataset Features
- Add Image feature by @mariosasko in #3163
- to_tf_dataset() refactor by @Rocketknight1 in #3356
- More robust `None` handling by @mariosasko in #3195
- Add `cast_column` to `IterableDataset` by @mariosasko in #3439
- Support streaming zipped dataset repo by passing only repo name by @albertvillanova in #3375
- Extend support for streaming datasets that use pd.read_excel by @albertvillanova in #3355
- Extend iter_archive to support file object input by @albertvillanova in #3443
- Extend text to support yielding lines, paragraphs or documents by @albertvillanova in #3442
- Push dataset_infos.json to Hub to preserve feature types by @lhoestq in #3467
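The three granularities mentioned for the text loader (lines, paragraphs, documents) can be pictured with a small pure-Python sketch. It assumes paragraphs are separated by a blank line; the function `iter_samples` and its `sample_by` argument are illustrative names here, not a statement of the loader's actual signature:

```python
def iter_samples(text, sample_by="line"):
    """Yield samples from a text at the requested granularity (illustrative)."""
    if sample_by == "line":
        # One sample per line, without the trailing newline.
        yield from text.splitlines()
    elif sample_by == "paragraph":
        # Assumption: paragraphs are separated by one blank line.
        yield from (p for p in text.split("\n\n") if p)
    elif sample_by == "document":
        # The whole text is a single sample.
        yield text
    else:
        raise ValueError(f"unknown sample_by: {sample_by!r}")


text = "first line\nsecond line\n\nnew paragraph"
for mode in ("line", "paragraph", "document"):
    print(mode, list(iter_samples(text, mode)))
```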
Dataset cards
- Change TriviaQA license (#3313) by @avinashsai in #3330
- Add missing tags to XTREME by @mariosasko in #3322
- Remove duplicate name from dataset cards by @albertvillanova in #3354
- Fix typos in dataset cards by @albertvillanova in #3386
- Fix duplicated tag in wikicorpus dataset card by @lhoestq in #3458
Dataset Tasks
- Create Language Modeling task by @albertvillanova in #3387
Metric Changes
- BLEURT: Match key names to correspond with filename by @jaehlee in #3348
- Fix links in metrics description by @albertvillanova in #3461
- Fix METEOR missing NLTK's omw-1.4 by @lhoestq in #3469
Docs
- Add ArrayXD docs by @stevhliu in #3344
- Document a training loop for streaming dataset by @lhoestq in #3370
- Fix formatting in IterableDataset.map docs by @mariosasko in #3395
- Correctly indent builder config in dataset script docs by @mariosasko in #3432
- Update BLEURT hyperlink by @lewtun in #3437
Additional improvements and bug fixes
- Quick fix error formatting by @NouamaneTazi in #3328
- Fix error message and add extension fallback by @mariosasko in #3332
- Avoid content-encoding issue while streaming datasets by @albertvillanova in #3350
- Fix JSON ClassLabel casting for integers by @lhoestq in #3340
- Better error message when download fails by @lhoestq in #3343
- Fix dict source_datasets tagset validator by @albertvillanova in #3368
- Fix typo in other-structured-to-text task tag by @albertvillanova in #3367
- Fix temporary dataset_path creation for URIs related to remote fs by @francisco-perez-sorrosal in #3296
- Fix flaky test of the temporary directory used by load_from_disk by @lhoestq in #3388
- More robust first elem check in encode/cast example by @mariosasko in #3402
- Fix module inference for archive with a directory by @albertvillanova in #3406
- Fix dependencies conflicts in Windows CI after conda update to 4.11 by @lhoestq in #3410
- Pass new_fingerprint in multiprocessing by @lhoestq in #3409
- Fix flaky test again for s3 serialization by @lhoestq in #3412
- Skip None encoding (line deleted by accident in #3195) by @mariosasko in #3414
- Clean squad dummy data by @lhoestq in #3428
- #3337 Add typing overloads to `Dataset.__getitem__` for mypy by @Dref360 in #3382
- Make cast cacheable (again) on Windows by @mariosasko in #3429
- Use max number of data files to infer module by @albertvillanova in #3407
- Fix iter_archive generator by @albertvillanova in #3454
- [Staging] Update dataset repos automatically on the Hub by @lhoestq in #3451
- Update supported versions of Python in setup.py by @mariosasko in #3438
- raise exception instead of using assertions. by @manisnesan in #3349
New Contributors
- @avinashsai made their first contribution in #3330
- @NouamaneTazi made their first contribution in #3328
- @davanstrien made their first contribution in #3312
- @francisco-perez-sorrosal made their first contribution in #3296
- @LashaO made their first contribution in #3266
- @fdschmidt93 made their first contribution in #3371
- @polinaeterna made their first contribution in #3335
- @AnzorGozalishvili made their first contribution in #3383
- @tianjianjiang made their first contribution in #3342
- @jingshenSN2 made their first contribution in #3420
- @scaperex made their first contribution in #3436
Full Changelog: 1.16.1...1.17.0
1.16.1
1.16.0
Datasets Changes
- New: riddle_sense by @ziyiwu9494 in #3161
- New: Multi-Lingual LibriSpeech by @patrickvonplaten in #3198
- New: XCSR by @yangxqiao in #3074
- New: CMU Hinglish DoG by @Ishan-Kumar2 in #3149
- New: Multidoc2dial by @sivasankalpp in #3205
- New: IndoNLI by @afaji in #3307
- Update: DaNE - updated URL for download by @MalteHB in #3203
- Update: xcopa - (fix checksum issues + add translated data) by @mariosasko in #3254
- Update: tatoeba - update to v2021-07-22 by @KoichiYasuoka in #3225
- Update: KILT - update metadata JSON by @albertvillanova in #3276
- Update: Covost 2 - update download instructions by @patrickvonplaten in #3281
- Update: Common Voice, OpenSLR, LibriSpeech ASR, Vivos - make several audio datasets streamable by @lhoestq in #3290
- Fix: tuple_ie - fix download url by @mariosasko in #3213
- Fix: id_newspapers_2018 - fix streaming by @lhoestq in #3249
- Fix: bookcorpusopen - fix RAM usage by @lhoestq in #3280
- Fix: Scielo - fix ConnectionError by @mariosasko in #3260
- Fix: tatoeba - fix URLs for a subset of xtreme by @mariosasko in #3321
Datasets Features
- Push to hub capabilities for `Dataset` and `DatasetDict` by @LysandreJik in #3098:
- upload your dataset to the Hugging Face Hub with the `push_to_hub()` method!
- See documentation here
- 200+ datasets now support streaming:
- Stream TAR-based dataset using iter_archive by @lhoestq in #3110
- Stream from Google Drive and other hosts by @lhoestq in #3248
- Support Audio feature in streaming mode by @albertvillanova in #3133
- Support Audio feature for TAR archives in sequential access by @albertvillanova in #3129
- Resolve data_files by split name automatically by @lhoestq in #3221
- It takes into account the file names to know which file goes into which split
- See documentation here
- Filter method for batched=True by @thomasw21 in #3244
- Adding `with_rank` arg to pass process rank to `map` by @TevenLeScao in #3314
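The automatic resolution of data files by split name can be pictured as keyword matching on file names. The sketch below is a simplified illustration: the keyword lists, the fallback to `train`, and the helper `infer_splits` are assumptions made for the example, not the library's actual patterns.

```python
import re
from collections import defaultdict

# Assumed keyword lists, for illustration only.
SPLIT_KEYWORDS = {
    "train": ["train", "training"],
    "validation": ["validation", "valid", "dev", "val"],
    "test": ["test", "testing", "eval"],
}


def infer_splits(filenames):
    """Group file names by the split keyword found in them (illustrative)."""
    splits = defaultdict(list)
    for name in filenames:
        # Split the name into lowercase alphanumeric tokens.
        tokens = re.split(r"[^a-z0-9]+", name.lower())
        for split, keywords in SPLIT_KEYWORDS.items():
            if any(tok in keywords for tok in tokens):
                splits[split].append(name)
                break
        else:
            # Assumption: files with no recognizable keyword go to "train".
            splits["train"].append(name)
    return dict(splits)


print(infer_splits(["data/train.csv", "data/test.csv", "dev-00000.json"]))
```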
Dataset Cards
- Add full tagset to conll2003 README by @BramVanroy in #3230
- Fix some contact information formats by @lhoestq in #3274
- Add wikipedia tags by @lhoestq in #3301
- Updating details of IRC disentanglement data by @jkkummerfeld in #3259
Metrics Changes
- New: OpenAI's pass@k code evaluation metric by @lvwerra in #2916
- Update: BLEURT - options to use updated bleurt checkpoints by @jaehlee in #3235
- Update: CER - update to support latest release by @mariosasko in #3252
- Update: WER - update to the documentation by @wooters in #3278
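For readers unfamiliar with pass@k: it estimates the probability that at least one of k sampled programs passes the unit tests. A minimal sketch of the commonly cited unbiased estimator, 1 - C(n-c, k) / C(n, k), assuming n generated samples of which c are correct; the function name `pass_at_k` is chosen for this example and the code is not the metric's implementation in the library:

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples with c correct ones (illustrative)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k incorrect samples: any draw of k contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that all k drawn samples are incorrect, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

With k=1 the estimate reduces to the fraction of correct samples, c/n.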
Documentation
- Add docs for `to_tf_dataset` by @stevhliu in #3175
- Small updates to to_tf_dataset documentation by @Rocketknight1 in #3215
- Update link to Datasets Tagging app in Spaces by @albertvillanova in #3194
- Improve repository structure docs by @lhoestq in #3233
- Swap descriptions of v1 and raw-v1 configs of WikiText dataset and fix metadata by @albertvillanova in #3241
- Add docs for audio processing by @stevhliu in #3222
- Add push_to_hub docs by @lhoestq in #3319
Additional improvements and bug fixes
- Catch token invalid error in CI by @lhoestq in #3200
- Pin keras version until TF fixes its release by @albertvillanova in #3208
- Fix disable_nullable default value to False by @lhoestq in #3211
- Fix code quality in riddle_sense dataset by @albertvillanova in #3218
- Better error msg if `len(predictions)` doesn't match `len(references)` in metrics by @mariosasko in #3160
- Use huggingface_hub.HfApi to list datasets/metrics by @mariosasko in #3121
- Pin version exclusion for tensorflow incompatible with keras by @albertvillanova in #3216
- Group tests in multiprocessing workers by test file by @albertvillanova in #3231
- Fix load_from_disk temporary directory by @lhoestq in #3245
- [tiny] fix typo in stream docs by @nollied in #3246
- Avoid PyArrow type optimization if it fails by @mariosasko in #3234
- Remove redundant isort module placement by @mariosasko in #3243
- asserts replaced by exception for text classification task with test. by @manisnesan in #3256
- Add os.listdir for streaming by @lhoestq in #3270
- asserts replaced with exception for image classification task, csv, json by @manisnesan in #3262
- Force data files extraction if download_mode='force_redownload' by @mariosasko in #3275
- Minor Typo Fix - Precision to Recall by @SebastinSanty in #3279
- Decode audio from remote by @lhoestq in #3271
- Fix build_docs CI by @lhoestq in #3286
- Allow datasets with indices table when concatenating along axis=1 by @mariosasko in #3288
- f-string formatting by @Mehdi2402 in #3277
- Unpin markdown for build_docs now that it's fixed by @lhoestq in #3289
- Pin version exclusion for Markdown by @albertvillanova in #3293
- Use f-strings in the dataset scripts by @Carlosbogo in #3291
- fix old_val typo in f-string by @Mehdi2402 in #3302
- asserts replaced with exception for `fingerprint.py`, `search.py`, `arrow_writer.py` and `metric.py` by @Ishan-Kumar2 in #3305
- fix: files counted twice in inferred structure by @borisdayma in #3309
- Finish transition to PyArrow 3.0.0 by @mariosasko in #3318
- Removing query params for dynamic URL caching by @anton-l in #3315
Citation
- Update BibTeX entry by @albertvillanova in #3223
- Fix paper BibTeX citation with proceedings reference by @albertvillanova in #3226
- Add CITATION file by @albertvillanova in #3228
- Fix URL in CITATION file by @albertvillanova in #3229
Deprecations
- Deprecate prepare_module by @albertvillanova in #3166
Full Changelog: 1.15.1...1.16.0
1.15.1
1.15.0
Dataset Changes
- Update: JNLBA - add tags names by @bhavitvyamalik in #3092
- Update: OpenSLR - add SLR83 to OpenSLR by @tyrius02 in #3125 and #3176
- Update: RONEC - update to v2 by @dumitrescustefan in #3184
- Fix: Arabic Billion Words - Fix script to return all data by @albertvillanova in #3136
- Fix: HLGD - fix label mapping by @VictorSanh in #3180
Dataset Features
- Allow dynamic first dimension for ArrayXD by @rpowalski in #2891
- add multi-proc in `to_csv` by @bhavitvyamalik in #2896
- QOL improvements: auto-flatten_indices and desc in map calls by @mariosasko in #3196
Dataset Cards
Metrics Changes
- New: metric for the MATH dataset (competition_math). by @hacobe in #3020
- New: Google BLEU (aka GLEU) metric by @slowwavesleep in #3108
- New: TER by @BramVanroy in #3153
- New: ChrF(++) by @BramVanroy in #3187
General improvements and bug fixes
- Correctly update metadata to preserve features when concatenating datasets with axis=1 by @mariosasko in #3120
- Fixes to `to_tf_dataset` by @Rocketknight1 in #3085
- Add security policy to the project by @albertvillanova in #2958
- Update doc links to point to new docs by @mariosasko in #3116
- Fix caching bugs by @mariosasko in #3141
- Fix numpy deprecation warning for ragged tensors by @lhoestq in #3137
- Fixed: duplicate parameter and missing parameter in docstring by @PanQiWei in #3157
- Fix some typos in the documentation by @h4iku in #3152
- Fix string encoding for Value type by @lhoestq in #3158
- Fix CLI test to ignore verifications when saving infos by @albertvillanova in #3147
- Make inspect.get_dataset_config_names always return a non-empty list by @albertvillanova in #3159
- Fix issue with filelock filename being too long on encrypted filesystems by @mariosasko in #3173
- Asserts replaced by exceptions (#3171) by @joseporiolayats in #3174
- Preserve ordering in `zip_dict` by @mariosasko in #3170
- Don't memoize strings when hashing since two identical strings may have different python ids by @lhoestq in #3182
- Re-add faiss to windows testing suite by @BramVanroy in #3151
- Add missing docstring to DownloadConfig by @mariosasko in #3183
- More efficient nested features encoding by @eladsegal in #3124
- Fix optimized encoding for arrays by @lhoestq in #3197
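The string memoization entry above (#3182) concerns a subtle property of pickling: the pickle memo is keyed by object identity, so two equal strings that are distinct objects serialize differently from one string referenced twice. The sketch below shows the effect with the standard library and one possible workaround. It relies on CPython's pure-Python `pickle._Pickler` (a private class), and `NoStrMemoPickler` is a name invented for this example, not the library's hasher.

```python
import io
import pickle


class NoStrMemoPickler(pickle._Pickler):
    """Pickler that never memoizes strings (illustrative sketch)."""

    def memoize(self, obj):
        # Skipping the memo makes every string be written out in full,
        # regardless of whether the same object was seen before.
        if isinstance(obj, str):
            return
        super().memoize(obj)


def dumps_no_str_memo(obj):
    buf = io.BytesIO()
    NoStrMemoPickler(buf, protocol=4).dump(obj)
    return buf.getvalue()


# Two equal strings built at runtime are distinct objects in CPython.
a = "".join(["hello ", "world"])
b = "".join(["hello ", "world"])

# The default pickler emits a memo reference for the repeated object only.
print(pickle.dumps([a, a], protocol=4) == pickle.dumps([a, b], protocol=4))
# Without string memoization both lists serialize to the same bytes.
print(dumps_no_str_memo([a, a]) == dumps_no_str_memo([a, b]))
```

A hash computed over the serialized bytes is stable for equal content only in the second case.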
1.14.0
Dataset changes
- Update: LexGLUE and MultiEURLEX README - update dataset cards #3075 (@iliaschalkidis)
- Update: SUPERB - use Audio features #3101 (@anton-l)
- Fix: Blog Authorship Corpus - fix URLs #3106 (@albertvillanova)
Dataset features
General improvements and bug fixes
- Replace FSTimeoutError with parent TimeoutError #3100 (@albertvillanova)
- Fix project description in PyPI #3103 (@albertvillanova)
- Align tqdm control with cache control #3031 (@mariosasko)
- Add paper BibTeX citation #3107 (@albertvillanova)