Releases · huggingface/datasets

02 Feb 14:21

lhoestq

1.18.3

c6bc52a

1.18.3

Bug fixes

Fix MP3 resampling when a dataset's audio files have different sampling rates by @lhoestq in #3665
Extend dataset builder for streaming in get_dataset_split_names by @mariosasko in #3657

Dataset changes

New: Turkic X-WMT evaluation set for machine translation by @mirzakhalov in #3605
New: British Library books dataset by @davanstrien in #3603
Fix: wiki_bio - Update link by @jxmorris12 in #3651

Other improvements

Fix spelling: Columbia => Colombia by @serapio in #3652
Run pyupgrade for Python 3.6+ by @bryant1410 in #3560

New Contributors

@serapio made their first contribution in #3652
@mirzakhalov made their first contribution in #3605

Full Changelog: 1.18.2...1.18.3

Contributors

serapio, bryant1410, and 5 other contributors


28 Jan 16:55

lhoestq

1.18.2

ba00b25

1.18.2

Bug fixes

Fix streaming datasets that are not reset correctly by @lhoestq in #3646
Fix numpy rngs when shuffling with seed=None by @mariosasko in #3641
Fix dataset slicing with negative bounds when indices mapping is not None by @mariosasko in #3642
Fix add_column on datasets with indices mapping by @mariosasko in #3647
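
The slicing and add_column fixes above both concern datasets that carry an indices mapping (for example after a shuffle or a select). As a rough illustration only, and not the library's implementation, the sketch below uses a hypothetical `MappedView` class to show why negative bounds have to be resolved against the length of the mapping rather than the length of the underlying storage.

```python
class MappedView:
    """Hypothetical stand-in for a dataset that carries an indices mapping."""

    def __init__(self, rows, indices=None):
        self.rows = rows          # underlying storage, never reordered
        self.indices = indices    # optional list of positions into rows

    def __len__(self):
        return len(self.rows) if self.indices is None else len(self.indices)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # slice.indices() resolves negative bounds against the view length
            return [self[i] for i in range(*key.indices(len(self)))]
        if key < 0:
            key += len(self)
        if not 0 <= key < len(self):
            raise IndexError(key)
        pos = key if self.indices is None else self.indices[key]
        return self.rows[pos]


view = MappedView(["a", "b", "c", "d", "e"], indices=[4, 0, 2])
print(view[-2:])  # last two rows of the 3-row view, not of the 5-row storage
```

Resolving the bounds first and applying the mapping second keeps a negative slice consistent with what iterating over the view would return.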

Other improvements

Update index.rst by @VioletteLepercq in #3636
Fix Windows CI: bump python to 3.7 by @lhoestq in #3648

New Contributors

@VioletteLepercq made their first contribution in #3636

Full Changelog: 1.18.1...1.18.2

Contributors

lhoestq, mariosasko, and VioletteLepercq


26 Jan 14:23

lhoestq

1.18.1

218e496

1.18.1

Improvements

Make decoding of Audio and Image feature optional by @mariosasko in #3430

Bug fixes

Fix prepare_for_task() by @mariosasko in #3614
Fix: Multilingual Librispeech - fix bad url formatting by @polinaeterna in #3619

Full Changelog: 1.18.0...1.18.1

Contributors

polinaeterna and mariosasko


21 Jan 16:46

lhoestq

1.18.0

c0aea8d

1.18.0

Datasets Changes

New: VCTK
- Add VCTK dataset by @jaketae in #3351
- Fix VCTK encoding by @lhoestq in #3493
- Docs: Add VCTK dataset description by @jaketae in #3500
New: CPPE-5 dataset by @mariosasko in #3517
New: RedCaps dataset by @mariosasko in #3424
New: WIDER FACE dataset by @mariosasko in #3413
New: SVHN dataset by @mariosasko in #3535
New: BNL newspapers by @davanstrien in #3397
New: PASS dataset by @mariosasko in #3576
New: Text2log Dataset by @apergo-ai in #3579
Update: beans, cats_vs_dogs - Use iter_files instead of str(Path(...)) in image dataset by @mariosasko in #3477
Update: PIB - update version and make it streamable by @albertvillanova in #3496
Update: code_x_glue_tt_text_to_text, compguesswhat - Remove print statements in datasets by @mariosasko in #3546
Update: MuchoCine - add missing tasks by @mariosasko in #3571
Fix: Tashkeela - fix to yield stripped text by @albertvillanova in #3471
Fix: asset - change to raw.githubusercontent.com URLs by @VictorSanh in #3516
Fix: CC100 - use HTTPS for the data source URL by @aajanki in #3519
Fix: vision datasets - Fix bug in ImageClassification task template by @mariosasko in #3557
Fix: tweet_qa - fix DuplicatedKeysError and improve card by @mariosasko in #3559
Fix: mC4 - fix multiple language downloading by @polinaeterna in #3594
Fix: CoNLL2003:
- Use old url for conll2003 by @lhoestq in #3600
- Update url for conll2003 by @lhoestq in #3602
- Add conll2003 licensing by @lhoestq in #3601

Datasets Features

[Time series] Add support for time, date, duration, and decimal dtypes by @mariosasko in #3591
[Image][Audio] Add flexible casting for Image and Audio + Support nested casting by @lhoestq in #3575
Allows DatasetDict.filter to have batching option by @thomasw21 in #3506
Add desc parameter to filter by @mariosasko in #3513
Add gzip for to_json by @bhavitvyamalik in #3492
Allow multiple task templates of the same type by @mariosasko in #3562
Add parameter preserve_index to from_pandas by @Sorrow321 in #3565
Dataset Streaming:
- Fix str(Path(...)) conversion in streaming on Linux by @mariosasko in #3472
- Extend support for streaming datasets that use ET.parse by @albertvillanova in #3476
- Extend support for streaming datasets that use os.walk by @albertvillanova in #3478

Metrics Changes

Add Mauve metric by @jthickstun in #3573

Dataset cards

update pretty_name for first 200 datasets by @bhavitvyamalik in #3498
update pretty_name for all the other datasets by @bhavitvyamalik in #3536
pib: Update pib dataset card by @albertvillanova in #3501
arabic_speech_corpus: Adding link to license. by @meg-huggingface in #3524
Covost2: Update README.md by @meg-huggingface in #3528
librispeech_asr: Update README.md by @meg-huggingface in #3529
vivos: Update README.md by @meg-huggingface in #3530
audio datasets: Audio datacard update - first pass by @meg-huggingface in #3520
common_language: Update README.md by @meg-huggingface in #3527
wiki_dpr: Update wiki_dpr README.md by @lhoestq in #3534
qa4mre: Fix qa4mre tags by @lhoestq in #3574
HellaSwag: Update HellaSwag README.md by @borgr in #3588
ANLI: Update ANLI README.md by @borgr in #3590
tweet_eval: Update README.md by @borgr in #3593

Documentation

Fix rendering of docs by @albertvillanova in #3470
Fix to_tf_dataset references in docs by @mariosasko in #3514
added PII statements and license links to data cards by @mcmillanmajora in #3537
Readme usage update by @meg-huggingface in #3538
Update the CC-100 dataset card by @aajanki in #3542
Research wording for nc licenses by @meg-huggingface in #3539
Added links to licensing and PII message in vctk dataset by @mcmillanmajora in #3523
Give clearer instructions to add the YAML tags by @albertvillanova in #3532

General improvements and bug fixes

Fix overriding of filesystem info by @albertvillanova in #3481
Update ADD_NEW_DATASET.md by @apergo-ai in #3487
Fix weird spacing in ManualDownloadError message by @bryant1410 in #3486
Clone full repo to detect new tags when mirroring datasets on the Hub by @lhoestq in #3494
Remove unused phony rule from Makefile by @bryant1410 in #3483
fix: 🐛 pass token when retrieving the split names by @severo in #3545
Pin torchmetrics to fix the COMET test by @lhoestq in #3589
Preserve encoding/decoding with features in Iterable.map call by @mariosasko in #3556

New Contributors

@apergo-ai made their first contribution in #3487
@bryant1410 made their first contribution in #3486
@meg-huggingface made their first contribution in #3527
@aajanki made their first contribution in #3519
@Sorrow321 made their first contribution in #3565
@jthickstun made their first contribution in #3573
@borgr made their first contribution in #3588

Full Changelog: 1.17.0...1.18.0

Contributors

aajanki, severo, and 16 other contributors


21 Dec 17:41

lhoestq

1.17.0

dff6c92

1.17.0

Dataset Changes

New: The Pile
- Add The Pile dataset and PubMed Central subset by @albertvillanova in #3287
- Add The Pile Free Law subset by @albertvillanova in #3359
- Add The Pile USPTO subset by @albertvillanova in #3360
- Add The Pile subsets by @albertvillanova in #3378
- Add The Pile Enron Emails subset by @albertvillanova in #3427
New: British Library Books Genre by @davanstrien in #3312
New: Americas NLI by @fdschmidt93 in #3371
New: Speech commands by @polinaeterna in #3335
New: eli5_category by @jingshenSN2 in #3420
New: OneStopQa by @scaperex in #3436
Update: LABR - make the dataset streamable by @albertvillanova in #3352
Update: CLUE benchmark - update cluewsc2020, chid, c3 and tnews by @mariosasko in #3376
Update: beans, cats_vs_dogs, cifar10, cifar100, fashion_mnist, mnist, head_qa: use the new Image feature type + streaming support by @mariosasko in #3362
Update: CC100 - add Georgian data by @AnzorGozalishvili in #3383
Update: disaster_response_messages - update download urls (+ add validation split) by @mariosasko in #3426
Update: swahili_news - update to new version by @albertvillanova in #3463
Fix: WikiAuto, Jeopardy, definite_pronoun_resolution - fix URLs by @LashaO in #3266
Fix: QED - fix type of bridge field by @mariosasko in #3417
Fix: ASSET - fix dataset data URLs by @tianjianjiang in #3342

Dataset Features

Add Image feature by @mariosasko in #3163
to_tf_dataset() refactor by @Rocketknight1 in #3356
More robust None handling by @mariosasko in #3195
Add cast_column to IterableDataset by @mariosasko in #3439
Support streaming zipped dataset repo by passing only repo name by @albertvillanova in #3375
Extend support for streaming datasets that use pd.read_excel by @albertvillanova in #3355
Extend iter_archive to support file object input by @albertvillanova in #3443
Extend text to support yielding lines, paragraphs or documents by @albertvillanova in #3442
Push dataset_infos.json to Hub to preserve feature types by @lhoestq in #3467
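
The text-loading change above distinguishes three sampling granularities: lines, paragraphs and whole documents. The sketch below shows one way those three modes could behave. The function name `iter_samples`, the `sample_by` parameter as used here, and the blank-line paragraph delimiter are assumptions made for illustration, not a description of the library's code.

```python
def iter_samples(text, sample_by="line"):
    """Yield samples from text; sample_by is 'line', 'paragraph' or 'document'."""
    if sample_by == "line":
        yield from text.splitlines()
    elif sample_by == "paragraph":
        # Assumption: paragraphs are separated by one blank line.
        for paragraph in text.split("\n\n"):
            if paragraph.strip():
                yield paragraph
    elif sample_by == "document":
        yield text
    else:
        raise ValueError(f"unsupported sample_by: {sample_by!r}")


doc = "first line\nsecond line\n\nnew paragraph"
print(list(iter_samples(doc, "paragraph")))
```

Line mode keeps empty lines as samples in this sketch, while paragraph mode drops whitespace-only chunks.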

Dataset cards

Change TriviaQA license (#3313) by @avinashsai in #3330
Add missing tags to XTREME by @mariosasko in #3322
Remove duplicate name from dataset cards by @albertvillanova in #3354
Fix typos in dataset cards by @albertvillanova in #3386
Fix duplicated tag in wikicorpus dataset card by @lhoestq in #3458

Dataset Tasks

Create Language Modeling task by @albertvillanova in #3387

Metric Changes

BLEURT: Match key names to correspond with filename by @jaehlee in #3348
Fix links in metrics description by @albertvillanova in #3461
Fix METEOR missing NLTK's omw-1.4 by @lhoestq in #3469

Docs

Add ArrayXD docs by @stevhliu in #3344
Document a training loop for streaming dataset by @lhoestq in #3370
Fix formatting in IterableDataset.map docs by @mariosasko in #3395
Correctly indent builder config in dataset script docs by @mariosasko in #3432
Update BLEURT hyperlink by @lewtun in #3437

Additional improvements and bug fixes

Quick fix error formatting by @NouamaneTazi in #3328
Fix error message and add extension fallback by @mariosasko in #3332
Avoid content-encoding issue while streaming datasets by @albertvillanova in #3350
Fix JSON ClassLabel casting for integers by @lhoestq in #3340
Better error message when download fails by @lhoestq in #3343
Fix dict source_datasets tagset validator by @albertvillanova in #3368
Fix typo in other-structured-to-text task tag by @albertvillanova in #3367
Fix temporary dataset_path creation for URIs related to remote fs by @francisco-perez-sorrosal in #3296
Fix flaky test of the temporary directory used by load_from_disk by @lhoestq in #3388
More robust first elem check in encode/cast example by @mariosasko in #3402
Fix module inference for archive with a directory by @albertvillanova in #3406
Fix dependencies conflicts in Windows CI after conda update to 4.11 by @lhoestq in #3410
Pass new_fingerprint in multiprocessing by @lhoestq in #3409
Fix flaky test again for s3 serialization by @lhoestq in #3412
Skip None encoding (line deleted by accident in #3195) by @mariosasko in #3414
Clean squad dummy data by @lhoestq in #3428
#3337 Add typing overloads to Dataset.__getitem__ for mypy by @Dref360 in #3382
Make cast cacheable (again) on Windows by @mariosasko in #3429
Use max number of data files to infer module by @albertvillanova in #3407
Fix iter_archive generator by @albertvillanova in #3454
[Staging] Update dataset repos automatically on the Hub by @lhoestq in #3451
Update supported versions of Python in setup.py by @mariosasko in #3438
raise exception instead of using assertions. by @manisnesan in #3349

New Contributors

@avinashsai made their first contribution in #3330
@NouamaneTazi made their first contribution in #3328
@davanstrien made their first contribution in #3312
@francisco-perez-sorrosal made their first contribution in #3296
@LashaO made their first contribution in #3266
@fdschmidt93 made their first contribution in #3371
@polinaeterna made their first contribution in #3335
@AnzorGozalishvili made their first contribution in #3383
@tianjianjiang made their first contribution in #3342
@jingshenSN2 made their first contribution in #3420
@scaperex made their first contribution in #3436

Full Changelog: 1.16.1...1.17.0

Contributors

manisnesan, francisco-perez-sorrosal, and 18 other contributors


26 Nov 16:58

lhoestq

1.16.1

acca8f4

1.16.1

Bug fixes

Fix import datasets on python 3.10 by @lhoestq in #3326
Fix wrongly converted assert by @eliasws in #3323

Contributors

eliasws and lhoestq


26 Nov 14:22

lhoestq

1.16.0

d50f5f9

1.16.0

Datasets Changes

New: riddle_sense by @ziyiwu9494 in #3161
New: Multi-Lingual LibriSpeech by @patrickvonplaten in #3198
New: XCSR by @yangxqiao in #3074
New: CMU Hinglish DoG by @Ishan-Kumar2 in #3149
New: Multidoc2dial by @sivasankalpp in #3205
New: IndoNLI by @afaji in #3307
Update: DaNE - updated URL for download by @MalteHB in #3203
Update: xcopa - (fix checksum issues + add translated data) by @mariosasko in #3254
Update: tatoeba - update to v2021-07-22 by @KoichiYasuoka in #3225
Update: KILT - update metadata JSON by @albertvillanova in #3276
Update: Covost 2 - update download instructions by @patrickvonplaten in #3281
Update: Common Voice, OpenSLR, LibriSpeech ASR, Vivos - make several audio datasets streamable by @lhoestq in #3290
Fix: tuple_ie - fix download url by @mariosasko in #3213
Fix: id_newspapers_2018 - fix streaming by @lhoestq in #3249
Fix: bookcorpusopen - fix RAM usage by @lhoestq in #3280
Fix: Scielo - fix ConnectionError by @mariosasko in #3260
Fix: tatoeba - fix URLs for a subset of xtreme by @mariosasko in #3321

Datasets Features

Push to hub capabilities for Dataset and DatasetDict by @LysandreJik in #3098:
- Upload your dataset to the Hugging Face Hub with the push_to_hub() method
- See the push_to_hub documentation for details
200+ datasets now support streaming:
- Stream TAR-based dataset using iter_archive by @lhoestq in #3110
- Stream from Google Drive and other hosts by @lhoestq in #3248
- Support Audio feature in streaming mode by @albertvillanova in #3133
- Support Audio feature for TAR archives in sequential access by @albertvillanova in #3129
Resolve data_files by split name automatically by @lhoestq in #3221
- File names are used to decide which file goes into which split
- See the documentation for details
Filter method for batched=True by @thomasw21 in #3244
Adding with_rank arg to pass process rank to map by @TevenLeScao in #3314
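
Resolving data_files by split name, as described above, relies on file names to assign files to splits. The sketch below shows the general idea with a small keyword table. The keywords, the fallback to "train", and the helper name `infer_splits` are all assumptions for illustration, and the library's actual patterns may differ.

```python
import re

# Assumed keywords; the library's real patterns may differ.
SPLIT_KEYWORDS = {
    "train": ("train", "training"),
    "validation": ("validation", "valid", "dev"),
    "test": ("test", "testing"),
}


def infer_splits(file_names):
    """Group file names by the split keyword found in them (default: train)."""
    splits = {}
    for name in file_names:
        # Break the name into lowercase alphabetic tokens: "test_1.csv" -> test, csv
        tokens = re.split(r"[^a-z]+", name.lower())
        split = next(
            (s for s, words in SPLIT_KEYWORDS.items() if any(w in tokens for w in words)),
            "train",
        )
        splits.setdefault(split, []).append(name)
    return splits


print(infer_splits(["train.csv", "data/dev-00.json", "test_1.csv", "notes.txt"]))
```

Matching on whole tokens rather than substrings avoids treating a name such as "latest.csv" as a test file.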

Dataset Cards

Add full tagset to conll2003 README by @BramVanroy in #3230
Fix some contact information formats by @lhoestq in #3274
Add wikipedia tags by @lhoestq in #3301
Updating details of IRC disentanglement data by @jkkummerfeld in #3259

Metrics Changes

New: OpenAI's pass@k code evaluation metric by @lvwerra in #2916
Update: BLEURT - options to use updated bleurt checkpoints by @jaehlee in #3235
Update: CER - update to support latest release by @mariosasko in #3252
Update: WER - update to the documentation by @wooters in #3278
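
For readers unfamiliar with the pass@k metric added above, the sketch below computes the unbiased estimator commonly associated with it: the probability that at least one of k samples, drawn without replacement from n generated samples of which c pass, is correct. Whether the metric added in #2916 computes exactly this form is an assumption, and `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: samples that passed, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))
```

With k=1 the estimate reduces to the plain pass rate c/n, which is a quick sanity check for any implementation.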

Documentation

Add docs for to_tf_dataset by @stevhliu in #3175
Small updates to to_tf_dataset documentation by @Rocketknight1 in #3215
Update link to Datasets Tagging app in Spaces by @albertvillanova in #3194
Improve repository structure docs by @lhoestq in #3233
Swap descriptions of v1 and raw-v1 configs of WikiText dataset and fix metadata by @albertvillanova in #3241
Add docs for audio processing by @stevhliu in #3222
Add push_to_hub docs by @lhoestq in #3319

Additional improvements and bug fixes

Catch token invalid error in CI by @lhoestq in #3200
Pin keras version until TF fixes its release by @albertvillanova in #3208
Fix disable_nullable default value to False by @lhoestq in #3211
Fix code quality in riddle_sense dataset by @albertvillanova in #3218
Better error msg if len(predictions) doesn't match len(references) in metrics by @mariosasko in #3160
Use huggingface_hub.HfApi to list datasets/metrics by @mariosasko in #3121
Pin version exclusion for tensorflow incompatible with keras by @albertvillanova in #3216
Group tests in multiprocessing workers by test file by @albertvillanova in #3231
Fix load_from_disk temporary directory by @lhoestq in #3245
[tiny] fix typo in stream docs by @nollied in #3246
Avoid PyArrow type optimization if it fails by @mariosasko in #3234
Remove redundant isort module placement by @mariosasko in #3243
asserts replaced by exception for text classification task with test. by @manisnesan in #3256
Add os.listdir for streaming by @lhoestq in #3270
asserts replaced with exception for image classification task, csv, json by @manisnesan in #3262
Force data files extraction if download_mode='force_redownload' by @mariosasko in #3275
Minor Typo Fix - Precision to Recall by @SebastinSanty in #3279
Decode audio from remote by @lhoestq in #3271
Fix build_docs CI by @lhoestq in #3286
Allow datasets with indices table when concatenating along axis=1 by @mariosasko in #3288
f-string formatting by @Mehdi2402 in #3277
Unpin markdown for build_docs now that it's fixed by @lhoestq in #3289
Pin version exclusion for Markdown by @albertvillanova in #3293
Use f-strings in the dataset scripts by @Carlosbogo in #3291
fix old_val typo in f-string by @Mehdi2402 in #3302
asserts replaced with exception for fingerprint.py, search.py, arrow_writer.py and metric.py by @Ishan-Kumar2 in #3305
fix: files counted twice in inferred structure by @borisdayma in #3309
Finish transition to PyArrow 3.0.0 by @mariosasko in #3318
Removing query params for dynamic URL caching by @anton-l in #3315

Citation

Update BibTeX entry by @albertvillanova in #3223
Fix paper BibTeX citation with proceedings reference by @albertvillanova in #3226
Add CITATION file by @albertvillanova in #3228
Fix URL in CITATION file by @albertvillanova in #3229

Deprecations

Deprecate prepare_module by @albertvillanova in #3166

Full Changelog: 1.15.1...1.16.0

Contributors

manisnesan, borisdayma, and 26 other contributors


02 Nov 21:47

lhoestq

1.15.1

0181006

1.15.1

Dependencies

Bump huggingface_hub to 0.1.0 by @lhoestq in #3199

Contributors

lhoestq


02 Nov 21:22

lhoestq

1.15.0

dcaa3c0

1.15.0

Dataset Changes

Update: JNLBA - add tags names by @bhavitvyamalik in #3092
Update: OpenSLR - add SLR83 to OpenSLR by @tyrius02 in #3125 and #3176
Update: RONEC - update to v2 by @dumitrescustefan in #3184
Fix: Arabic Billion Words - Fix script to return all data by @albertvillanova in #3136
Fix: HLGD - fix label mapping by @VictorSanh in #3180

Dataset Features

Allow dynamic first dimension for ArrayXD by @rpowalski in #2891
add multi-proc in to_csv by @bhavitvyamalik in #2896
QOL improvements: auto-flatten_indices and desc in map calls by @mariosasko in #3196

Dataset Cards

Fill in dataset card for NCBI disease dataset by @edugp in #3115

Metrics Changes

New: metric for the MATH dataset (competition_math). by @hacobe in #3020
New: Google BLEU (aka GLEU) metric by @slowwavesleep in #3108
New: TER by @BramVanroy in #3153
New: ChrF(++) by @BramVanroy in #3187

General improvements and bug fixes

Correctly update metadata to preserve features when concatenating datasets with axis=1 by @mariosasko in #3120
Fixes to to_tf_dataset by @Rocketknight1 in #3085
Add security policy to the project by @albertvillanova in #2958
Update doc links to point to new docs by @mariosasko in #3116
Fix caching bugs by @mariosasko in #3141
Fix numpy deprecation warning for ragged tensors by @lhoestq in #3137
Fixed: duplicate parameter and missing parameter in docstring by @PanQiWei in #3157
Fix some typos in the documentation by @h4iku in #3152
Fix string encoding for Value type by @lhoestq in #3158
Fix CLI test to ignore verifications when saving infos by @albertvillanova in #3147
Make inspect.get_dataset_config_names always return a non-empty list by @albertvillanova in #3159
Fix issue with filelock filename being too long on encrypted filesystems by @mariosasko in #3173
Asserts replaced by exceptions (#3171) by @joseporiolayats in #3174
Preserve ordering in zip_dict by @mariosasko in #3170
Don't memoize strings when hashing since two identical strings may have different python ids by @lhoestq in #3182
Re-add faiss to windows testing suite by @BramVanroy in #3151
Add missing docstring to DownloadConfig by @mariosasko in #3183
More efficient nested features encoding by @eladsegal in #3124
Fix optimized encoding for arrays by @lhoestq in #3197
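
The string-memoization entry above (#3182) can be illustrated with the standard pickle module, assuming the library's hashing is built on pickle-style serialization, which the changelog implies but does not spell out. The pickler memoizes objects by identity, so two lists with equal contents can serialize to different bytes depending on whether their strings are the same object.

```python
import pickle

a = "hello world"
b = "".join(["hello ", "world"])  # equal content, but a separate object

# The pickler memoizes by object identity: the second `a` becomes a
# back-reference, while `b` is written out in full.
same_object_twice = pickle.dumps([a, a])
equal_objects = pickle.dumps([a, b])

print([a, a] == [a, b])                    # the values compare equal
print(same_object_twice == equal_objects)  # the serialized bytes do not
```

A hash computed over such bytes would then differ between runs that build equal strings in different ways, which is the kind of instability the fix targets.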

Contributors

BramVanroy, h4iku, and 15 other contributors


19 Oct 16:46

albertvillanova

1.14.0

ec82422

1.14.0

Dataset changes

Update: LexGLUE and MultiEURLEX README - update dataset cards #3075 (@iliaschalkidis)
Update: SUPERB - use Audio features #3101 (@anton-l)
Fix: Blog Authorship Corpus - fix URLs #3106 (@albertvillanova)

Dataset features

Add iter_archive #3066 (@lhoestq)

General improvements and bug fixes

Replace FSTimeoutError with parent TimeoutError #3100 (@albertvillanova)
Fix project description in PyPI #3103 (@albertvillanova)
Align tqdm control with cache control #3031 (@mariosasko)
Add paper BibTeX citation #3107 (@albertvillanova)

Contributors

iliaschalkidis, albertvillanova, and 3 other contributors

