Release 2.8.0 · huggingface/datasets

Important

Removed YAML integer keys from class_label metadata by @albertvillanova in #5277
- From now on, datasets pushed to the Hub that use ClassLabel will use a new YAML model to store the feature types
- The new model uses strings instead of integers for the ids in the label name mapping (e.g. 0 -> "0"). This is due to Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
- Old versions of datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.
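
To illustrate the id change described above, here is a minimal sketch of the key conversion (integer ids becoming string ids). The helper name `stringify_label_ids` and the sample label names are hypothetical and only serve the illustration; this is not the code that datasets itself uses to write the YAML metadata.

```python
from typing import Dict


def stringify_label_ids(names: Dict[int, str]) -> Dict[str, str]:
    """Convert an integer-keyed id -> label-name mapping to string keys.

    Hypothetical helper: it only mirrors the 0 -> "0" change described
    in the notes above, not the library's internal implementation.
    """
    return {str(label_id): name for label_id, name in names.items()}


# Old-style mapping with integer ids (sample label names are made up)
old_style = {0: "negative", 1: "positive"}

# New-style mapping with string ids
new_style = stringify_label_ids(old_style)
print(new_style)  # {'0': 'negative', '1': 'positive'}
```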

Datasets Features

Fix methods using IterableDataset.map that lead to features=None by @alvarobartt in #5287
- Datasets in streaming mode now update their features after column renaming or removal
Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in #5239
- Use multiprocessing to load multiple files in parallel
Add features param to IterableDataset.map by @alvarobartt in #5311
Sharded save_to_disk + multiprocessing by @lhoestq in #5268
- Pass num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()
- Pass num_proc to use multiprocessing.
Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in #5252
Support torch dataloader without torch formatting for IterableDataset by @lhoestq in #5357
- You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:
```
from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream the dataset instead of downloading it entirely
ds = load_dataset("c4", "en", streaming=True, split="train")
dataloader = DataLoader(ds, batch_size=32, num_workers=4)
```
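
The `num_proc` loading option (#5239) and the sharded `save_to_disk` (#5268) listed above can be combined, as in the following sketch. The function name `load_and_save_sharded`, its default values, and the file paths in the usage note are assumptions made for illustration; only `Dataset.from_csv(..., num_proc=...)` and `save_to_disk(..., num_shards=..., num_proc=...)` come from the notes above.

```python
from typing import List


def load_and_save_sharded(
    csv_paths: List[str],
    out_dir: str,
    num_proc: int = 2,
    num_shards: int = 4,
):
    """Load several CSV files in parallel, then save the result as shards.

    Hypothetical helper; requires datasets >= 2.8.0 when called.
    """
    # Imported lazily so the function can be defined without `datasets` installed
    from datasets import Dataset

    # num_proc: use multiprocessing to load multiple files in parallel
    ds = Dataset.from_csv(csv_paths, num_proc=num_proc)

    # num_shards: number of files to write; num_proc: write with multiprocessing
    ds.save_to_disk(out_dir, num_shards=num_shards, num_proc=num_proc)
    return ds
```

A call could look like `load_and_save_sharded(["part1.csv", "part2.csv"], "my_dataset")`, where both file names and the output directory are placeholders. Passing `max_shard_size` instead of `num_shards` to `save_to_disk` is the alternative mentioned in the notes.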

Docs

Complete doc migration by @mishig25 in #5248

General improvements and bug fixes

typo by @WrRan in #5253
typo by @WrRan in #5254
remove an unused statement by @WrRan in #5257
fix wrong print by @WrRan in #5256
Fix max_shard_size docs by @lhoestq in #5267
Specify arguments as keywords in librosa.reshape to avoid future errors by @polinaeterna in #5266
Change release procedure to use only pull requests by @albertvillanova in #5250
Warn about checksums by @lhoestq in #5279
Tweak readme by @lhoestq in #5210
Save file name in embed_storage by @lhoestq in #5285
Use correct dataset type in from_generator docs by @mariosasko in #5307
Support streaming datasets with pathlib.Path.with_suffix by @albertvillanova in #5294
Fix xjoin for Windows pathnames by @albertvillanova in #5297
Fix xopen for Windows pathnames by @albertvillanova in #5299
Ci py3.10 by @lhoestq in #5065
Update Overview.ipynb google colab by @lhoestq in #5211
Support xPath for Windows pathnames by @albertvillanova in #5310
Fix description of streaming in the docs by @polinaeterna in #5313
Fix Text sample_by paragraph by @albertvillanova in #5319
[Extract] Place the lock file next to the destination directory by @lhoestq in #5320
Fix loading from HF GCP cache by @lhoestq in #5321
- This was affecting datasets like wikipedia or natural_questions
Fix docs building for main by @albertvillanova in #5328
Origin/fix missing features error by @eunseojo in #5318
fix: 🐛 pass the token to get the list of config names by @severo in #5333
Clarify imagefolder is for small datasets by @stevhliu in #5329
Close stream in ArrowWriter.finalize before inference error by @mariosasko in #5309
Use same num_proc for dataset download and generation by @mariosasko in #5300
Set IterableDataset.map param batch_size typing as optional by @alvarobartt in #5336
fix: dataset path should be absolute by @vigsterkr in #5234
Clean up DatasetInfo and Dataset docstrings by @stevhliu in #5340
Clean up docstrings by @stevhliu in #5334
Remove tasks.json by @lhoestq in #5341
Support topdown parameter in xwalk by @mariosasko in #5308
Improve use_auth_token docstring and deprecate use_auth_token in download_and_prepare by @mariosasko in #5302
Clean up Loading methods docstrings by @stevhliu in #5350
Clean up remaining Main Classes docstrings by @stevhliu in #5349
Clean up Dataset and DatasetDict by @stevhliu in #5344
Clean up Table class docstrings by @stevhliu in #5355
Raise error for .tar archives in the same way as for .tar.gz and .tgz in _get_extraction_protocol by @polinaeterna in #5322
Clean filesystem and logging docstrings by @stevhliu in #5356
ExamplesIterable fixes by @lhoestq in #5366
Simplify skipping by @Muennighoff in #5373
Release: 2.8.0 by @lhoestq in #5375

New Contributors

@WrRan made their first contribution in #5253
@eunseojo made their first contribution in #5318
@vigsterkr made their first contribution in #5234
@Muennighoff made their first contribution in #5373

Full Changelog: 2.7.0...2.8.0
