2.7.0
Dataset Features
- Multiprocessed dataset builder by @TevenLeScao in #5107
  - Load big datasets faster than before using multiprocessing:

        from datasets import load_dataset
        ds = load_dataset("imagenet-1k", num_proc=4)
- Make torch.Tensor and spacy models cacheable by @mariosasko in #5191
  - Function passed to `map` or `filter` that uses tensors or pipelines can now be cached
- Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in #5192
- TextConfig: added "errors" by @NightMachinery in #5155
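
The new "errors" option for text loading can be pictured with plain Python decoding. The sketch below is only an illustration: it assumes the option follows the semantics of Python's built-in `errors` argument for decoding, and it does not call the `datasets` library at all. The sample bytes and the helper name `decode_with` are made up for the example.

```python
# Bytes encoded as Latin-1; the trailing 0xE9 byte is not valid UTF-8.
raw = "café".encode("latin-1")


def decode_with(errors):
    """Decode the sample bytes as UTF-8 with the given error handler (hypothetical helper)."""
    return raw.decode("utf-8", errors=errors)


# "strict" (Python's default) raises on the invalid byte.
strict_failed = False
try:
    decode_with("strict")
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD for the undecodable byte.
replaced = decode_with("replace")

# "ignore" silently drops the undecodable byte.
ignored = decode_with("ignore")
```

Choosing "replace" or "ignore" trades exactness for robustness: the load no longer fails on a stray byte, but the affected characters are altered or lost.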
Audio setup
- Add ffmpeg4 installation instructions in warnings by @polinaeterna in #5167
Docs
- Update create image dataset docs by @stevhliu in #5177
- add: segmentation guide. by @sayakpaul in #5188
- Reword E2E training and inference tips in the vision guides by @sayakpaul in #5217
- Add SQL guide by @stevhliu in #5223
General improvements and bug fixes
- Add `pyproject.toml` for `black` by @mariosasko in #5125
- Fix `tqdm` zip bug by @david1542 in #5120
- Install tensorflow-macos dependency conditionally by @albertvillanova in #5124
- [TYPO] Update new_dataset_script.py by @cakiki in #5119
- Avoid extra cast in `class_encode_column` by @mariosasko in #5130
- Use yaml for issue templates + revamp by @mariosasko in #5116
- Update docs once dataset scripts transferred to the Hub by @albertvillanova in #5136
- Delete duplicate issue template file by @albertvillanova in #5146
- Deprecate num_proc parameter in DownloadManager.extract by @ayushthe1 in #5142
- Raise ImportError instead of OSError by @ayushthe1 in #5141
- Fix CI require beam by @albertvillanova in #5168
- Make iter_files deterministic by @albertvillanova in #5149
- Add PB and TB in convert_file_size_to_int by @lhoestq in #5171
- Reduce default max `writer_batch_size` by @mariosasko in #5163
- Support dill 0.3.6 by @albertvillanova in #5166
- Make filename matching more robust by @riccardobucco in #5128
- Preserve None in list type cast in PyArrow 10 by @mariosasko in #5174
- Raise ffmpeg warnings only once by @polinaeterna in #5173
- Add "ipykernel" to list of `co_filename`s to remove by @gpucce in #5169
- chore: add notebook links to img cls and obj det. by @sayakpaul in #5187
- Fix docs about dataset_info in YAML by @albertvillanova in #5194
- fsspec lock reset in multiprocessing by @lhoestq in #5159
- Add note about the name of a dataset script by @polinaeterna in #5198
- Deprecate dummy data generation command by @mariosasko in #5199
- Do not sort splits in dataset info by @polinaeterna in #5201
- Add missing `DownloadConfig.use_auth_token` value by @alvarobartt in #5205
- Update canonical links to Hub links by @stevhliu in #5203
- Refactor CI hub fixtures to use monkeypatch instead of patch by @albertvillanova in #5208
- Update github pr docs actions by @mishig25 in #5214
- Use hfh hf_hub_url function by @albertvillanova in #5196
- Pin `typer` version in tests to <0.5 to fix Windows CI by @polinaeterna in #5235
- Fix shards in IterableDataset.from_generator by @lhoestq in #5233
- Fix class name of symbolic link by @riccardobucco in #5126
- Make `Version` hashable by @mariosasko in #5238
- Handle ArrowNotImplementedError caused by try_type being Image or Audio in cast by @mariosasko in #5236
- Encode path only for old versions of hfh by @lhoestq in #5237
- Fix CI require_beam maximum compatible dill version by @albertvillanova in #5212
- Support hfh rc version by @lhoestq in #5241
- Cleaner error tracebacks for dataset script errors by @mariosasko in #5240
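
As an illustration of what the PB and TB entry in the list above refers to, here is a minimal sketch of turning a human-readable size string into a byte count. It is not the library's `convert_file_size_to_int` implementation: the function name `parse_size`, the unit table, and the accepted input format are assumptions made for this example.

```python
import re

# Decimal (SI) and binary (IEC) multipliers, including the TB and PB units.
_UNITS = {
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40, "PIB": 2**50,
}


def parse_size(size):
    """Return a size in bytes from an int or a string such as '5TB' (hypothetical helper)."""
    if isinstance(size, int):
        return size  # Already a byte count.
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", size)
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"Unrecognized size: {size!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.upper()]


five_tb = parse_size("5TB")
one_pib = parse_size("1PiB")
```

A lookup table keeps the unit handling in one place, so supporting a larger unit only means adding one row.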
New Contributors
- @david1542 made their first contribution in #5120
- @ayushthe1 made their first contribution in #5142
- @gpucce made their first contribution in #5169
- @sayakpaul made their first contribution in #5187
- @NightMachinery made their first contribution in #5155
Full Changelog: 2.6.1...2.7.0