Releases · huggingface/datasets

26 Jan 19:33

lhoestq

2.9.0

b5672a9

2.9.0

Datasets Features

Parallel implementation of to_tf_dataset() by @Rocketknight1 in #5377
- Pass num_workers= to .to_tf_dataset() to make your dataset faster with multiprocessing

Distributed support by @lhoestq in #5369

Split your dataset for each node for distributed training
It supports both Dataset and IterableDataset (e.g. in streaming mode)
See the documentation for more details

import os
from datasets.distributed import split_dataset_by_node

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)

Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in #5400
Tqdm progress bar for to_parquet by @zanussbaum in #5456
ZIP files support in iter_archive with better compression type check by @Mehdi2402 in #3379
Support other formats than uint8 for image arrays by @vigsterkr in #5365

Documentation

Depth estimation dataset guide by @sayakpaul in #5379
- see https://huggingface.co/docs/datasets/main/en/depth_estimation
Imagefolder docs: mention support of CSV and ZIP by @lhoestq in #5463
- see https://huggingface.co/docs/datasets/main/en/image_load#imagefolder
Update docs of S3 filesystem with async aiobotocore by @maheshpec in #5411
- see https://huggingface.co/docs/datasets/main/en/filesystems#amazon-s3

General improvements and bug fixes

Raise error if ClassLabel names is not python list by @freddyheppell in #5359
Temporarily pin pydantic test dependency by @albertvillanova in #5395
Unpin pydantic test dependency by @albertvillanova in #5397
Replace one letter import in docs by @MKhalusova in #5403
Fix Colab notebook link by @albertvillanova in #5392
Fix fs.open resource leaks by @tkukurin in #5358
Fix deprecation warning when use_auth_token passed to download_and_prepare by @albertvillanova in #5409
Fix streaming pandas.read_excel by @albertvillanova in #5372
ci: 🎡 remove two obsolete issue templates by @severo in #5420
Handle 0-dim tensors in cast_to_python_objects by @mariosasko in #5384
Fix CI by temporarily pinning apache-beam < 2.44.0 by @albertvillanova in #5429
Fix CI benchmarks by temporarily pinning Docker image version by @albertvillanova in #5432
Revert container image pin in CI benchmarks by @0x2b3bfa0 in #5436
Finish deprecating the fs argument by @dconathan in #5393
Update actions/checkout in CD Conda release by @albertvillanova in #5438
Fix RuntimeError: Sharding is ambiguous for this dataset by @albertvillanova in #5416
Fix documentation about batch samplers by @thomasw21 in #5440
Fix CI by temporarily pinning fsspec < 2023.1.0 by @albertvillanova in #5447
Support fsspec 2023.1.0 in CI by @albertvillanova in #5449
Update share tutorial by @stevhliu in #5443
Swap log messages for symbolic/hard links in tar extractor by @albertvillanova in #5452
Fix base directory while extracting insecure TAR files by @albertvillanova in #5453
Fix link in load_dataset docstring by @mariosasko in #5389
Document that removing all the columns returns an empty document and the num_row is lost by @thomasw21 in #5460
Concatenate on axis=1 with misaligned blocks by @lhoestq in #5462
Raise from disconnect error in xopen by @lhoestq in #5382
remove pathlib.Path with URIs by @jonny-cyberhaven in #5466
Remove deprecated shard_size arg from .push_to_hub() by @polinaeterna in #5469

New Contributors

@freddyheppell made their first contribution in #5359
@MKhalusova made their first contribution in #5403
@tkukurin made their first contribution in #5358
@0x2b3bfa0 made their first contribution in #5436
@maheshpec made their first contribution in #5411
@dconathan made their first contribution in #5393
@zanussbaum made their first contribution in #5456
@jonny-cyberhaven made their first contribution in #5466

Full Changelog: 2.8.0...2.9.0

Contributors

vigsterkr, tkukurin, and 17 other contributors

Assets 2

19 Dec 10:55

lhoestq

2.8.0

037c9b5

2.8.0

Important

Removed YAML integer keys from class_label metadata by @albertvillanova in #5277
- From now on, datasets pushed on the Hub and using ClassLabel will use a new YAML model to store the feature types
- The new model uses strings instead of integers for the ids in label name mapping (e.g. 0 -> "0"). This is due to the Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
- Old versions of datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.

Datasets Features

Fix methods using IterableDataset.map that lead to features=None by @alvarobartt in #5287
- Datasets in streaming mode now update their features after column renaming or removal
Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in #5239
- Use multiprocessing to load multiple files in parallel
Add features param to IterableDataset.map by @alvarobartt in #5311
Sharded save_to_disk + multiprocessing by @lhoestq in #5268
- Pass num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()
- Pass num_proc to use multiprocessing.
Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in #5252
Support torch dataloader without torch formatting for IterableDataset by @lhoestq in #5357
- You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:
```
from datasets import load_dataset
ds = load_dataset("c4", "en", streaming=True, split="train")
dataloader = DataLoader(ds, batch_size=32, num_workers=4)
```

Docs

Complete doc migration by @mishig25 in #5248

General improvements and bug fixes

typo by @WrRan in #5253
typo by @WrRan in #5254
remove an unused statement by @WrRan in #5257
fix wrong print by @WrRan in #5256
Fix max_shard_size docs by @lhoestq in #5267
Specify arguments as keywords in librosa.reshape to avoid future errors by @polinaeterna in #5266
Change release procedure to use only pull requests by @albertvillanova in #5250
Warn about checksums by @lhoestq in #5279
Tweak readme by @lhoestq in #5210
Save file name in embed_storage by @lhoestq in #5285
Use correct dataset type in from_generator docs by @mariosasko in #5307
Support streaming datasets with pathlib.Path.with_suffix by @albertvillanova in #5294
Fix xjoin for Windows pathnames by @albertvillanova in #5297
Fix xopen for Windows pathnames by @albertvillanova in #5299
Ci py3.10 by @lhoestq in #5065
Update Overview.ipynb google colab by @lhoestq in #5211
Support xPath for Windows pathnames by @albertvillanova in #5310
Fix description of streaming in the docs by @polinaeterna in #5313
Fix Text sample_by paragraph by @albertvillanova in #5319
[Extract] Place the lock file next to the destination directory by @lhoestq in #5320
Fix loading from HF GCP cache by @lhoestq in #5321
- This was affecting datasets like wikipedia or natural_questions
Fix docs building for main by @albertvillanova in #5328
Origin/fix missing features error by @eunseojo in #5318
fix: 🐛 pass the token to get the list of config names by @severo in #5333
Clarify imagefolder is for small datasets by @stevhliu in #5329
Close stream in ArrowWriter.finalize before inference error by @mariosasko in #5309
Use same num_proc for dataset download and generation by @mariosasko in #5300
Set IterableDataset.map param batch_size typing as optional by @alvarobartt in #5336
fix: dataset path should be absolute by @vigsterkr in #5234
Clean up DatasetInfo and Dataset docstrings by @stevhliu in #5340
Clean up docstrings by @stevhliu in #5334
Remove tasks.json by @lhoestq in #5341
Support topdown parameter in xwalk by @mariosasko in #5308
Improve use_auth_token docstring and deprecate use_auth_token in download_and_prepare by @mariosasko in #5302
Clean up Loading methods docstrings by @stevhliu in #5350
Clean up remaining Main Classes docstrings by @stevhliu in #5349
Clean up Dataset and DatasetDict by @stevhliu in #5344
Clean up Table class docstrings by @stevhliu in #5355
Raise error for .tar archives in the same way as for .tar.gz and .tgz in _get_extraction_protocol by @polinaeterna in #5322
Clean filesystem and logging docstrings by @stevhliu in #5356
ExamplesIterable fixes by @lhoestq in #5366
Simplify skipping by @Muennighoff in #5373
Release: 2.8.0 by @lhoestq in #5375

New Contributors

@WrRan made their first contribution in #5253
@eunseojo made their first contribution in #5318
@vigsterkr made their first contribution in #5234
@Muennighoff made their first contribution in #5373

Full Changelog: 2.7.0...2.8.0

Contributors

vigsterkr, severo, and 10 other contributors

Assets 2

22 Nov 17:27

albertvillanova

2.7.1

5ef1ab1

2.7.1

Bug fixes

Remove YAML integer keys from class_label metadata by @albertvillanova in #5277

Full Changelog: 2.7.0...2.7.1

Contributors

albertvillanova

Assets 2

22 Nov 17:49

albertvillanova

2.6.2

a6a5a1c

2.6.2

Bug fixes

Remove YAML integer keys from class_label metadata by @albertvillanova in #5277

Full Changelog: 2.6.1...2.6.2

Contributors

albertvillanova

Assets 2

16 Nov 10:11

albertvillanova

2.7.0

edf1902

2.7.0

Dataset Features

Multiprocessed dataset builder by @TevenLeScao in #5107
- Load big datasets faster than before using multiprocessing:
```
from datasets import load_dataset
ds = load_dataset("imagenet-1k", num_proc=4)
```
Make torch.Tensor and spacy models cacheable by @mariosasko in #5191
- Function passed to map or filter that uses tensors or pipelines can now be cached
Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in #5192
TextConfig: added "errors" by @NightMachinery in #5155

Audio setup

Add ffmpeg4 installation instructions in warnings by @polinaeterna in #5167

Docs

Update create image dataset docs by @stevhliu in #5177
add: segmentation guide. by @sayakpaul in #5188
Reword E2E training and inference tips in the vision guides by @sayakpaul in #5217
Add SQL guide by @stevhliu in #5223

General improvements and bug fixes

Add pyproject.toml for black by @mariosasko in #5125
Fix tqdm zip bug by @david1542 in #5120
Install tensorflow-macos dependency conditionally by @albertvillanova in #5124
[TYPO] Update new_dataset_script.py by @cakiki in #5119
Avoid extra cast in class_encode_column by @mariosasko in #5130
Use yaml for issue templates + revamp by @mariosasko in #5116
Update docs once dataset scripts transferred to the Hub by @albertvillanova in #5136
Delete duplicate issue template file by @albertvillanova in #5146
Deprecate num_proc parameter in DownloadManager.extract by @ayushthe1 in #5142
Raise ImportError instead of OSError by @ayushthe1 in #5141
Fix CI require beam by @albertvillanova in #5168
Make iter_files deterministic by @albertvillanova in #5149
Add PB and TB in convert_file_size_to_int by @lhoestq in #5171
Reduce default max writer_batch_size by @mariosasko in #5163
Support dill 0.3.6 by @albertvillanova in #5166
Make filename matching more robust by @riccardobucco in #5128
Preserve None in list type cast in PyArrow 10 by @mariosasko in #5174
Raise ffmpeg warnings only once by @polinaeterna in #5173
Add "ipykernel" to list of co_filenames to remove by @gpucce in #5169
chore: add notebook links to img cls and obj det. by @sayakpaul in #5187
Fix docs about dataset_info in YAML by @albertvillanova in #5194
fsspec lock reset in multiprocessing by @lhoestq in #5159
Add note about the name of a dataset script by @polinaeterna in #5198
Deprecate dummy data generation command by @mariosasko in #5199
Do not sort splits in dataset info by @polinaeterna in #5201
Add missing DownloadConfig.use_auth_token value by @alvarobartt in #5205
Update canonical links to Hub links by @stevhliu in #5203
Refactor CI hub fixtures to use monkeypatch instead of patch by @albertvillanova in #5208
Update github pr docs actions by @mishig25 in #5214
Use hfh hf_hub_url function by @albertvillanova in #5196
Pin typer version in tests to <0.5 to fix Windows CI by @polinaeterna in #5235
Fix shards in IterableDataset.from_generator by @lhoestq in #5233
Fix class name of symbolic link by @riccardobucco in #5126
Make Version hashable by @mariosasko in #5238
Handle ArrowNotImplementedError caused by try_type being Image or Audio in cast by @mariosasko in #5236
Encode path only for old versions of hfh by @lhoestq in #5237
Fix CI require_beam maximum compatible dill version by @albertvillanova in #5212
Support hfh rc version by @lhoestq in #5241
Cleaner error tracebacks for dataset script errors by @mariosasko in #5240

New Contributors

@david1542 made their first contribution in #5120
@ayushthe1 made their first contribution in #5142
@gpucce made their first contribution in #5169
@sayakpaul made their first contribution in #5187
@NightMachinery made their first contribution in #5155

Full Changelog: 2.6.1...2.7.0

Contributors

cakiki, albertvillanova, and 13 other contributors

Assets 2

14 Oct 15:45

lhoestq

2.6.1

1742cf1

2.6.1

Bug fixes

Fix filter indices when batched by @albertvillanova in #5113
- fixed a bug where filter could return examples with the wrong indices
Fix iter_batches by @lhoestq in #5115
- fixed a bug where map with batch=True could return a dataset with less examples
Fix a typo in arrow_dataset.py by @yangky11 in #5108

New Contributors

@yangky11 made their first contribution in #5108

Full Changelog: 2.6.0...2.6.1

Contributors

yangky11, albertvillanova, and lhoestq

Assets 2

13 Oct 11:00

lhoestq

2.6.0

dc3f72e

2.6.0

Important

[GH->HF] Remove all dataset scripts from github by @lhoestq in #4974
- all the dataset scripts and dataset cards are now on https://hf.co/datasets
- we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on

Datasets features

Add ability to read-write to SQL databases. by @Dref360 in #4928

Read from sqlite file:

from datasets import Dataset
dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")

Allow connection objects in from_sql + small doc improvement by @mariosasko in #5091

from datasets import Dataset
from sqlite3 import connect
con = connect(...)
dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)

Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in #5072

return numpy/torch/tf/jax tensors with

from datasets import load_dataset
ds = load_dataset("imagenet-1k").with_format("torch")  # or numpy/tf/jax
ds[0]["image"]

Added IterableDataset.from_generator by @hamid-vakilzadeh in #5052
Fast dataset iter by @mariosasko in #5030
- speed up by a factor of 2 using the Arrow Table reader
Dataset infos in yaml by @lhoestq in #4926
- you can now specify the feature types and number of samples in the dataset card, see https://huggingface.co/docs/datasets/dataset_card
Add kwargs to Dataset.from_generator by @mariosasko in #5049
Support converters in CsvBuilder by @mariosasko in #5057
Restore saved format state in load_from_disk by @asofiaoliveira in #5073

Dataset changes

Update: hendrycks_test - support streaming by @albertvillanova in #5041
Update: swiss judgment prediction by @JoelNiklaus in #5019
- Update swiss judgment prediction by @JoelNiklaus in #5042
Fix: xcsr - fix languages of X-CSQA configs by @albertvillanova in #5022
Fix: sbu_captions - fix URLs by @donglixp in #5020
Fix: xcsr - fix string features by @albertvillanova in #5024
Fix: hendrycks_test - fix NonMatchingChecksumError by @albertvillanova in #5040
Fix: cats_vs_dogs - fix number of samples by @lhoestq in #5047
Fix: lex_glue - fix bug with labels of eurlex config of lex_glue dataset by @iliaschalkidis in #5048
Fix: msr_sqa - fix dataset generation by @Timothyxxx in #3715

Dataset cards

Add description to hellaswag dataset by @julien-c in #4810
Add deprecation warning to multilingual_librispeech dataset card by @albertvillanova in #5010
Update languages in aeslc dataset card by @apergo-ai in #3357
Update license to bookcorpus dataset card by @meg-huggingface in #3526
Update paper link in medmcqa dataset card by @monk1337 in #4290
Add oversampling strategy iterable datasets interleave by @ylacombe in #5036
Fix license/citation information of squadshifts dataset card by @albertvillanova in #5054

General improvements and bug fixes

Fix missing use_auth_token in streaming docstrings by @albertvillanova in #5003
Add some note about running the transformers ci before a release by @lhoestq in #5007
Remove license tag file and validation by @albertvillanova in #5004
Re-apply input columns change by @mariosasko in #5008
patch CI_HUB_TOKEN_PATH with Path instead of str by @Wauplin in #5026
Fix typo in error message by @severo in #5027
Fix import in ClassLabel docstring example by @alvarobartt in #5029
Remove redundant code from some dataset module factories by @albertvillanova in #5033
Fix typos in load docstrings and comments by @albertvillanova in #5035
Prefer split patterns from directories over split patterns from filenames by @polinaeterna in #4985
Fix tar extraction vuln by @lhoestq in #5016
Support hfh 0.10 implicit auth by @lhoestq in #5031
Fix flatten_indices with empty indices mapping by @mariosasko in #5043
Improve CI performance speed of PackagedDatasetTest by @albertvillanova in #5037
Revert task removal in folder-based builders by @mariosasko in #5051
Fix backward compatibility for dataset_infos.json by @lhoestq in #5055
Fix typo by @stevhliu in #5059
Fix CI hfh token warning by @albertvillanova in #5062
Mark CI tests as xfail when 502 error by @albertvillanova in #5058
Fix passed download_config in HubDatasetModuleFactoryWithoutScript by @albertvillanova in #5077
Fix CONTRIBUTING once dataset scripts transferred to Hub by @albertvillanova in #5067
Fix header level in Audio docs by @stevhliu in #5078
Support DEFAULT_CONFIG_NAME when no BUILDER_CONFIGS by @albertvillanova in #5071
Support streaming gzip.open by @albertvillanova in #5066
adding keep in memory by @Mustapha-AJEGHRIR in #5082
refactor: replace AssertionError with more meaningful exceptions (#5074) by @galbwe in #5079
fix: update exception throw from OSError to EnvironmentError in `push… by @rahulXs in #5076
Align signature of list_repo_files with latest hfh by @albertvillanova in #5063
Align signature of create/delete_repo with latest hfh by @albertvillanova in #5064
Fix filter with empty indices by @Mouhanedg56 in #5087
Fix tutorial (#5093) by @riccardobucco in #5095
Use HTML relative paths for tiles in the docs by @lewtun in #5092
Fix loading how to guide (#5102) by @riccardobucco in #5104
url encode hub url (#5099) by @riccardobucco in #5103
Free the "hf" filesystem protocol for hffs by @lhoestq in #5101
Fix task template reload from dict by @lhoestq in #5106

New Contributors

@Wauplin made their first contribution in #5026
@donglixp made their first contribution in #5020
@Timothyxxx made their first contribution in #3715
@hamid-vakilzadeh made their first contribution in #5052
@Mustapha-AJEGHRIR made their first contribution in #5082
@galbwe made their first contribution in #5079
@rahulXs made their first contribution in #5076
@Mouhanedg56 made their first contribution in #5087
@riccardobucco made their first contribution in #5095
@asofiaoliveira made their first contribution in #5073

Full Changelog: 2.5.1...2.6.0

Contributors

julien-c, donglixp, and 24 other contributors

Assets 2

05 Oct 10:17

lhoestq

2.5.2

c59cc34

2.5.2

Bug fixes

Revert task removal in folder-based builders (#5051)
Support hfh 0.10 implicit auth (#5031)

Full Changelog: 2.5.1...2.5.2

Assets 2

21 Sep 15:17

lhoestq

2.5.1

0c84b71

2.5.1

Bug fixes

Revert input_columns change by @lhoestq in #5006

Full Changelog: 2.5.0...2.5.1

Contributors

lhoestq

Assets 2

21 Sep 13:14

lhoestq

2.5.0

6fc30c1

2.5.0

Important

Drop Python 3.6 support by @mariosasko in #4460
Deprecate metrics by @albertvillanova in #4739
- Metrics are now deprecated and have been moved to evaluate:
```
!pip install evaluate
import evaluate
metric = evaluate.load("accuracy")
```
Load GitHub datasets from Hub by @albertvillanova in #4059
- datasets with no namespace like "squad" were loaded from this GitHub repository, now they're loaded from https://huggingface.co/datasets
Decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in #4923
- latest version of torchaudio 0.12 now requires ffmpeg (version 4) to read MP3 files, please downgrade to 0.12 for now or use librosa
Use HTTP requests to access data and metadata through the Datasets REST API (docs here)

Datasets features

No-code loaders

Add AudioFolder packaged loader by @polinaeterna in #4530
Add support for CSV metadata files to ImageFolder by @mariosasko in #4837
Add support for parsing JSON files in array form by @mariosasko in #4997

Dataset methods

add Dataset.from_list by @sanderland in #4890
Add Dataset.from_generator by @mariosasko in #4957
Add oversampling strategies to interleave datasets by @ylacombe in #4831
Preserve non-input_colums in Dataset.map if input_columns are specified by @mariosasko in #4971
Add fn_kwargs param to IterableDataset.map by @mariosasko in #4975
More rigorous shape inference in to_tf_dataset by @Rocketknight1 in #4763

Parquet support

Download and prepare as Parquet for cloud storage by @lhoestq in #4724
Shard parquet in download_and_prepare by @lhoestq in #4747
Embed image/audio data in dl_and_prepare parquet by @lhoestq in #4987

Datasets changes

Update: natural questions - Add long answer candidates by @seirasto in #4368
Update: opus_paracrawl - update version by @albertvillanova in #4816
Update: ReCoRD - Include entity positions as feature by @richarddwang in #4479
Update: swda - Support streaming by @albertvillanova in #4914
Update: Enwik8 - update broken link and information by @mtanghu in #4
Update: compguesswhat - Support streaming by @albertvillanova in #4968
Update: nli_tr - Support streaming by @albertvillanova in #4970
Update: IndicGLUE - update download links by @sumanthd17 in #4978
Update: iwslt2017 - Support streaming by @albertvillanova in #4992
Fix: mbpp - fix NonMatchingChecksumError by @albertvillanova in #4788
Fix: mkqa - Update data URL by @albertvillanova in #4823
Fix: exams - fix bug and checksums by @albertvillanova in #4853
Fix: trec - use fine classes by @albertvillanova in #4801
Fix: wmt datasets - fix CWMT zh subsets by @lhoestq in #4871
Fix: LibriSpeech - Fix dev split local_extracted_archive for 'all' config by @sanchit-gandhi in #4904
Fix: compguesswhat - fix data URLs by @albertvillanova in #4959
Fix: vivos - fix data URL and metadata by @albertvillanova in #4969
Fix: MBPP - Add splits by @cwarny in #4943

Dataset cards

Add language_bcp47 tag by @lhoestq in #4753
Added more information in the README about contributors of the Arabic Speech Corpus by @nawarhalabi in #4701
Remove "unkown" language tags by @lhoestq in #4754
Highlight non-commercial license in amazon_reviews_multi dataset card by @sbroadhurst-hf in #4712
Added dataset information in clinic oos dataset card by @arnav-ladkat in #4751
Fix opus_gnome dataset card by @gojiteji in #4806
Complete the mlqa dataset card by @eldhoittangeorge in #4809
Fix loading example in opus dataset cards by @albertvillanova in #4813
Add missing language tags to resources by @albertvillanova in #4819
Fix titles in dataset cards by @albertvillanova in #4824
Fix language tags in dataset cards by @albertvillanova in #4826
Add license metadata to pg19 by @julien-c in #4827
Fix task tags in dataset cards by @albertvillanova in #4830
Fix tags in dataset cards by @albertvillanova in #4832
Fix missing tags in dataset cards by @albertvillanova in #4833
Fix documentation card of recipe_nlg dataset by @albertvillanova in #4834
Fix documentation card of ethos dataset by @albertvillanova in #4835
Update documentation card of miam dataset by @PierreColombo in #4846
Update stackexchange license by @cakiki in #4842
Update ted_talks_iwslt license to include ND by @cakiki in #4841
Fix documentation card of adv_glue dataset by @albertvillanova in #4838
Complete tags of superglue dataset card by @richarddwang in https://github.com/huggingface/datasets/pull/48674869
Fix license tag and Source Data section in billsum dataset card by @kashif in #4851
Fix documentation card of covid_qa_castorini dataset by @albertvillanova in #4877
Fix Citation Information section in dataset cards by @albertvillanova in #4879
Fix documentation card of math_qa dataset by @albertvillanova in #4884
Added names of less-studied languages by @BenjaminGalliot in #4880
Fix language tags resource file by @albertvillanova in #4882
Add citation to ro_sts and ro_sts_parallel datasets by @albertvillanova in #4892
Add citation information to makhzan dataset by @albertvillanova in #4894
Fix missing tags in dataset cards by @albertvillanova in #4891
Fix missing tags in dataset cards by @albertvillanova in #4896
Re-add code and und language tags by @albertvillanova in #4899
Add "cc-by-nc-sa-2.0" to list of licenses by @osanseviero in https://github.com/huggingface/datasets/pull/48874903
Update GLUE evaluation metadata by @lewtun in #4909
Fix missing tags in dataset cards by @albertvillanova in #4908
Add license and citation information to cosmos_qa dataset by @albertvillanova in #4913
Fix missing tags in dataset cards by @albertvillanova in #4921
Add cc-by-nc-2.0 to list of licenses by @albertvillanova in #4930
Fix missing tags in dataset cards by @albertvillanova in #4931
Add Papers with Code ID to scifact dataset by @albertvillanova in #4941
Fix license information in qasc dataset card by @albertvillanova in #4951
Fix multilinguality tag and missing sections in xquad_r dataset card by @albertvillanova in #4940
Fix missing tags in dataset cards by @albertvillanova in #4979
Fix missing tags in dataset cards by @albertvillanova in #4991