Releases · huggingface/datasets
2.9.0
Datasets Features
- Parallel implementation of to_tf_dataset() by @Rocketknight1 in #5377
- Pass num_workers= to .to_tf_dataset() to make your dataset faster with multiprocessing
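- A minimal sketch of the new argument (TensorFlow must be installed; the tiny in-memory dataset and column names below are illustrative, not from the release notes):

```python
from datasets import Dataset

# hypothetical numeric dataset, just to show the new num_workers argument
ds = Dataset.from_dict({"x": [[1.0, 2.0], [3.0, 4.0]], "label": [0, 1]})

tf_ds = ds.to_tf_dataset(
    columns=["x"],
    label_cols=["label"],
    batch_size=2,
    shuffle=True,
    num_workers=2,  # prepare batches in multiple worker processes
)
```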
- Distributed support by @lhoestq in #5369
- Split your dataset for each node for distributed training
- It supports both Dataset and IterableDataset (e.g. in streaming mode) - see the documentation for more details

```python
import os
from datasets.distributed import split_dataset_by_node

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
```
- Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in #5400
- Tqdm progress bar for to_parquet by @zanussbaum in #5456
- ZIP files support in iter_archive with better compression type check by @Mehdi2402 in #3379
- Support other formats than uint8 for image arrays by @vigsterkr in #5365
Documentation
- Depth estimation dataset guide by @sayakpaul in #5379
- Imagefolder docs: mention support of CSV and ZIP by @lhoestq in #5463
- Update docs of S3 filesystem with async aiobotocore by @maheshpec in #5411
General improvements and bug fixes
- Raise error if ClassLabel names is not python list by @freddyheppell in #5359
- Temporarily pin pydantic test dependency by @albertvillanova in #5395
- Unpin pydantic test dependency by @albertvillanova in #5397
- Replace one letter import in docs by @MKhalusova in #5403
- Fix Colab notebook link by @albertvillanova in #5392
- Fix fs.open resource leaks by @tkukurin in #5358
- Fix deprecation warning when use_auth_token passed to download_and_prepare by @albertvillanova in #5409
- Fix streaming pandas.read_excel by @albertvillanova in #5372
- ci: 🎡 remove two obsolete issue templates by @severo in #5420
- Handle 0-dim tensors in cast_to_python_objects by @mariosasko in #5384
- Fix CI by temporarily pinning apache-beam < 2.44.0 by @albertvillanova in #5429
- Fix CI benchmarks by temporarily pinning Docker image version by @albertvillanova in #5432
- Revert container image pin in CI benchmarks by @0x2b3bfa0 in #5436
- Finish deprecating the fs argument by @dconathan in #5393
- Update actions/checkout in CD Conda release by @albertvillanova in #5438
- Fix RuntimeError: Sharding is ambiguous for this dataset by @albertvillanova in #5416
- Fix documentation about batch samplers by @thomasw21 in #5440
- Fix CI by temporarily pinning fsspec < 2023.1.0 by @albertvillanova in #5447
- Support fsspec 2023.1.0 in CI by @albertvillanova in #5449
- Update share tutorial by @stevhliu in #5443
- Swap log messages for symbolic/hard links in tar extractor by @albertvillanova in #5452
- Fix base directory while extracting insecure TAR files by @albertvillanova in #5453
- Fix link in load_dataset docstring by @mariosasko in #5389
- Document that removing all the columns returns an empty document and the num_row is lost by @thomasw21 in #5460
- Concatenate on axis=1 with misaligned blocks by @lhoestq in #5462
- Raise from disconnect error in xopen by @lhoestq in #5382
- remove pathlib.Path with URIs by @jonny-cyberhaven in #5466
- Remove deprecated shard_size arg from .push_to_hub() by @polinaeterna in #5469
New Contributors
- @freddyheppell made their first contribution in #5359
- @MKhalusova made their first contribution in #5403
- @tkukurin made their first contribution in #5358
- @0x2b3bfa0 made their first contribution in #5436
- @maheshpec made their first contribution in #5411
- @dconathan made their first contribution in #5393
- @zanussbaum made their first contribution in #5456
- @jonny-cyberhaven made their first contribution in #5466
Full Changelog: 2.8.0...2.9.0
2.8.0
Important
- Removed YAML integer keys from class_label metadata by @albertvillanova in #5277
- From now on, datasets pushed on the Hub and using ClassLabel will use a new YAML model to store the feature types
- The new model uses strings instead of integers for the ids in label name mapping (e.g. 0 -> "0"). This is due to the Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
- Old versions of datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.
Datasets Features
- Fix methods using IterableDataset.map that lead to features=None by @alvarobartt in #5287
- Datasets in streaming mode now update their features after column renaming or removal
- Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in #5239
- Use multiprocessing to load multiple files in parallel
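- A minimal sketch (the file paths below are illustrative): num_proc splits the input files across processes

```python
from datasets import Dataset

# hypothetical local JSON Lines files loaded with two processes
ds = Dataset.from_json(["data/part-0.jsonl", "data/part-1.jsonl"], num_proc=2)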
- Add features param to IterableDataset.map by @alvarobartt in #5311
- Sharded save_to_disk + multiprocessing by @lhoestq in #5268
- Pass num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()
- Pass num_proc to use multiprocessing.
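- A minimal sketch, assuming an existing ds and illustrative output path/repo id:

```python
# shard the dataset into 8 files and write them with 8 processes
ds.save_to_disk("path/to/dataset_dir", num_shards=8, num_proc=8)

# or cap the shard size when pushing to the Hub
ds.push_to_hub("username/my_dataset", max_shard_size="500MB")
```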
- Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in #5252
- Support torch dataloader without torch formatting for IterableDataset by @lhoestq in #5357
- You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:
```python
from torch.utils.data import DataLoader
from datasets import load_dataset

ds = load_dataset("c4", "en", streaming=True, split="train")
dataloader = DataLoader(ds, batch_size=32, num_workers=4)
```
Docs
General improvements and bug fixes
- typo by @WrRan in #5253
- typo by @WrRan in #5254
- remove an unused statement by @WrRan in #5257
- fix wrong print by @WrRan in #5256
- Fix max_shard_size docs by @lhoestq in #5267
- Specify arguments as keywords in librosa.reshape to avoid future errors by @polinaeterna in #5266
- Change release procedure to use only pull requests by @albertvillanova in #5250
- Warn about checksums by @lhoestq in #5279
- Tweak readme by @lhoestq in #5210
- Save file name in embed_storage by @lhoestq in #5285
- Use correct dataset type in from_generator docs by @mariosasko in #5307
- Support streaming datasets with pathlib.Path.with_suffix by @albertvillanova in #5294
- Fix xjoin for Windows pathnames by @albertvillanova in #5297
- Fix xopen for Windows pathnames by @albertvillanova in #5299
- Ci py3.10 by @lhoestq in #5065
- Update Overview.ipynb google colab by @lhoestq in #5211
- Support xPath for Windows pathnames by @albertvillanova in #5310
- Fix description of streaming in the docs by @polinaeterna in #5313
- Fix Text sample_by paragraph by @albertvillanova in #5319
- [Extract] Place the lock file next to the destination directory by @lhoestq in #5320
- Fix loading from HF GCP cache by @lhoestq in #5321
- This was affecting datasets like wikipedia or natural_questions
- Fix docs building for main by @albertvillanova in #5328
- Origin/fix missing features error by @eunseojo in #5318
- fix: 🐛 pass the token to get the list of config names by @severo in #5333
- Clarify imagefolder is for small datasets by @stevhliu in #5329
- Close stream in ArrowWriter.finalize before inference error by @mariosasko in #5309
- Use same num_proc for dataset download and generation by @mariosasko in #5300
- Set IterableDataset.map param batch_size typing as optional by @alvarobartt in #5336
- fix: dataset path should be absolute by @vigsterkr in #5234
- Clean up DatasetInfo and Dataset docstrings by @stevhliu in #5340
- Clean up docstrings by @stevhliu in #5334
- Remove tasks.json by @lhoestq in #5341
- Support topdown parameter in xwalk by @mariosasko in #5308
- Improve use_auth_token docstring and deprecate use_auth_token in download_and_prepare by @mariosasko in #5302
- Clean up Loading methods docstrings by @stevhliu in #5350
- Clean up remaining Main Classes docstrings by @stevhliu in #5349
- Clean up Dataset and DatasetDict by @stevhliu in #5344
- Clean up Table class docstrings by @stevhliu in #5355
- Raise error for .tar archives in the same way as for .tar.gz and .tgz in _get_extraction_protocol by @polinaeterna in #5322
- Clean filesystem and logging docstrings by @stevhliu in #5356
- ExamplesIterable fixes by @lhoestq in #5366
- Simplify skipping by @Muennighoff in #5373
- Release: 2.8.0 by @lhoestq in #5375
New Contributors
- @WrRan made their first contribution in #5253
- @eunseojo made their first contribution in #5318
- @vigsterkr made their first contribution in #5234
- @Muennighoff made their first contribution in #5373
Full Changelog: 2.7.0...2.8.0
2.7.1
Bug fixes
- Remove YAML integer keys from class_label metadata by @albertvillanova in #5277
Full Changelog: 2.7.0...2.7.1
2.6.2
Bug fixes
- Remove YAML integer keys from class_label metadata by @albertvillanova in #5277
Full Changelog: 2.6.1...2.6.2
2.7.0
Dataset Features
- Multiprocessed dataset builder by @TevenLeScao in #5107
- Load big datasets faster than before using multiprocessing:
```python
from datasets import load_dataset

ds = load_dataset("imagenet-1k", num_proc=4)
```
- Make torch.Tensor and spacy models cacheable by @mariosasko in #5191
- Function passed to map or filter that uses tensors or pipelines can now be cached
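- A minimal sketch of a now-cacheable transform (the in-memory dataset and column names are illustrative):

```python
import torch
from datasets import Dataset

scale = torch.tensor(2.0)  # a tensor captured by the mapped function can now be hashed for the cache

ds = Dataset.from_dict({"value": [1, 2, 3]})
ds = ds.map(lambda ex: {"scaled": float(ex["value"] * scale)})
```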
- Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in #5192
- TextConfig: added "errors" by @NightMachinery in #5155
Audio setup
- Add ffmpeg4 installation instructions in warnings by @polinaeterna in #5167
Docs
- Update create image dataset docs by @stevhliu in #5177
- add: segmentation guide. by @sayakpaul in #5188
- Reword E2E training and inference tips in the vision guides by @sayakpaul in #5217
- Add SQL guide by @stevhliu in #5223
General improvements and bug fixes
- Add pyproject.toml for black by @mariosasko in #5125
- Fix tqdm zip bug by @david1542 in #5120
- Install tensorflow-macos dependency conditionally by @albertvillanova in #5124
- [TYPO] Update new_dataset_script.py by @cakiki in #5119
- Avoid extra cast in class_encode_column by @mariosasko in #5130
- Use yaml for issue templates + revamp by @mariosasko in #5116
- Update docs once dataset scripts transferred to the Hub by @albertvillanova in #5136
- Delete duplicate issue template file by @albertvillanova in #5146
- Deprecate num_proc parameter in DownloadManager.extract by @ayushthe1 in #5142
- Raise ImportError instead of OSError by @ayushthe1 in #5141
- Fix CI require beam by @albertvillanova in #5168
- Make iter_files deterministic by @albertvillanova in #5149
- Add PB and TB in convert_file_size_to_int by @lhoestq in #5171
- Reduce default max writer_batch_size by @mariosasko in #5163
- Support dill 0.3.6 by @albertvillanova in #5166
- Make filename matching more robust by @riccardobucco in #5128
- Preserve None in list type cast in PyArrow 10 by @mariosasko in #5174
- Raise ffmpeg warnings only once by @polinaeterna in #5173
- Add "ipykernel" to list of
co_filename
s to remove by @gpucce in #5169 - chore: add notebook links to img cls and obj det. by @sayakpaul in #5187
- Fix docs about dataset_info in YAML by @albertvillanova in #5194
- fsspec lock reset in multiprocessing by @lhoestq in #5159
- Add note about the name of a dataset script by @polinaeterna in #5198
- Deprecate dummy data generation command by @mariosasko in #5199
- Do not sort splits in dataset info by @polinaeterna in #5201
- Add missing DownloadConfig.use_auth_token value by @alvarobartt in #5205
- Update canonical links to Hub links by @stevhliu in #5203
- Refactor CI hub fixtures to use monkeypatch instead of patch by @albertvillanova in #5208
- Update github pr docs actions by @mishig25 in #5214
- Use hfh hf_hub_url function by @albertvillanova in #5196
- Pin typer version in tests to <0.5 to fix Windows CI by @polinaeterna in #5235
- Fix shards in IterableDataset.from_generator by @lhoestq in #5233
- Fix class name of symbolic link by @riccardobucco in #5126
- Make Version hashable by @mariosasko in #5238
- Handle ArrowNotImplementedError caused by try_type being Image or Audio in cast by @mariosasko in #5236
- Encode path only for old versions of hfh by @lhoestq in #5237
- Fix CI require_beam maximum compatible dill version by @albertvillanova in #5212
- Support hfh rc version by @lhoestq in #5241
- Cleaner error tracebacks for dataset script errors by @mariosasko in #5240
New Contributors
- @david1542 made their first contribution in #5120
- @ayushthe1 made their first contribution in #5142
- @gpucce made their first contribution in #5169
- @sayakpaul made their first contribution in #5187
- @NightMachinery made their first contribution in #5155
Full Changelog: 2.6.1...2.7.0
2.6.1
Bug fixes
- Fix filter indices when batched by @albertvillanova in #5113
- fixed a bug where filter could return examples with the wrong indices
- Fix iter_batches by @lhoestq in #5115
- fixed a bug where map with batched=True could return a dataset with fewer examples
- Fix a typo in arrow_dataset.py by @yangky11 in #5108
New Contributors
Full Changelog: 2.6.0...2.6.1
2.6.0
Important
- [GH->HF] Remove all dataset scripts from github by @lhoestq in #4974
- all the dataset scripts and dataset cards are now on https://hf.co/datasets
- we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on
Datasets features
- Add ability to read-write to SQL databases. by @Dref360 in #4928
- Read from sqlite file:
```python
from datasets import Dataset

dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
```
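- Write back to a SQL database with the companion to_sql method (a minimal sketch; the table and file names are illustrative, and sqlalchemy is needed for URI-style connection strings):

```python
from datasets import Dataset

dataset = Dataset.from_dict({"text": ["hello", "world"]})
dataset.to_sql("data_table", "sqlite:///sqlite_file.db")
```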
- Allow connection objects in from_sql + small doc improvement by @mariosasko in #5091

```python
from datasets import Dataset
from sqlite3 import connect

con = connect(...)
dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)
```
- Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in #5072
- return numpy/torch/tf/jax tensors with:

```python
from datasets import load_dataset

ds = load_dataset("imagenet-1k").with_format("torch")  # or numpy/tf/jax
ds[0]["image"]
```
- Added IterableDataset.from_generator by @hamid-vakilzadeh in #5052
- Fast dataset iter by @mariosasko in #5030
- speed up by a factor of 2 using the Arrow Table reader
- Dataset infos in yaml by @lhoestq in #4926
- you can now specify the feature types and number of samples in the dataset card, see https://huggingface.co/docs/datasets/dataset_card
- Add kwargs to Dataset.from_generator by @mariosasko in #5049
- Support converters in CsvBuilder by @mariosasko in #5057
- Restore saved format state in load_from_disk by @asofiaoliveira in #5073
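A minimal sketch contrasting the Dataset and IterableDataset from_generator entry points mentioned above (the generator and the gen_kwargs usage are illustrative, not from the release notes):

```python
from datasets import Dataset, IterableDataset

def gen(n):
    for i in range(n):
        yield {"id": i, "text": f"example {i}"}

ds = Dataset.from_generator(gen, gen_kwargs={"n": 3})           # materialized as an Arrow dataset
ids = IterableDataset.from_generator(gen, gen_kwargs={"n": 3})  # lazy, yields examples on iteration
```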
Dataset changes
- Update: hendrycks_test - support streaming by @albertvillanova in #5041
- Update: swiss judgment prediction by @JoelNiklaus in #5019
- Update swiss judgment prediction by @JoelNiklaus in #5042
- Fix: xcsr - fix languages of X-CSQA configs by @albertvillanova in #5022
- Fix: sbu_captions - fix URLs by @donglixp in #5020
- Fix: xcsr - fix string features by @albertvillanova in #5024
- Fix: hendrycks_test - fix NonMatchingChecksumError by @albertvillanova in #5040
- Fix: cats_vs_dogs - fix number of samples by @lhoestq in #5047
- Fix: lex_glue - fix bug with labels of eurlex config of lex_glue dataset by @iliaschalkidis in #5048
- Fix: msr_sqa - fix dataset generation by @Timothyxxx in #3715
Dataset cards
- Add description to hellaswag dataset by @julien-c in #4810
- Add deprecation warning to multilingual_librispeech dataset card by @albertvillanova in #5010
- Update languages in aeslc dataset card by @apergo-ai in #3357
- Update license to bookcorpus dataset card by @meg-huggingface in #3526
- Update paper link in medmcqa dataset card by @monk1337 in #4290
- Add oversampling strategy iterable datasets interleave by @ylacombe in #5036
- Fix license/citation information of squadshifts dataset card by @albertvillanova in #5054
General improvements and bug fixes
- Fix missing use_auth_token in streaming docstrings by @albertvillanova in #5003
- Add some note about running the transformers ci before a release by @lhoestq in #5007
- Remove license tag file and validation by @albertvillanova in #5004
- Re-apply input columns change by @mariosasko in #5008
- patch CI_HUB_TOKEN_PATH with Path instead of str by @Wauplin in #5026
- Fix typo in error message by @severo in #5027
- Fix import in ClassLabel docstring example by @alvarobartt in #5029
- Remove redundant code from some dataset module factories by @albertvillanova in #5033
- Fix typos in load docstrings and comments by @albertvillanova in #5035
- Prefer split patterns from directories over split patterns from filenames by @polinaeterna in #4985
- Fix tar extraction vuln by @lhoestq in #5016
- Support hfh 0.10 implicit auth by @lhoestq in #5031
- Fix flatten_indices with empty indices mapping by @mariosasko in #5043
- Improve CI performance speed of PackagedDatasetTest by @albertvillanova in #5037
- Revert task removal in folder-based builders by @mariosasko in #5051
- Fix backward compatibility for dataset_infos.json by @lhoestq in #5055
- Fix typo by @stevhliu in #5059
- Fix CI hfh token warning by @albertvillanova in #5062
- Mark CI tests as xfail when 502 error by @albertvillanova in #5058
- Fix passed download_config in HubDatasetModuleFactoryWithoutScript by @albertvillanova in #5077
- Fix CONTRIBUTING once dataset scripts transferred to Hub by @albertvillanova in #5067
- Fix header level in Audio docs by @stevhliu in #5078
- Support DEFAULT_CONFIG_NAME when no BUILDER_CONFIGS by @albertvillanova in #5071
- Support streaming gzip.open by @albertvillanova in #5066
- adding keep in memory by @Mustapha-AJEGHRIR in #5082
- refactor: replace AssertionError with more meaningful exceptions (#5074) by @galbwe in #5079
- fix: update exception throw from OSError to EnvironmentError in `push… by @rahulXs in #5076
- Align signature of list_repo_files with latest hfh by @albertvillanova in #5063
- Align signature of create/delete_repo with latest hfh by @albertvillanova in #5064
- Fix filter with empty indices by @Mouhanedg56 in #5087
- Fix tutorial (#5093) by @riccardobucco in #5095
- Use HTML relative paths for tiles in the docs by @lewtun in #5092
- Fix loading how to guide (#5102) by @riccardobucco in #5104
- url encode hub url (#5099) by @riccardobucco in #5103
- Free the "hf" filesystem protocol for hffs by @lhoestq in #5101
- Fix task template reload from dict by @lhoestq in #5106
New Contributors
- @Wauplin made their first contribution in #5026
- @donglixp made their first contribution in #5020
- @Timothyxxx made their first contribution in #3715
- @hamid-vakilzadeh made their first contribution in #5052
- @Mustapha-AJEGHRIR made their first contribution in #5082
- @galbwe made their first contribution in #5079
- @rahulXs made their first contribution in #5076
- @Mouhanedg56 made their first contribution in #5087
- @riccardobucco made their first contribution in #5095
- @asofiaoliveira made their first contribution in #5073
Full Changelog: 2.5.1...2.6.0
2.5.2
2.5.1
2.5.0
Important
- Drop Python 3.6 support by @mariosasko in #4460
- Deprecate metrics by @albertvillanova in #4739
- Metrics are now deprecated and have been moved to evaluate:

```python
!pip install evaluate
import evaluate

metric = evaluate.load("accuracy")
```
- Load GitHub datasets from Hub by @albertvillanova in #4059
- datasets with no namespace like "squad" were loaded from this GitHub repository, now they're loaded from https://huggingface.co/datasets
- Decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in #4923
- latest version of torchaudio 0.12 now requires ffmpeg (version 4) to read MP3 files, please downgrade to 0.12 for now or use librosa
- Use HTTP requests to access data and metadata through the Datasets REST API (docs here)
Datasets features
No-code loaders
- Add AudioFolder packaged loader by @polinaeterna in #4530
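- A minimal usage sketch (the local directory path is illustrative; the folder is expected to contain audio files, optionally with a metadata file):

```python
from datasets import load_dataset

ds = load_dataset("audiofolder", data_dir="/path/to/audio_folder")
```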
- Add support for CSV metadata files to ImageFolder by @mariosasko in #4837
- Add support for parsing JSON files in array form by @mariosasko in #4997
Dataset methods
- add Dataset.from_list by @sanderland in #4890
- Add Dataset.from_generator by @mariosasko in #4957
- Add oversampling strategies to interleave datasets by @ylacombe in #4831
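- A minimal sketch of the new oversampling option (the toy datasets are illustrative):

```python
from datasets import Dataset, interleave_datasets

d1 = Dataset.from_dict({"x": [0, 1, 2]})
d2 = Dataset.from_dict({"x": [10, 11]})

# "all_exhausted" keeps sampling (oversampling the smaller dataset) until every dataset has been fully seen
mixed = interleave_datasets([d1, d2], stopping_strategy="all_exhausted")
```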
- Preserve non-input_columns in Dataset.map if input_columns are specified by @mariosasko in #4971
- Add fn_kwargs param to IterableDataset.map by @mariosasko in #4975
- More rigorous shape inference in to_tf_dataset by @Rocketknight1 in #4763
Parquet support
- Download and prepare as Parquet for cloud storage by @lhoestq in #4724
- Shard parquet in download_and_prepare by @lhoestq in #4747
- Embed image/audio data in dl_and_prepare parquet by @lhoestq in #4987
Datasets changes
- Update: natural questions - Add long answer candidates by @seirasto in #4368
- Update: opus_paracrawl - update version by @albertvillanova in #4816
- Update: ReCoRD - Include entity positions as feature by @richarddwang in #4479
- Update: swda - Support streaming by @albertvillanova in #4914
- Update: Enwik8 - update broken link and information by @mtanghu in #4
- Update: compguesswhat - Support streaming by @albertvillanova in #4968
- Update: nli_tr - Support streaming by @albertvillanova in #4970
- Update: IndicGLUE - update download links by @sumanthd17 in #4978
- Update: iwslt2017 - Support streaming by @albertvillanova in #4992
- Fix: mbpp - fix NonMatchingChecksumError by @albertvillanova in #4788
- Fix: mkqa - Update data URL by @albertvillanova in #4823
- Fix: exams - fix bug and checksums by @albertvillanova in #4853
- Fix: trec - use fine classes by @albertvillanova in #4801
- Fix: wmt datasets - fix CWMT zh subsets by @lhoestq in #4871
- Fix: LibriSpeech - Fix dev split local_extracted_archive for 'all' config by @sanchit-gandhi in #4904
- Fix: compguesswhat - fix data URLs by @albertvillanova in #4959
- Fix: vivos - fix data URL and metadata by @albertvillanova in #4969
- Fix: MBPP - Add splits by @cwarny in #4943
Dataset cards
- Add language_bcp47 tag by @lhoestq in #4753
- Added more information in the README about contributors of the Arabic Speech Corpus by @nawarhalabi in #4701
- Remove "unkown" language tags by @lhoestq in #4754
- Highlight non-commercial license in amazon_reviews_multi dataset card by @sbroadhurst-hf in #4712
- Added dataset information in clinic oos dataset card by @arnav-ladkat in #4751
- Fix opus_gnome dataset card by @gojiteji in #4806
- Complete the mlqa dataset card by @eldhoittangeorge in #4809
- Fix loading example in opus dataset cards by @albertvillanova in #4813
- Add missing language tags to resources by @albertvillanova in #4819
- Fix titles in dataset cards by @albertvillanova in #4824
- Fix language tags in dataset cards by @albertvillanova in #4826
- Add license metadata to pg19 by @julien-c in #4827
- Fix task tags in dataset cards by @albertvillanova in #4830
- Fix tags in dataset cards by @albertvillanova in #4832
- Fix missing tags in dataset cards by @albertvillanova in #4833
- Fix documentation card of recipe_nlg dataset by @albertvillanova in #4834
- Fix documentation card of ethos dataset by @albertvillanova in #4835
- Update documentation card of miam dataset by @PierreColombo in #4846
- Update stackexchange license by @cakiki in #4842
- Update ted_talks_iwslt license to include ND by @cakiki in #4841
- Fix documentation card of adv_glue dataset by @albertvillanova in #4838
- Complete tags of superglue dataset card by @richarddwang in #4867 and #4869
- Fix license tag and Source Data section in billsum dataset card by @kashif in #4851
- Fix documentation card of covid_qa_castorini dataset by @albertvillanova in #4877
- Fix Citation Information section in dataset cards by @albertvillanova in #4879
- Fix documentation card of math_qa dataset by @albertvillanova in #4884
- Added names of less-studied languages by @BenjaminGalliot in #4880
- Fix language tags resource file by @albertvillanova in #4882
- Add citation to ro_sts and ro_sts_parallel datasets by @albertvillanova in #4892
- Add citation information to makhzan dataset by @albertvillanova in #4894
- Fix missing tags in dataset cards by @albertvillanova in #4891
- Fix missing tags in dataset cards by @albertvillanova in #4896
- Re-add code and und language tags by @albertvillanova in #4899
- Add "cc-by-nc-sa-2.0" to list of licenses by @osanseviero in https://github.com/huggingface/datasets/pull/48874903
- Update GLUE evaluation metadata by @lewtun in #4909
- Fix missing tags in dataset cards by @albertvillanova in #4908
- Add license and citation information to cosmos_qa dataset by @albertvillanova in #4913
- Fix missing tags in dataset cards by @albertvillanova in #4921
- Add cc-by-nc-2.0 to list of licenses by @albertvillanova in #4930
- Fix missing tags in dataset cards by @albertvillanova in #4931
- Add Papers with Code ID to scifact dataset by @albertvillanova in #4941
- Fix license information in qasc dataset card by @albertvillanova in #4951
- Fix multilinguality tag and missing sections in xquad_r dataset card by @albertvillanova in #4940
- Fix missing tags in dataset cards by @albertvillanova in #4979
- Fix missing tags in dataset cards by @albertvillanova in #4991