
NonMatchingSplitsSizesError when using data_dir #6918

Closed
srehaag opened this issue May 24, 2024 · 2 comments · Fixed by #6925

srehaag commented May 24, 2024

Describe the bug

Loading a dataset with the data_dir argument raises a NonMatchingSplitsSizesError when the dataset repository contains multiple data directories.

This appears to happen because the expected split sizes are computed from the data in all of the directories, whereas the recorded split sizes are computed only from the data in the directory selected via the data_dir argument.
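
For illustration, here is a minimal sketch of the comparison that fails, modeled on datasets.utils.info_utils.verify_splits (which appears in the traceback below); the plain dicts stand in for the library's SplitInfo objects, and the numbers are taken from the error message:

# Expected sizes come from the dataset's stored split metadata (covering all
# data directories, per the description above); recorded sizes come from the
# files actually read from the directory selected via data_dir.
expected = {"train": {"num_examples": 10}}  # data1 + data2
recorded = {"train": {"num_examples": 5}}   # only data1

bad_splits = [
    {"expected": expected[name], "recorded": recorded[name]}
    for name in expected
    if expected[name]["num_examples"] != recorded[name]["num_examples"]
]
if bad_splits:
    # The library raises NonMatchingSplitsSizesError at this point.
    raise ValueError(f"splits do not match: {bad_splits}")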

This is recent behavior: until a few weeks ago, loading with the data_dir argument worked without issue.

Steps to reproduce the bug

Simple test dataset available here: https://huggingface.co/datasets/srehaag/hf-bug-temp

The dataset contains two directories, "data1" and "data2", each with a file called "train.parquet" holding a 5 x 2 table (five rows, two columns).
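
For reference, an equivalent local layout can be generated with pandas; the column names below are placeholders, and only the shape (five rows, two columns per file) matters. Note this only mirrors the file structure; the error itself involves the Hub-hosted copy and its stored split metadata.

import pandas as pd
from pathlib import Path

# Two data directories, each holding a 5-row, 2-column train.parquet.
for subdir in ("data1", "data2"):
    Path(subdir).mkdir(exist_ok=True)
    df = pd.DataFrame({"a": range(5), "b": list("vwxyz")})
    df.to_parquet(Path(subdir) / "train.parquet")

Loading the Hub dataset with the data_dir argument then triggers the error: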

from datasets import load_dataset
dataset = load_dataset("srehaag/hf-bug-temp", data_dir = "data1")

Generates:


NonMatchingSplitsSizesError Traceback (most recent call last)
Cell In[3], line 2
1 from datasets import load_dataset
----> 2 dataset = load_dataset("srehaag/hf-bug-temp", data_dir = "data1")

File ~/.python/current/lib/python3.10/site-packages/datasets/load.py:2609, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
2606 return builder_instance.as_streaming_dataset(split=split)
2608 # Download and prepare data
-> 2609 builder_instance.download_and_prepare(
2610 download_config=download_config,
2611 download_mode=download_mode,
2612 verification_mode=verification_mode,
2613 num_proc=num_proc,
2614 storage_options=storage_options,
2615 )
2617 # Build dataset for splits
2618 keep_in_memory = (
2619 keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
2620 )

File ~/.python/current/lib/python3.10/site-packages/datasets/builder.py:1027, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
1025 if num_proc is not None:
1026 prepare_split_kwargs["num_proc"] = num_proc
-> 1027 self._download_and_prepare(
1028 dl_manager=dl_manager,
1029 verification_mode=verification_mode,
1030 **prepare_split_kwargs,
1031 **download_and_prepare_kwargs,
1032 )
1033 # Sync info
1034 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~/.python/current/lib/python3.10/site-packages/datasets/builder.py:1140, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
1137 dl_manager.manage_extracted_files()
1139 if verification_mode == VerificationMode.BASIC_CHECKS or verification_mode == VerificationMode.ALL_CHECKS:
-> 1140 verify_splits(self.info.splits, split_dict)
1142 # Update the info object with the splits.
1143 self.info.splits = split_dict

File ~/.python/current/lib/python3.10/site-packages/datasets/utils/info_utils.py:101, in verify_splits(expected_splits, recorded_splits)
95 bad_splits = [
96 {"expected": expected_splits[name], "recorded": recorded_splits[name]}
97 for name in expected_splits
98 if expected_splits[name].num_examples != recorded_splits[name].num_examples
99 ]
100 if len(bad_splits) > 0:
--> 101 raise NonMatchingSplitsSizesError(str(bad_splits))
102 logger.info("All the splits matched successfully.")

NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=212, num_examples=10, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=106, num_examples=5, shard_lengths=None, dataset_name='hf-bug-temp')}]


By contrast, this loads the data from both data1/train.parquet and data2/train.parquet without any error message:

from datasets import load_dataset
dataset = load_dataset("srehaag/hf-bug-temp")
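
For reference, loading without data_dir yields a single train split built from both directories, consistent with the "expected" SplitInfo (num_examples=10) in the error above. Roughly (the exact repr may vary by version):

print(dataset)
# DatasetDict({
#     train: Dataset({
#         features: [...],
#         num_rows: 10
#     })
# })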

Expected behavior

Should load the 5 x 2 table from data1/train.parquet without an error message.

Environment info

I used GitHub Codespaces to simplify the environment (details below), but the bug is present across various configurations.

  • datasets version: 2.19.1
  • Platform: Linux-6.5.0-1021-azure-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • huggingface_hub version: 0.23.1
  • PyArrow version: 16.1.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.3.1
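
As a possible temporary workaround (not an official fix, just an assumption based on the verification_mode parameter visible in the traceback above), the split-size check can be skipped at load time:

from datasets import load_dataset

# "no_checks" disables the split-size verification that raises
# NonMatchingSplitsSizesError; it also skips the other basic checks.
dataset = load_dataset(
    "srehaag/hf-bug-temp",
    data_dir="data1",
    verification_mode="no_checks",
)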
albertvillanova self-assigned this on May 28, 2024
albertvillanova (Member) commented:

Thanks for reporting, @srehaag.

We are investigating this issue.

albertvillanova added the bug label on May 28, 2024
albertvillanova (Member) commented:

I confirm there is a bug for data-based Hub datasets when the user passes data_dir, which was introduced by PR:
