Fix inferring module for unsupported data files #5787
Conversation
Thanks!
```diff
  if len(set(list(zip(*module_names.values()))[0])) > 1:
      raise ValueError(f"Couldn't infer the same data file format for all splits. Got {module_names}")
- module_name, builder_kwargs = next(iter(module_names.values()))
+ module_name, builder_kwargs = next(iter(split_modules.values()))
```
Maybe check that `split_modules` is not empty?
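For context, a minimal sketch of what goes wrong without such a check (plain Python, no `datasets`-specific code): both lines of the hunk above crash on an empty mapping, just with unhelpful errors.

```python
# Minimal illustration of why an empty `split_modules` would crash in the hunk above.
split_modules = {}  # e.g. what data_files={} would produce

# The format-consistency check indexes into an empty list -> IndexError:
try:
    len(set(list(zip(*split_modules.values()))[0]))
except IndexError:
    print("zip(*{}.values()) yields nothing, so [0] raises IndexError")

# And even without that line, next() on an empty dict raises StopIteration:
try:
    next(iter(split_modules.values()))
except StopIteration:
    print("next(iter({}.values())) raises StopIteration")
```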
Thanks for your review, @lhoestq.

I think it can only be empty if the user passes `data_files={}`; otherwise there are 2 options: either it is not empty or an exception is raised (see the sketch below):
- `split_modules` is derived from `data_files`, which is an instance of `DataFilesDict.from_local_or_remote` with `patterns`
- `patterns` is derived either from `sanitize_patterns` or `get_data_patterns_locally`
  - `sanitize_patterns` can only return an empty dict if the user passes `data_files={}`
  - `get_data_patterns_locally` can only return a non-empty dict or raise an `EmptyDatasetError`

I think the validation of `data_files={}` should be elsewhere though. What do you think?
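A runnable paraphrase of that chain, to make the two cases concrete. These are simplified hypothetical stand-ins for the helpers named above, not the real implementations; exact behavior may differ across `datasets` versions.

```python
# Simplified stand-ins for the helpers discussed above (not the real source).

class EmptyDatasetError(FileNotFoundError):
    pass

def sanitize_patterns(data_files):
    # Normalizes user-provided data_files into split -> list of patterns.
    # It can only return an empty dict if the user passed data_files={}.
    if isinstance(data_files, dict):
        return {split: v if isinstance(v, list) else [v] for split, v in data_files.items()}
    return {"train": data_files if isinstance(data_files, list) else [data_files]}

def get_data_patterns_locally(found_files):
    # Either returns a non-empty dict of patterns or raises EmptyDatasetError.
    if not found_files:
        raise EmptyDatasetError("The directory doesn't contain any data files")
    return {"train": ["**"]}

assert sanitize_patterns({}) == {}                       # the only empty case
assert sanitize_patterns("data.csv") == {"train": ["data.csv"]}
assert get_data_patterns_locally(["data.csv"])           # non-empty, or it raises
```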
Maybe change

```python
sanitize_patterns(self.data_files) if self.data_files is not None else get_data_patterns_locally(base_path)
```

to

```python
sanitize_patterns(self.data_files) if self.data_files else get_data_patterns_locally(base_path)
```

This way, we are sure `split_modules` is never empty.
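To make the difference concrete (plain Python: an empty dict is falsy but is not `None`):

```python
data_files = {}

# Current check: {} is not None, so sanitize_patterns({}) runs and yields {}.
print(data_files is not None)  # True

# Proposed check: {} is falsy, so get_data_patterns_locally(base_path) runs instead.
print(bool(data_files))        # False
```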
> I think the validation of `data_files={}` should be elsewhere though. What do you think?

Yea indeed, probably in `load_dataset_builder`?

> Maybe change ...

I think it's better if it raises an error rather than trying to make it run with data files that were not requested.
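For instance, a minimal sketch of what such early validation in `load_dataset_builder` could look like. The placement, check, and error message are hypothetical; this is not the actual implementation.

```python
def load_dataset_builder(path, data_files=None, **kwargs):
    # Hypothetical early check; the rest of the real function is elided.
    if data_files is not None and not data_files:
        raise ValueError(
            "Empty `data_files` received: pass at least one data file, "
            "or omit `data_files` to let the patterns be inferred."
        )
    ...
```

Raising at the entry point keeps the module-inference code free of the empty-dict special case, which matches the preference for failing loudly over silently substituting data files the user didn't request.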
Feel free to merge then :)
```diff
  if len(set(list(zip(*module_names.values()))[0])) > 1:
      raise ValueError(f"Couldn't infer the same data file format for all splits. Got {module_names}")
- module_name, builder_kwargs = next(iter(module_names.values()))
+ module_name, builder_kwargs = next(iter(split_modules.values()))
```
same here
I think you can revert the last commit - it should fail if `data_files={}` IMO
This reverts commit 650141d, as requested by the reviewer.
The validation of non-empty `data_files` is addressed in this PR:
This PR raises a `FileNotFoundError` instead:
Fix #5785.
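For illustration, a hypothetical way a user could hit this path (the directory name is made up, and the exact error message comes from the library):

```python
from datasets import load_dataset

# Hypothetical folder containing only files whose format cannot be inferred;
# with this PR the call surfaces a FileNotFoundError rather than failing
# later while inferring the data file format for each split.
load_dataset("./folder_with_only_unsupported_files")  # raises FileNotFoundError
```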