feat: Raise more informative error message for directories containing files with mixed extensions #17480

Merged 2 commits into pola-rs:main on Jul 8, 2024

Conversation

nameexhaustion
Collaborator

@nameexhaustion nameexhaustion commented Jul 8, 2024

If the user passes a single directory to a scan_* function, we now check that all files underneath it share the same file extension. If they do not, an error is raised that shows the offending paths and recommends using a glob pattern if the user still wishes to scan all of the files.

Combined with #17478, this also means that the file extensions of empty files are ignored.
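
For readers who want a concrete picture of the check described above, here is a minimal Python sketch. It is not the implementation from this PR (which lives in Rust, in `crates/polars-lazy/src/scan/file_list_reader.rs`); the helper name `check_uniform_extensions` and the wording of the error message are assumptions made for illustration only.

```python
from pathlib import Path


def check_uniform_extensions(directory: str) -> list[Path]:
    """Return all files under `directory`, or raise if their extensions differ.

    Hypothetical helper for illustration; not the Polars implementation.
    """
    # Walk the directory recursively and keep regular files only.
    files = sorted(p for p in Path(directory).rglob("*") if p.is_file())
    if not files:
        return files

    first = files[0]
    for path in files[1:]:
        # Compare each file's extension against the first file's extension.
        if path.suffix != first.suffix:
            # Illustrative message: name both offending paths and point
            # the user at glob patterns as the way to opt in explicitly.
            raise ValueError(
                "directory contains files with different extensions: "
                f"{first} and {path}. Use a glob pattern "
                "(for example '**/*.parquet') to select the files to scan."
            )
    return files
```

Calling the helper on a directory that holds only `.parquet` files returns the file list, while adding a stray checksum or metadata file makes it raise with both paths named.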

Fixes #17436

@github-actions github-actions bot added the enhancement, python, and rust labels Jul 8, 2024

codecov bot commented Jul 8, 2024

Codecov Report

Attention: Patch coverage is 97.43590% with 1 line in your changes missing coverage. Please review.

Project coverage is 80.47%. Comparing base (553f30b) to head (64c38a2).

Files                                             Patch %   Lines
crates/polars-lazy/src/scan/file_list_reader.rs   97.05%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #17480   +/-   ##
=======================================
  Coverage   80.46%   80.47%           
=======================================
  Files        1483     1483           
  Lines      195138   195159   +21     
  Branches     2782     2782           
=======================================
+ Hits       157014   157047   +33     
+ Misses      37612    37600   -12     
  Partials      512      512           


@nameexhaustion nameexhaustion marked this pull request as ready for review July 8, 2024 02:41
@nameexhaustion nameexhaustion changed the title from "feat: Raise user-friendly exception if directory contained files with different extensions" to "feat: Raise more informative error message for directories containing files with mixed extensions" Jul 8, 2024
@nameexhaustion nameexhaustion marked this pull request as draft July 8, 2024 06:37
@alexander-beedie
Collaborator

alexander-beedie commented Jul 8, 2024

If you call scan_parquet, wouldn't it be cleaner to just load all the parquet files found in the given directory? It's pretty common for there to be an additional file representing a checksum, metadata, etc., alongside the parquet payload, and the caller's intention seems clear (given that they called a function that can only load parquet, or whichever file type the given scan_* function handles).

@ritchie46
Member

We decided not to identify parquet files by checking against a whitelist of extensions (e.g. parquet, par, pqt).

I also don't want to check for magic bytes on all the files in a directory, as that would potentially require downloading all of them, and it is impossible for (compressed) csv and json.
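
To illustrate what such a magic-byte check would involve, here is a small sketch. Parquet files begin (and end) with the 4-byte marker `PAR1`, whereas CSV and JSON have no comparable signature. The helper name `looks_like_parquet` is hypothetical and this is not something Polars does on directory scans.

```python
PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def looks_like_parquet(path: str) -> bool:
    """Hypothetical check: does the file start with the Parquet magic bytes?"""
    # Every candidate file has to be opened and read; on remote storage
    # that means at least one request per file in the directory.
    with open(path, "rb") as f:
        return f.read(4) == PARQUET_MAGIC
```

The per-file read is the cost referred to above, and no equivalent test exists for plain-text formats.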

The idea is that if you pass a directory, you guarantee it is a (hive partitioned) dataset. If you want to load all files with a certain file extension, we give you the possibility to do so via globbing patterns.
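
As a rough illustration of the glob route, the sketch below uses only the Python standard library to show which files a recursive pattern selects from a directory that also holds non-data files. The helper name `select_by_glob` and its default extension are assumptions for the example, not part of Polars.

```python
import glob
import os


def select_by_glob(directory: str, extension: str = "parquet") -> list[str]:
    """Hypothetical helper: expand a recursive glob for one file extension."""
    # "**" matches zero or more nested directories when recursive=True,
    # so files directly in `directory` and in hive-style subdirectories
    # (for example "x=1/") are both picked up.
    pattern = os.path.join(directory, "**", f"*.{extension}")
    return sorted(glob.glob(pattern, recursive=True))
```

Files such as `_SUCCESS` or `part.parquet.crc` are left out because they do not match `*.parquet`. With Polars, the pattern string itself would be passed to the scan function, for example `pl.scan_parquet("data/**/*.parquet")`, rather than the bare directory path.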

@nameexhaustion nameexhaustion marked this pull request as ready for review July 8, 2024 09:44
@nameexhaustion nameexhaustion marked this pull request as draft July 8, 2024 09:49
@nameexhaustion nameexhaustion marked this pull request as ready for review July 8, 2024 09:50
@ritchie46 ritchie46 merged commit 36eff75 into pola-rs:main Jul 8, 2024
26 checks passed
@nameexhaustion nameexhaustion deleted the dir-ext-check branch July 8, 2024 12:39
henryharbeck pushed a commit to henryharbeck/polars that referenced this pull request Jul 8, 2024
Labels: enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars), rust (Related to Rust Polars)
Linked issue: Improve error messages / documentation around scanning hive directories
3 participants