FileNotFoundError when using scan_parquet on an s3 file #10008
Comments
I am surprised we were able to scan from cloud? I don't think we ever supported that?
This has been a frustrating hidden behavior of polars (using 0.18.8). The behavior of
In the case of a single remote parquet file, everything works as expected. But if the path is a directory in S3 containing many parquet files, or a glob pattern, polars doesn't throw an exception and silently scans only a single matching parquet file. It would be more helpful to throw an exception when it is asked to scan multiple remote parquet files. It might also be helpful to warn the user that scanning a remote parquet file is not officially supported. This behavior is surprising when coming from Dask dataframe, where both patterns work as expected for local and remote paths.
You can reproduce with this public bucket:
import polars as pl
file_df = pl.scan_parquet("s3://saturn-public-data/nyc-taxi/data/yellow_tripdata_2019-01.parquet")
print(file_df.collect().shape)
glob_df = pl.scan_parquet("s3://saturn-public-data/nyc-taxi/data/yellow_tripdata_2019-*.parquet")
print(glob_df.collect().shape)
Output:
(7667792, 18)
(7667792, 18)
Here is a workaround using s3fs and
import s3fs
import polars as pl
s3 = s3fs.S3FileSystem(anon=False)
files = s3.glob("s3://saturn-public-data/nyc-taxi/data/yellow_tripdata_2019-*.parquet")
actual_ds = pl.concat([pl.scan_parquet(f"s3://{f}").select("VendorID") for f in files])
actual_ds.collect().shape
Output:
2021-01-27 21:40:26 137510381 yellow_tripdata_2019-01.parquet
2022-07-11 17:29:29 103356025 yellow_tripdata_2019-02.parquet
2022-07-11 17:29:29 116017372 yellow_tripdata_2019-03.parquet
2022-07-11 17:29:29 110139137 yellow_tripdata_2019-04.parquet
2022-07-11 17:29:29 111478943 yellow_tripdata_2019-05.parquet
2022-07-11 17:29:29 102903344 yellow_tripdata_2019-06.parquet
2022-07-11 17:29:29 93877343 yellow_tripdata_2019-07.parquet
2022-07-11 17:29:29 89999675 yellow_tripdata_2019-08.parquet
2022-07-11 17:29:29 97110325 yellow_tripdata_2019-09.parquet
2022-07-11 17:29:29 106293373 yellow_tripdata_2019-10.parquet
2022-07-11 17:29:29 100872983 yellow_tripdata_2019-11.parquet
2022-07-11 17:29:29 101044777 yellow_tripdata_2019-12.parquet
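The suggestion above (raise instead of silently scanning one file) could be sketched as a small guard that runs before the path reaches the scanner. This is only an illustration, not how polars handles paths: the names `check_scan_source`, `REMOTE_SCHEMES`, and `GLOB_CHARS` are hypothetical, and the set of schemes treated as remote is an assumption.

```python
# Hypothetical guard: reject remote paths that would need multi-file expansion.
# The scheme list below is an assumption made for this sketch.
REMOTE_SCHEMES = {"s3", "s3a", "gs", "gcs", "az", "abfs", "abfss", "http", "https"}
GLOB_CHARS = "*?["


def scheme_of(path: str) -> str:
    """Return the lowercased URL scheme of `path`, or "" for local paths."""
    scheme, sep, _ = path.partition("://")
    return scheme.lower() if sep else ""


def check_scan_source(path: str) -> str:
    """Return `path` unchanged, or raise if it is a remote glob or directory.

    A remote directory can only be recognised here by a trailing slash;
    detecting one without it would require listing the bucket.
    """
    if scheme_of(path) in REMOTE_SCHEMES:
        if any(ch in path for ch in GLOB_CHARS):
            raise ValueError(f"glob patterns are not expanded for remote paths: {path!r}")
        if path.endswith("/"):
            raise ValueError(f"remote directories are not scanned: {path!r}")
    return path
```

With this in place, the single-file path from the reproduction passes through untouched, while the `yellow_tripdata_2019-*.parquet` glob raises a `ValueError` up front and the caller can fall back to expanding the glob explicitly, as in the workaround above.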
Hey, just chiming in to say that I have the same experience as @josh. I'm using scan_parquet to read single files from Azure. Up to 0.18.3 this worked great. Since then I've been getting FileNotFound errors.
I don't understand as we never supported
Same issue here. Not sure how this worked, but the issue probably stems from, or is a side effect of, #9914; see polars/py-polars/polars/io/_utils.py, Line 149 in fdbdd1c
No offence, but @ritchie46 you implemented scan from cloud in #3626
This behavior is inherited from
Hmm.. totally forgot about that. Will take a look.
Ok, got it fixed. Is there a way to test this without needing AWS credentials?
I have used
Right, thanks for the tip. Anyone care to set up proper moto-backed tests?
Have to read up but I feel storage_options need to be passed here:
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Issue description
This has errored in other ways since at least 0.18.5
Expected behavior
It should load like normal.
Installed versions