
FileNotFoundError when using scan_parquet on an s3 file #10008

Closed
2 tasks done
TylerGrantSmith opened this issue Jul 20, 2023 · 13 comments · Fixed by #10098
Labels
bug (Something isn't working) · python (Related to Python Polars)

Comments

@TylerGrantSmith

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import pyarrow.dataset as ds
import polars as pl

path = 's3://rimrep-data-public/091-aims-sst/test-50-64-spatialpart'
dset = ds.dataset(path)

# works
print(pl.scan_pyarrow_dataset(dset))

# errors with FileNotFoundError
pl.scan_parquet(path)

Issue description

This call has errored, in varying ways, since at least 0.18.5.

Expected behavior

It should load normally.

Installed versions

--------Version info---------
Polars:              0.18.8
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.6.0
matplotlib:          <not installed>
numpy:               1.25.1
pandas:              <not installed>
pyarrow:             12.0.1
pydantic:            2.0.1
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
@TylerGrantSmith added the bug (Something isn't working) and python (Related to Python Polars) labels on Jul 20, 2023
@ritchie46
Member

I am surprised we were able to scan from cloud? I don't think we ever supported that?

@josh
Contributor

josh commented Jul 24, 2023

pl.scan_parquet worked on s3 up until 0.18.3 for me.

@sullivancolin

sullivancolin commented Jul 25, 2023

This has been a frustrating hidden behavior of polars (using 0.18.8)

The behavior of pl.scan_parquet(path) is unpredictable:

Path     File   Directory/glob
Local    ✅     ✅
Remote   ✅     ⛔

In the case of a single remote parquet file, everything works as expected. But if the path is a directory in s3 containing many parquet files, or a glob pattern, polars doesn't throw an exception but silently scans only a single matching parquet file. It would be more helpful to throw an exception when trying to scan multiple remote parquet files. It might also be helpful to warn the user that scanning a remote parquet file is not officially supported.

This behavior is surprising when coming from Dask DataFrames, where both patterns work as expected for local and remote paths.

You can reproduce with this public bucket s3://saturn-public-data/nyc-taxi/data/

import polars as pl

file_df = pl.scan_parquet("s3://saturn-public-data/nyc-taxi/data/yellow_tripdata_2019-01.parquet")
print(file_df.collect().shape)

glob_df = pl.scan_parquet("s3://saturn-public-data/nyc-taxi/data/yellow_tripdata_2019-*.parquet")
print(glob_df.collect().shape)

Output:

(7667792, 18)
(7667792, 18)

Here is a workaround using s3fs and pl.concat:

import s3fs
import polars as pl

s3 = s3fs.S3FileSystem(anon=False)
files = s3.glob("s3://saturn-public-data/nyc-taxi/data/yellow_tripdata_2019-*.parquet")
actual_ds = pl.concat([pl.scan_parquet(f"s3://{f}").select("VendorID") for f in files])
print(actual_ds.collect().shape)

Output:

(84569619, 1)

aws s3 ls s3://saturn-public-data/nyc-taxi/data/

2021-01-27 21:40:26  137510381 yellow_tripdata_2019-01.parquet
2022-07-11 17:29:29  103356025 yellow_tripdata_2019-02.parquet
2022-07-11 17:29:29  116017372 yellow_tripdata_2019-03.parquet
2022-07-11 17:29:29  110139137 yellow_tripdata_2019-04.parquet
2022-07-11 17:29:29  111478943 yellow_tripdata_2019-05.parquet
2022-07-11 17:29:29  102903344 yellow_tripdata_2019-06.parquet
2022-07-11 17:29:29   93877343 yellow_tripdata_2019-07.parquet
2022-07-11 17:29:29   89999675 yellow_tripdata_2019-08.parquet
2022-07-11 17:29:29   97110325 yellow_tripdata_2019-09.parquet
2022-07-11 17:29:29  106293373 yellow_tripdata_2019-10.parquet
2022-07-11 17:29:29  100872983 yellow_tripdata_2019-11.parquet
2022-07-11 17:29:29  101044777 yellow_tripdata_2019-12.parquet

@thomas-tran-de

Hey, just chiming in to say that I have the same experience as @josh. I'm using scan_parquet to read single files from Azure. Up to 0.18.3 this worked great. Since then I've been getting FileNotFound errors.

@ritchie46
Member

I don't understand, as we never supported scan_ on cloud files officially. So I am wondering how that has worked?

@Bramtimm

Same issue here. Not sure how this ever worked, but it probably stems from (or is a side effect of) #9914: pathlib collapses multiple // into a single / when converting to str, so fsspec can't infer the s3 path with infer_storage_options here:

if infer_storage_options(file)["protocol"] == "file":
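A minimal sketch of that failure mode (hypothetical bucket and key; shown with POSIX pathlib semantics):

from pathlib import Path
from fsspec.utils import infer_storage_options

# pathlib treats "s3:" as an ordinary path component, so the double
# slash after the scheme is collapsed when converting back to str:
print(str(Path("s3://bucket/key.parquet")))  # s3:/bucket/key.parquet

# without the "://" separator, fsspec falls back to the local protocol:
print(infer_storage_options("s3://bucket/key.parquet")["protocol"])  # s3
print(infer_storage_options("s3:/bucket/key.parquet")["protocol"])   # file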

@cjackal
Contributor

cjackal commented Jul 26, 2023

> I don't understand, as we never supported scan_ on cloud files officially. So I am wondering how that has worked?

No offence, but @ritchie46, you implemented scanning from cloud in #3626.

@cjackal
Contributor

cjackal commented Jul 26, 2023

> This has been a frustrating hidden behavior of polars (using 0.18.8) […] [quotes @sullivancolin's comment above in full]

This behavior is inherited from fsspec: polars simply passes the path (glob pattern) to fsspec.open, and fsspec.open opens the first file matching the path. It is officially documented that fsspec.open does not expect a glob pattern as input.
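A quick sketch of the distinction (assuming anonymous access to the public bucket above):

import fsspec

# fsspec.open() returns a single OpenFile, so at most one object is read,
# whereas fsspec.open_files() expands a glob to every matching file:
files = fsspec.open_files(
    "s3://saturn-public-data/nyc-taxi/data/yellow_tripdata_2019-*.parquet",
    mode="rb",
    anon=True,  # public bucket, no credentials required
)
print(len(files))  # 12, matching the aws s3 ls listing above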

@ritchie46
Member

Hmm.. totally forgot about that.

Will take a look.

@ritchie46
Member

Ok, got it fixed. Is there a way to test this without needing AWS credentials?

@cjackal
Contributor

cjackal commented Jul 26, 2023

> Ok, got it fixed. Is there a way to test this without needing AWS credentials?

I have used moto to mock S3 before; it is kind of the standard way to test against S3, I think (e.g. s3fs uses it).
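For reference, a rough sketch of what a moto-backed test could look like (hypothetical bucket, key, and port throughout; moto's server mode is used because s3fs talks to S3 over HTTP, so an in-process mock alone may not intercept it):

import boto3
import polars as pl
from moto.server import ThreadedMotoServer  # pip install 'moto[server]'

# start a local mock-S3 endpoint and point both boto3 and polars at it
server = ThreadedMotoServer(port=5555)
server.start()
endpoint = "http://127.0.0.1:5555"

s3 = boto3.client(
    "s3",
    endpoint_url=endpoint,
    region_name="us-east-1",
    aws_access_key_id="testing",
    aws_secret_access_key="testing",
)
s3.create_bucket(Bucket="test-bucket")

# seed the mock bucket with a small parquet file
pl.DataFrame({"a": [1, 2, 3]}).write_parquet("data.parquet")
s3.upload_file("data.parquet", "test-bucket", "data.parquet")

# exercise the s3:// scan path under test
df = pl.scan_parquet(
    "s3://test-bucket/data.parquet",
    storage_options={"endpoint_url": endpoint, "key": "testing", "secret": "testing"},
).collect()
assert df.shape == (3, 1)
server.stop()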

@ritchie46
Member

Right, thanks for the tip. Anyone care to set up proper moto-backed tests?

@marcelbischoff

I have to read up on it, but I think storage_options needs to be passed here:

return _read_parquet_schema(source)
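For illustration, a hedged sketch of the user-facing call this would enable (anon=True is the s3fs option for anonymous access; just the shape of the idea, not the eventual fix):

import polars as pl

# storage_options is handed to fsspec; the schema read referenced above
# would also need to receive it for remote scans to resolve:
lf = pl.scan_parquet(
    "s3://saturn-public-data/nyc-taxi/data/yellow_tripdata_2019-01.parquet",
    storage_options={"anon": True},
)
print(lf.schema)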
