
FileNotFoundError when using scan_parquet on an s3 file #10008

Closed
2 tasks done
TylerGrantSmith opened this issue Jul 20, 2023 · 13 comments · Fixed by #10098
Labels
bug (Something isn't working) · python (Related to Python Polars)

Comments

@TylerGrantSmith

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import pyarrow.dataset as ds
import polars as pl

path = 's3://rimrep-data-public/091-aims-sst/test-50-64-spatialpart'
dset = ds.dataset(path)

# works
print(pl.scan_pyarrow_dataset(dset))

# errors with FileNotFoundError
pl.scan_parquet(path)

Issue description

This call has errored, in varying ways, since at least 0.18.5.

Expected behavior

It should load normally.

Installed versions

--------Version info---------
Polars:              0.18.8
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.6.0
matplotlib:          <not installed>
numpy:               1.25.1
pandas:              <not installed>
pyarrow:             12.0.1
pydantic:            2.0.1
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
@TylerGrantSmith added the bug (Something isn't working) and python (Related to Python Polars) labels on Jul 20, 2023
@ritchie46
Member

I am surprised we were able to scan from cloud? I don't think we ever supported that?

@josh
Contributor

josh commented Jul 24, 2023

pl.scan_parquet worked on s3 up until 0.18.3 for me.

@sullivancolin

sullivancolin commented Jul 25, 2023

This has been a frustrating hidden behavior of polars (using 0.18.8)

The behavior of pl.scan_parquet(path) is unpredictable:

Path     File   Directory/glob
Local    ✅     ✅
Remote   ✅     ⛔

In the case of a single remote parquet file, everything works as expected. But if the path is a directory in s3 containing many parquet files, or a glob pattern, polars doesn't throw an exception but silently scans only a single matching parquet file. It would be more helpful to throw an exception when trying to scan multiple remote parquet files. It might also be helpful to warn the user that scanning a remote parquet file is not officially supported.

This behavior is surprising when coming from Dask DataFrames, where both patterns work as expected for local and remote paths.

You can reproduce with this public bucket s3://saturn-public-data/nyc-taxi/data/

import polars as pl

file_df = pl.scan_parquet("s3://saturn-public-data/nyc-taxi/data/yellow_tripdata_2019-01.parquet")
print(file_df.collect().shape)

glob_df = pl.scan_parquet("s3://saturn-public-data/nyc-taxi/data/yellow_tripdata_2019-*.parquet")
print(glob_df.collect().shape)

Output:

(7667792, 18)
(7667792, 18)

Here is a workaround using s3fs and pl.concat:

import s3fs
import polars as pl

s3 = s3fs.S3FileSystem(anon=False)
files = s3.glob("s3://saturn-public-data/nyc-taxi/data/yellow_tripdata_2019-*.parquet")
actual_ds = pl.concat([pl.scan_parquet(f"s3://{f}").select("VendorID") for f in files])
print(actual_ds.collect().shape)

Output:

(84569619, 1)

aws s3 ls s3://saturn-public-data/nyc-taxi/data/

2021-01-27 21:40:26  137510381 yellow_tripdata_2019-01.parquet
2022-07-11 17:29:29  103356025 yellow_tripdata_2019-02.parquet
2022-07-11 17:29:29  116017372 yellow_tripdata_2019-03.parquet
2022-07-11 17:29:29  110139137 yellow_tripdata_2019-04.parquet
2022-07-11 17:29:29  111478943 yellow_tripdata_2019-05.parquet
2022-07-11 17:29:29  102903344 yellow_tripdata_2019-06.parquet
2022-07-11 17:29:29   93877343 yellow_tripdata_2019-07.parquet
2022-07-11 17:29:29   89999675 yellow_tripdata_2019-08.parquet
2022-07-11 17:29:29   97110325 yellow_tripdata_2019-09.parquet
2022-07-11 17:29:29  106293373 yellow_tripdata_2019-10.parquet
2022-07-11 17:29:29  100872983 yellow_tripdata_2019-11.parquet
2022-07-11 17:29:29  101044777 yellow_tripdata_2019-12.parquet

@thomas-tran-de

Hey, just chiming in to say that I have the same experience as @josh. I'm using scan_parquet to read single files from Azure. Up to 0.18.3 this worked great. Since then I've been getting FileNotFound errors.

@ritchie46
Member

I don't understand, as we never supported scan_ on cloud files officially. So I am wondering how that has worked?

@Bramtimm

Same issue here. Not sure how this ever worked, but it probably stems from (or is a side effect of) #9914: pathlib collapses multiple // into a single / when converting to str, so fsspec can't infer the s3 path with infer_storage_options here:

if infer_storage_options(file)["protocol"] == "file":
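A minimal sketch of that failure mode (hypothetical bucket and key; shown with POSIX pathlib semantics):

from pathlib import Path
from fsspec.utils import infer_storage_options

# pathlib treats "s3:" as an ordinary path component, so the double
# slash after the scheme is collapsed when converting back to str:
print(str(Path("s3://bucket/key.parquet")))  # s3:/bucket/key.parquet

# without the "://" separator, fsspec falls back to the local protocol:
print(infer_storage_options("s3://bucket/key.parquet")["protocol"])  # s3
print(infer_storage_options("s3:/bucket/key.parquet")["protocol"])   # file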

@cjackal
Contributor

cjackal commented Jul 26, 2023

> I don't understand, as we never supported scan_ on cloud files officially. So I am wondering how that has worked?

No offence, but @ritchie46, you implemented scanning from cloud in #3626.

@cjackal
Contributor

cjackal commented Jul 26, 2023

> This has been a frustrating hidden behavior of polars (using 0.18.8) […] [quotes @sullivancolin's comment above in full]

This behavior is inherited from fsspec: polars simply passes the path (glob pattern) to fsspec.open, and fsspec.open opens the first file matching the path. It is officially documented that fsspec.open does not expect a glob pattern as input.
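A quick sketch of the distinction (assuming anonymous access to the public bucket above):

import fsspec

# fsspec.open() returns a single OpenFile, so at most one object is read,
# whereas fsspec.open_files() expands a glob to every matching file:
files = fsspec.open_files(
    "s3://saturn-public-data/nyc-taxi/data/yellow_tripdata_2019-*.parquet",
    mode="rb",
    anon=True,  # public bucket, no credentials required
)
print(len(files))  # 12, matching the aws s3 ls listing above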

@ritchie46
Member

Hmm.. totally forgot about that.

Will take a look.

@ritchie46
Member

Ok, got it fixed. Is there a way to test this without needing AWS credentials?

@cjackal
Contributor

cjackal commented Jul 26, 2023

> Ok, got it fixed. Is there a way to test this without needing AWS credentials?

I have used moto to mock S3 before; it is kind of the standard way to test against S3, I think (e.g. s3fs uses it).
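For reference, a rough sketch of what a moto-backed test could look like (hypothetical bucket, key, and port throughout; moto's server mode is used because s3fs talks to S3 over HTTP, so an in-process mock alone may not intercept it):

import boto3
import polars as pl
from moto.server import ThreadedMotoServer  # pip install 'moto[server]'

# start a local mock-S3 endpoint and point both boto3 and polars at it
server = ThreadedMotoServer(port=5555)
server.start()
endpoint = "http://127.0.0.1:5555"

s3 = boto3.client(
    "s3",
    endpoint_url=endpoint,
    region_name="us-east-1",
    aws_access_key_id="testing",
    aws_secret_access_key="testing",
)
s3.create_bucket(Bucket="test-bucket")

# seed the mock bucket with a small parquet file
pl.DataFrame({"a": [1, 2, 3]}).write_parquet("data.parquet")
s3.upload_file("data.parquet", "test-bucket", "data.parquet")

# exercise the s3:// scan path under test
df = pl.scan_parquet(
    "s3://test-bucket/data.parquet",
    storage_options={"endpoint_url": endpoint, "key": "testing", "secret": "testing"},
).collect()
assert df.shape == (3, 1)
server.stop()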

@ritchie46
Member

Right, thanks for the tip. Anyone care to set up proper moto-backed tests?

@marcelbischoff

I have to read up on it, but I think storage_options needs to be passed here:

return _read_parquet_schema(source)
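For illustration, a hedged sketch of the user-facing call this would enable (anon=True is the s3fs option for anonymous access; just the shape of the idea, not the eventual fix):

import polars as pl

# storage_options is handed to fsspec; the schema read referenced above
# would also need to receive it for remote scans to resolve:
lf = pl.scan_parquet(
    "s3://saturn-public-data/nyc-taxi/data/yellow_tripdata_2019-01.parquet",
    storage_options={"anon": True},
)
print(lf.schema)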
