BUG: pandas, s3fs, and pyarrow raise OSError or FileNotFoundError when reading from S3 #55230
Comments
Here are the scripts I used to make the test matrix:
$ cat run_test.sh
#!/bin/bash
pandas_version=$1
pyarrow_version=$2
venv=venv_${pandas_version}_${pyarrow_version}
{ test -d $venv || python3.10 -m venv $venv;
$venv/bin/pip install "pandas==$pandas_version" "pyarrow==$pyarrow_version" s3fs;
} &> /dev/null
cat <<EOF | $venv/bin/python
import pandas as pd
try:
    df = pd.read_parquet("$S3PATH")
    print("$pandas_version", "$pyarrow_version", "OK", len(df), sep=",")
except Exception as e:
    print("$pandas_version", "$pyarrow_version", type(e).__name__, e, sep=",")
EOF
$ cat run_all.sh
#!/bin/bash
pyarrow_versions=(13.0.0 12.0.1 11.0.0 10.0.1 9.0.0 8.0.0)
pandas_versions=(2.1.1 2.0.3 1.5.3)
export S3PATH
{ echo "pandas_version,pyarrow_version,result,msg";
parallel ./run_test.sh ::: ${pandas_versions[*]} ::: ${pyarrow_versions[*]};
} | tee results.csv
To clarify, I'm trying to read a single partition from a partitioned dataset in S3. The partition directory contains a single parquet file with 120 rows.
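For reference, a single-partition read of that shape looks like the sketch below; the path is a placeholder modeled on the example further down, not the reporter's real bucket:
import pandas as pd

# Hypothetical partition path; the real bucket and prefix are not in the report.
df = pd.read_parquet("s3://my-bucket/dataset/foo=a/")
print(len(df))  # the partition in question holds 120 rows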
I've found out the problem. Some process in my pipeline (which I haven't been able to pinpoint yet) is writing the partitioned parquet files together with an empty object at the dataset prefix. Here is a reproducible example:
import boto3
import pandas as pd
df = pd.DataFrame({"foo": ["a", "b"], "bar": [1,2]})
df.to_parquet("s3://my-bucket/dataset/", partition_cols=["foo"])
pd.read_parquet("s3://my-bucket/dataset/") # Works
# Now create an empty object with the dataset name:
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="dataset/", Body=b"")
pd.read_parquet("s3://my-bucket/dataset/") # Throws
"""
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[6], line 1
----> 1 pd.read_parquet("s3://my-bucket/dataset/")
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pandas/io/parquet.py:509, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, **kwargs)
506 use_nullable_dtypes = False
507 check_dtype_backend(dtype_backend)
--> 509 return impl.read(
510 path,
511 columns=columns,
512 storage_options=storage_options,
513 use_nullable_dtypes=use_nullable_dtypes,
514 dtype_backend=dtype_backend,
515 **kwargs,
516 )
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pandas/io/parquet.py:227, in PyArrowImpl.read(self, path, columns, use_nullable_dtypes, dtype_backend, storage_options, **kwargs)
220 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
221 path,
222 kwargs.pop("filesystem", None),
223 storage_options=storage_options,
224 mode="rb",
225 )
226 try:
--> 227 pa_table = self.api.parquet.read_table(
228 path_or_handle, columns=columns, **kwargs
229 )
230 result = pa_table.to_pandas(**to_pandas_kwargs)
232 if manager == "array":
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/parquet/core.py:2955, in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
2948 raise ValueError(
2949 "The 'metadata' keyword is no longer supported with the new "
2950 "datasets-based implementation. Specify "
2951 "'use_legacy_dataset=True' to temporarily recover the old "
2952 "behaviour."
2953 )
2954 try:
-> 2955 dataset = _ParquetDatasetV2(
2956 source,
2957 schema=schema,
2958 filesystem=filesystem,
2959 partitioning=partitioning,
2960 memory_map=memory_map,
2961 read_dictionary=read_dictionary,
2962 buffer_size=buffer_size,
2963 filters=filters,
2964 ignore_prefixes=ignore_prefixes,
2965 pre_buffer=pre_buffer,
2966 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
2967 thrift_string_size_limit=thrift_string_size_limit,
2968 thrift_container_size_limit=thrift_container_size_limit,
2969 )
2970 except ImportError:
2971 # fall back on ParquetFile for simple cases when pyarrow.dataset
2972 # module is not available
2973 if filters is not None:
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/parquet/core.py:2506, in _ParquetDatasetV2.__init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
2502 if partitioning == "hive":
2503 partitioning = ds.HivePartitioning.discover(
2504 infer_dictionary=True)
-> 2506 self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
2507 schema=schema, format=parquet_format,
2508 partitioning=partitioning,
2509 ignore_prefixes=ignore_prefixes)
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/dataset.py:773, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
762 kwargs = dict(
763 schema=schema,
764 filesystem=filesystem,
(...)
769 selector_ignore_prefixes=ignore_prefixes
770 )
772 if _is_path_like(source):
--> 773 return _filesystem_dataset(source, **kwargs)
774 elif isinstance(source, (tuple, list)):
775 if all(_is_path_like(elem) for elem in source):
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/dataset.py:466, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
458 options = FileSystemFactoryOptions(
459 partitioning=partitioning,
460 partition_base_dir=partition_base_dir,
461 exclude_invalid_files=exclude_invalid_files,
462 selector_ignore_prefixes=selector_ignore_prefixes
463 )
464 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
--> 466 return factory.finish(schema)
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/_dataset.pyx:2941, in pyarrow._dataset.DatasetFactory.finish()
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/_fs.pyx:1551, in pyarrow._fs._cb_open_input_file()
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/fs.py:424, in FSSpecHandler.open_input_file(self, path)
421 from pyarrow import PythonFile
423 if not self.fs.isfile(path):
--> 424 raise FileNotFoundError(path)
426 return PythonFile(self.fs.open(path, mode="rb"), mode="r")
FileNotFoundError: my-bucket/dataset/
""" Maybe this "directory" schema could be supported, but I guess this isn't a Pandas issue anymore. I will close the issue. |
@ianliu For the sake of other people finding this issue (me!), can you say whether you found a fix?
@jamesnunn this is a messy interaction between several libraries. In short: up to pandas 2.0.x, read_parquet on an s3:// URL went through s3fs, while pandas 2.1.* prefers pyarrow's own S3 filesystem when no filesystem is specified in the call. So you can imagine what happens when you update pandas from 2.0.3 to 2.1.*.
My rule of thumb is: always use pyarrow, ditch s3fs. But this adds an inconvenience when testing locally with an AWS SSO session. So what I did was a wrapper script like so:
#!/usr/bin/env python3
# save as "awscred" in your path and chmod +x
# now you can execute your pandas script with "awscred python foo.py args"
from subprocess import run
import sys, os, boto3
session = boto3.Session()
cred = session.get_credentials()
os.environ["AWS_ACCESS_KEY_ID"] = cred.access_key
os.environ["AWS_SECRET_ACCESS_KEY"] = cred.secret_key
os.environ["AWS_SESSION_TOKEN"] = cred.token
run(sys.argv[1:])
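An in-process alternative to the wrapper, not from the original comment: a sketch that fetches credentials from the boto3 session (e.g. an SSO profile) and hands them to pyarrow's S3 filesystem, passing it through read_parquet's filesystem keyword (visible in the traceback above). The path and region below are placeholders.
import boto3
import pandas as pd
from pyarrow import fs

# Resolve credentials via boto3 (which understands SSO profiles) and build
# pyarrow's native S3 filesystem from them explicitly.
cred = boto3.Session().get_credentials().get_frozen_credentials()
s3 = fs.S3FileSystem(
    access_key=cred.access_key,
    secret_key=cred.secret_key,
    session_token=cred.token,
    region="us-east-1",  # assumption: set to your bucket's region
)

# With an explicit filesystem, the path is given without the "s3://" scheme.
df = pd.read_parquet("my-bucket/some/dataset", filesystem=s3)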
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
There is a very brittle interaction between the Pandas version and Pyarrow version when reading a parquet dataset from S3. Here is a matrix of tests:
Expected Behavior
I expect latest pandas to work with latest pyarrow and s3fs.
Installed Versions
Pandas 2.1.1 and Pyarrow 13.0.0 (NOT WORKING EXAMPLE)
Pandas 2.1.1 and Pyarrow 9.0.0 (WORKING EXAMPLE)
commit : e86ed37
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-82-generic
Version : #91~20.04.1-Ubuntu SMP Fri Aug 18 16:24:39 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_BR.UTF-8
pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.9.1
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.9.1
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None