
BUG: pandas, s3fs, and pyarrow raise OSError or FileNotFoundError when reading from S3 #55230

Closed · 2 of 3 tasks

ianliu opened this issue Sep 21, 2023 · 5 comments

Labels: Arrow (pyarrow functionality), Bug, IO Parquet (parquet, feather), Needs Triage (issue that has not been reviewed by a pandas team member)

ianliu commented Sep 21, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# install pandas==2.1.1 pyarrow==13.0.0 s3fs

import pandas as pd
pd.read_parquet("s3://my-bucket/data=2000-01-01/") # Raises FileNotFoundError or OSError

Issue Description

There is a very brittle interaction between the Pandas and Pyarrow versions when reading a Parquet dataset from S3. Here is a matrix of tests:

Pandas/Pyarrow    8.0.0               9.0.0               10.0.1              11.0.0              12.0.1              13.0.0
1.5.3             FileNotFoundError   FileNotFoundError   FileNotFoundError   FileNotFoundError   FileNotFoundError   FileNotFoundError
2.0.3             FileNotFoundError   FileNotFoundError   FileNotFoundError   FileNotFoundError   FileNotFoundError   FileNotFoundError
2.1.1             OSError             OK                  OSError             OSError             OSError             OSError

Expected Behavior

I expect the latest pandas to work with the latest pyarrow and s3fs.

Installed Versions

Pandas 2.1.1 and Pyarrow 13.0.0 (NOT WORKING EXAMPLE)

commit              : e86ed377639948c64c429059127bcf5b359ab6be
python              : 3.10.12.final.0
python-bits         : 64
OS                  : Linux
OS-release          : 5.15.0-82-generic
Version             : #91~20.04.1-Ubuntu SMP Fri Aug 18 16:24:39 UTC 2023
machine             : x86_64
processor           :
byteorder           : little
LC_ALL              : None
LANG                : en_US.UTF-8
LOCALE              : pt_BR.UTF-8

pandas              : 2.1.1
numpy               : 1.26.0
pytz                : 2023.3.post1
dateutil            : 2.8.2
setuptools          : 65.5.0
pip                 : 23.0.1
Cython              : None
pytest              : None
hypothesis          : None
sphinx              : None
blosc               : None
feather             : None
xlsxwriter          : None
lxml.etree          : None
html5lib            : None
pymysql             : None
psycopg2            : None
jinja2              : None
IPython             : None
pandas_datareader   : None
bs4                 : None
bottleneck          : None
dataframe-api-compat: None
fastparquet         : None
fsspec              : 2023.9.1
gcsfs               : None
matplotlib          : None
numba               : None
numexpr             : None
odfpy               : None
openpyxl            : None
pandas_gbq          : None
pyarrow             : 13.0.0
pyreadstat          : None
pyxlsb              : None
s3fs                : 2023.9.1
scipy               : None
sqlalchemy          : None
tables              : None
tabulate            : None
xarray              : None
xlrd                : None
zstandard           : None
tzdata              : 2023.3
qtpy                : None
pyqt5               : None

Pandas 2.1.1 and Pyarrow 9.0.0 (WORKING EXAMPLE)

commit : e86ed37
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-82-generic
Version : #91~20.04.1-Ubuntu SMP Fri Aug 18 16:24:39 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_BR.UTF-8

pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.9.1
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.9.1
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

ianliu added the Bug and Needs Triage labels on Sep 21, 2023

ianliu commented Sep 21, 2023

Here are the scripts I used to make the test matrix:

$ cat run_test.sh
#!/bin/bash
# Usage: ./run_test.sh <pandas_version> <pyarrow_version>
# Creates (or reuses) a dedicated virtualenv for the version pair and prints one CSV line.
pandas_version=$1
pyarrow_version=$2
venv=venv_${pandas_version}_${pyarrow_version}

{ test -d $venv || python3.10 -m venv $venv;
  $venv/bin/pip install "pandas==$pandas_version" "pyarrow==$pyarrow_version" s3fs;
} &> /dev/null

# $S3PATH is expanded into the heredoc, so it must be exported by the caller (see run_all.sh).
cat <<EOF | $venv/bin/python
import pandas as pd
try:
    df = pd.read_parquet("$S3PATH")
    print("$pandas_version", "$pyarrow_version", "OK", len(df), sep=",")
except Exception as e:
    print("$pandas_version", "$pyarrow_version", type(e).__name__, e, sep=",")
EOF

$ cat run_all.sh
#!/bin/bash
# Runs every pandas/pyarrow combination via GNU parallel and collects results.csv.
pyarrow_versions=(13.0.0 12.0.1 11.0.0 10.0.1 9.0.0 8.0.0)
pandas_versions=(2.1.1 2.0.3 1.5.3)
export S3PATH  # must already be set to the dataset path in the calling environment
{ echo "pandas_version,pyarrow_version,result,msg";
  parallel ./run_test.sh ::: ${pandas_versions[*]} ::: ${pyarrow_versions[*]};
} | tee results.csv
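
For reference, a possible invocation (the S3 path below is a placeholder, and GNU parallel must be installed):

$ chmod +x run_test.sh run_all.sh
$ S3PATH="s3://my-bucket/data=2000-01-01/" ./run_all.sh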


ianliu commented Sep 21, 2023

To clarify, I'm trying to read a single partition from a partitioned dataset in S3. The partition directory contains a single Parquet file with 120 rows.

rhshadrach added the IO Parquet and Arrow labels on Sep 22, 2023

ianliu commented Sep 25, 2023

I've found the problem. Some process in my pipeline (which I haven't pinpointed yet) is creating partitioned Parquet files in the following way:

  • Creates an S3 object with the partition name s3://my-bucket/dataset/date=2000-01-01/
  • Creates the parquet file s3://my-bucket/dataset/date=2000-01-01/data.parquet

This is a reproducible example:

import boto3
import pandas as pd

df = pd.DataFrame({"foo": ["a", "b"], "bar": [1,2]})
df.to_parquet("s3://my-bucket/dataset/", partition_cols=["foo"])

pd.read_parquet("s3://my-bucket/dataset/") # Works

# Now create an empty object with the dataset name:
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="dataset/", Body=b"")

pd.read_parquet("s3://my-bucket/dataset/") # Throws
"""
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 pd.read_parquet("s3://my-bucket/dataset/")

File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pandas/io/parquet.py:509, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, **kwargs)
    506     use_nullable_dtypes = False
    507 check_dtype_backend(dtype_backend)
--> 509 return impl.read(
    510     path,
    511     columns=columns,
    512     storage_options=storage_options,
    513     use_nullable_dtypes=use_nullable_dtypes,
    514     dtype_backend=dtype_backend,
    515     **kwargs,
    516 )

File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pandas/io/parquet.py:227, in PyArrowImpl.read(self, path, columns, use_nullable_dtypes, dtype_backend, storage_options, **kwargs)
    220 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
    221     path,
    222     kwargs.pop("filesystem", None),
    223     storage_options=storage_options,
    224     mode="rb",
    225 )
    226 try:
--> 227     pa_table = self.api.parquet.read_table(
    228         path_or_handle, columns=columns, **kwargs
    229     )
    230     result = pa_table.to_pandas(**to_pandas_kwargs)
    232     if manager == "array":

File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/parquet/core.py:2955, in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
   2948     raise ValueError(
   2949         "The 'metadata' keyword is no longer supported with the new "
   2950         "datasets-based implementation. Specify "
   2951         "'use_legacy_dataset=True' to temporarily recover the old "
   2952         "behaviour."
   2953     )
   2954 try:
-> 2955     dataset = _ParquetDatasetV2(
   2956         source,
   2957         schema=schema,
   2958         filesystem=filesystem,
   2959         partitioning=partitioning,
   2960         memory_map=memory_map,
   2961         read_dictionary=read_dictionary,
   2962         buffer_size=buffer_size,
   2963         filters=filters,
   2964         ignore_prefixes=ignore_prefixes,
   2965         pre_buffer=pre_buffer,
   2966         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
   2967         thrift_string_size_limit=thrift_string_size_limit,
   2968         thrift_container_size_limit=thrift_container_size_limit,
   2969     )
   2970 except ImportError:
   2971     # fall back on ParquetFile for simple cases when pyarrow.dataset
   2972     # module is not available
   2973     if filters is not None:

File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/parquet/core.py:2506, in _ParquetDatasetV2.__init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
   2502 if partitioning == "hive":
   2503     partitioning = ds.HivePartitioning.discover(
   2504         infer_dictionary=True)
-> 2506 self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
   2507                            schema=schema, format=parquet_format,
   2508                            partitioning=partitioning,
   2509                            ignore_prefixes=ignore_prefixes)

File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/dataset.py:773, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    762 kwargs = dict(
    763     schema=schema,
    764     filesystem=filesystem,
   (...)
    769     selector_ignore_prefixes=ignore_prefixes
    770 )
    772 if _is_path_like(source):
--> 773     return _filesystem_dataset(source, **kwargs)
    774 elif isinstance(source, (tuple, list)):
    775     if all(_is_path_like(elem) for elem in source):

File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/dataset.py:466, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    458 options = FileSystemFactoryOptions(
    459     partitioning=partitioning,
    460     partition_base_dir=partition_base_dir,
    461     exclude_invalid_files=exclude_invalid_files,
    462     selector_ignore_prefixes=selector_ignore_prefixes
    463 )
    464 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
--> 466 return factory.finish(schema)

File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/_dataset.pyx:2941, in pyarrow._dataset.DatasetFactory.finish()

File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/_fs.pyx:1551, in pyarrow._fs._cb_open_input_file()

File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/fs.py:424, in FSSpecHandler.open_input_file(self, path)
    421 from pyarrow import PythonFile
    423 if not self.fs.isfile(path):
--> 424     raise FileNotFoundError(path)
    426 return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: my-bucket/dataset/
"""

Maybe this "directory" marker scheme could be supported, but I guess this isn't a pandas issue anymore. I will close the issue.
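
A possible workaround, sketched against the example above: deleting the zero-byte "directory" object (the object that pyarrow's FSSpecHandler trips over) should make the read work again.

import boto3
import pandas as pd

s3 = boto3.client("s3")
# Remove the empty placeholder object that was created with Key="dataset/"
s3.delete_object(Bucket="my-bucket", Key="dataset/")

pd.read_parquet("s3://my-bucket/dataset/")  # should succeed again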

ianliu closed this as completed on Sep 25, 2023
jamesnunn commented

@ianliu For the sake of other people finding this issue (me!), can you say whether you found a fix?


ianliu commented Aug 27, 2024

@jamesnunn this is a messy interaction between several libraries. Here are some facts:

  1. pyarrow.fs creates "directory" objects in S3 when partitioning datasets.
  2. pyarrow.fs doesn't understand AWS SSO login credentials, so if you are testing with an SSO session, you will get access-denied errors.
  3. pandas+s3fs doesn't create "directory" objects in S3, and as far as I know, reading a dataset that contains those objects will fail.
  4. pandas 2.0.3 is the last version that chooses the s3fs filesystem backend if it is available.
  5. pandas > 2.0.3 chooses pyarrow.fs even if s3fs is available.

So you can imagine what happens when you update pandas from 2.0.3 to 2.1.*: before the upgrade it was using s3fs, and afterwards it uses pyarrow.fs if no filesystem is specified in the to_parquet or read_parquet calls.

My rule of thumb is: always use pyarrow, ditch s3fs.
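
One way to make this explicit (a minimal sketch, not from the discussion above; bucket and region are placeholders) is to pass a filesystem object to read_parquet, so the backend no longer depends on which one pandas picks by default:

import pandas as pd
import pyarrow.fs as pafs

# Force the pyarrow S3 filesystem. Note that with an explicit filesystem
# the path is given without the "s3://" scheme.
fs = pafs.S3FileSystem(region="us-east-1")
df = pd.read_parquet("my-bucket/dataset/", filesystem=fs)

# Or force s3fs instead (pandas >= 2.1 also accepts fsspec filesystems here):
import s3fs
df = pd.read_parquet("my-bucket/dataset/", filesystem=s3fs.S3FileSystem())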

But this adds an inconvenience when testing locally with an AWS SSO session, so I wrote a wrapper script like this:

#!/usr/bin/env python3
# Save as "awscred" somewhere on your PATH and chmod +x.
# Now you can execute your pandas script with "awscred python foo.py args".
# It resolves the current boto3 (e.g. SSO) credentials and re-exports them as the
# static environment variables that pyarrow.fs understands.
from subprocess import run
import sys, os, boto3

cred = boto3.Session().get_credentials()
os.environ["AWS_ACCESS_KEY_ID"] = cred.access_key
os.environ["AWS_SECRET_ACCESS_KEY"] = cred.secret_key
if cred.token:  # only session-based credentials (e.g. SSO) have a token
    os.environ["AWS_SESSION_TOKEN"] = cred.token
run(sys.argv[1:])
