BUG: pandas, s3fs, and pyarrow raise OSError or FileNotFoundError when reading from S3 #55230
Comments
Here are the scripts I used to make the test matrix:
$ cat run_test.sh
#!/bin/bash
pandas_version=$1
pyarrow_version=$2
venv=venv_${pandas_version}_${pyarrow_version}
{ test -d $venv || python3.10 -m venv $venv;
$venv/bin/pip install "pandas==$pandas_version" "pyarrow==$pyarrow_version" s3fs;
} &> /dev/null
cat <<EOF | $venv/bin/python
import pandas as pd
try:
    df = pd.read_parquet("$S3PATH")
    print("$pandas_version", "$pyarrow_version", "OK", len(df), sep=",")
except Exception as e:
    print("$pandas_version", "$pyarrow_version", type(e).__name__, e, sep=",")
EOF
$ cat run_all.sh
#!/bin/bash
pyarrow_versions=(13.0.0 12.0.1 11.0.0 10.0.1 9.0.0 8.0.0)
pandas_versions=(2.1.1 2.0.3 1.5.3)
export S3PATH
{ echo "pandas_version,pyarrow_version,result,msg";
parallel ./run_test.sh ::: ${pandas_versions[*]} ::: ${pyarrow_versions[*]};
} | tee results.csv
To clarify, I'm trying to read a single partition from a partitioned dataset in S3. The partition directory contains a single parquet file with 120 rows.
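For reference, a single-partition read of that shape looks like the sketch below; the path is a placeholder modeled on the example further down, not the reporter's real bucket:
import pandas as pd

# Hypothetical partition path; the real bucket and prefix are not in the report.
df = pd.read_parquet("s3://my-bucket/dataset/foo=a/")
print(len(df))  # the partition in question holds 120 rows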
I've found out the problem. Some process in my pipeline (which I haven't been able to pinpoint yet) is writing the partitioned parquet files together with an empty object at the dataset prefix. Here is a reproducible example:
import boto3
import pandas as pd
df = pd.DataFrame({"foo": ["a", "b"], "bar": [1,2]})
df.to_parquet("s3://my-bucket/dataset/", partition_cols=["foo"])
pd.read_parquet("s3://my-bucket/dataset/") # Works
# Now create an empty object with the dataset name:
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="dataset/", Body=b"")
pd.read_parquet("s3://my-bucket/dataset/") # Throws
"""
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[6], line 1
----> 1 pd.read_parquet("s3://my-bucket/dataset/")
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pandas/io/parquet.py:509, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, **kwargs)
506 use_nullable_dtypes = False
507 check_dtype_backend(dtype_backend)
--> 509 return impl.read(
510 path,
511 columns=columns,
512 storage_options=storage_options,
513 use_nullable_dtypes=use_nullable_dtypes,
514 dtype_backend=dtype_backend,
515 **kwargs,
516 )
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pandas/io/parquet.py:227, in PyArrowImpl.read(self, path, columns, use_nullable_dtypes, dtype_backend, storage_options, **kwargs)
220 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
221 path,
222 kwargs.pop("filesystem", None),
223 storage_options=storage_options,
224 mode="rb",
225 )
226 try:
--> 227 pa_table = self.api.parquet.read_table(
228 path_or_handle, columns=columns, **kwargs
229 )
230 result = pa_table.to_pandas(**to_pandas_kwargs)
232 if manager == "array":
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/parquet/core.py:2955, in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit)
2948 raise ValueError(
2949 "The 'metadata' keyword is no longer supported with the new "
2950 "datasets-based implementation. Specify "
2951 "'use_legacy_dataset=True' to temporarily recover the old "
2952 "behaviour."
2953 )
2954 try:
-> 2955 dataset = _ParquetDatasetV2(
2956 source,
2957 schema=schema,
2958 filesystem=filesystem,
2959 partitioning=partitioning,
2960 memory_map=memory_map,
2961 read_dictionary=read_dictionary,
2962 buffer_size=buffer_size,
2963 filters=filters,
2964 ignore_prefixes=ignore_prefixes,
2965 pre_buffer=pre_buffer,
2966 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
2967 thrift_string_size_limit=thrift_string_size_limit,
2968 thrift_container_size_limit=thrift_container_size_limit,
2969 )
2970 except ImportError:
2971 # fall back on ParquetFile for simple cases when pyarrow.dataset
2972 # module is not available
2973 if filters is not None:
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/parquet/core.py:2506, in _ParquetDatasetV2.__init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, **kwargs)
2502 if partitioning == "hive":
2503 partitioning = ds.HivePartitioning.discover(
2504 infer_dictionary=True)
-> 2506 self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
2507 schema=schema, format=parquet_format,
2508 partitioning=partitioning,
2509 ignore_prefixes=ignore_prefixes)
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/dataset.py:773, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
762 kwargs = dict(
763 schema=schema,
764 filesystem=filesystem,
(...)
769 selector_ignore_prefixes=ignore_prefixes
770 )
772 if _is_path_like(source):
--> 773 return _filesystem_dataset(source, **kwargs)
774 elif isinstance(source, (tuple, list)):
775 if all(_is_path_like(elem) for elem in source):
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/dataset.py:466, in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
458 options = FileSystemFactoryOptions(
459 partitioning=partitioning,
460 partition_base_dir=partition_base_dir,
461 exclude_invalid_files=exclude_invalid_files,
462 selector_ignore_prefixes=selector_ignore_prefixes
463 )
464 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
--> 466 return factory.finish(schema)
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/_dataset.pyx:2941, in pyarrow._dataset.DatasetFactory.finish()
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/_fs.pyx:1551, in pyarrow._fs._cb_open_input_file()
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/fs.py:424, in FSSpecHandler.open_input_file(self, path)
421 from pyarrow import PythonFile
423 if not self.fs.isfile(path):
--> 424 raise FileNotFoundError(path)
426 return PythonFile(self.fs.open(path, mode="rb"), mode="r")
FileNotFoundError: my-bucket/dataset/
""" Maybe this "directory" schema could be supported, but I guess this isn't a Pandas issue anymore. I will close the issue. |
@ianliu For the sake of other people finding this issue (me!), can you say whether you found a fix?
@jamesnunn this is a messy interaction between several libraries. In short: up to pandas 2.0.x, read_parquet on an s3:// URL went through s3fs, while pandas 2.1.* prefers pyarrow's own S3 filesystem when no filesystem is specified in the call. So you can imagine what happens when you update pandas from 2.0.3 to 2.1.*.
My rule of thumb is: always use pyarrow, ditch s3fs. But this adds an inconvenience when testing locally with an AWS SSO session. So what I did was a wrapper script like so:
#!/usr/bin/env python3
# save as "awscred" in your path and chmod +x
# now you can execute your pandas script with "awscred python foo.py args"
from subprocess import run
import sys, os, boto3
session = boto3.Session()
cred = session.get_credentials()
os.environ["AWS_ACCESS_KEY_ID"] = cred.access_key
os.environ["AWS_SECRET_ACCESS_KEY"] = cred.secret_key
os.environ["AWS_SESSION_TOKEN"] = cred.token
run(sys.argv[1:])
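An in-process alternative to the wrapper, not from the original comment: a sketch that fetches credentials from the boto3 session (e.g. an SSO profile) and hands them to pyarrow's S3 filesystem, passing it through read_parquet's filesystem keyword (visible in the traceback above). The path and region below are placeholders.
import boto3
import pandas as pd
from pyarrow import fs

# Resolve credentials via boto3 (which understands SSO profiles) and build
# pyarrow's native S3 filesystem from them explicitly.
cred = boto3.Session().get_credentials().get_frozen_credentials()
s3 = fs.S3FileSystem(
    access_key=cred.access_key,
    secret_key=cred.secret_key,
    session_token=cred.token,
    region="us-east-1",  # assumption: set to your bucket's region
)

# With an explicit filesystem, the path is given without the "s3://" scheme.
df = pd.read_parquet("my-bucket/some/dataset", filesystem=s3)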
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
There is a very brittle interaction between the Pandas version and Pyarrow version when reading a parquet dataset from S3. Here is a matrix of tests:
Expected Behavior
I expect latest pandas to work with latest pyarrow and s3fs.
Installed Versions
Pandas 2.1.1 and Pyarrow 13.0.0 (NOT WORKING EXAMPLE)
Pandas 2.1.1 and Pyarrow 9.0.0 (WORKING EXAMPLE)
commit : e86ed37
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-82-generic
Version : #91~20.04.1-Ubuntu SMP Fri Aug 18 16:24:39 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_BR.UTF-8
pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.9.1
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.9.1
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None