BUG: read_excel throws FileNotFoundError with s3fs objects #38788

likealostcause · 2020-12-29T20:53:12Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample

I get a FileNotFoundError when running the following code:

import pandas as pd
import s3fs
fs = s3fs.S3FileSystem()
with fs.open('s3://bucket_name/filename.xlsx') as f:
    pd.read_excel(f)
    # NOTE: pd.ExcelFile(f) throws same error

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/util/_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 336, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 1062, in __init__
    ext = inspect_excel_format(
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 938, in inspect_excel_format
    with get_handle(
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/io/common.py", line 648, in get_handle
    handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: '<File-like object S3FileSystem, bucket_name/filename.xlsx>'

Problem description

I should be able to read in the File-like object from s3fs when using pd.read_excel or pd.ExcelFile. Pandas 1.1.x allows for this, but it looks like changes to pd.io.common.get_handle in 1.2 have made this impossible. The simple workaround for this is to just use the s3 URI instead of using s3fs to open it first, but to my knowledge, the ability to use read_excel with an s3fs object was not intended to be deprecated in 1.2.

My noob guess on what's going wrong

I'm new to contributing to open source projects, so I don't know exactly how to fix this, but it looks like the issue is that the pd.io.common.get_handle method in 1.2 thinks the s3fs object is a file handle rather than a file-like buffer. To solve this, I would think something similar to the need_text_wrapping boolean option from the get_handle method in 1.1.x needs to be added to 1.2's get_handle in order to tell pandas that the s3fs object needs a TextIOWrapper rather than treating it like a local file handle.

If someone could give me a little guidance on how to fix this, I'd be happy to give my first open-source contribution a go, but if that's not really how this works, I understand.

Expected Output

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.8.0.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-197-generic
Version : #229-Ubuntu SMP Wed Nov 25 11:05:42 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.3.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : 0.5.2
scipy : 1.5.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


twoertwein · 2020-12-29T23:16:24Z

Thank you for your report! Is there a public Excel file on S3 so that I can test it quickly (edit: any public S3 file should be sufficient)? I assume this affects most read_*/to_* functions?

get_handle is supposed to work with strings/file objects/buffers. Your handle seems to be converted to a string at some point (probably something wrong in stringify_path?)
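As a rough illustration of the distinction described above, the sketch below separates path-like inputs from file-like buffers. `classify_source` is a hypothetical helper written for this thread, not pandas' actual `get_handle` logic, and `io.BytesIO` only stands in for an s3fs file object.

```python
import io
import os


def classify_source(source):
    """Hypothetical helper: decide whether an input is a path or a buffer."""
    # Strings and os.PathLike objects name a location that must be opened.
    if isinstance(source, (str, os.PathLike)):
        return "path"
    # Anything exposing read() is already open and should be used as-is.
    if hasattr(source, "read"):
        return "buffer"
    raise TypeError(f"unsupported source type: {type(source).__name__}")


uri_kind = classify_source("s3://bucket_name/filename.xlsx")
buffer_kind = classify_source(io.BytesIO(b"in-memory bytes"))
print(uri_kind, buffer_kind)
```

The point of the sketch is that an already-open object should never reach the branch that opens a path.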

jorisvandenbossche · 2020-12-30T07:59:30Z

@twoertwein you can also test it with the mocked s3 filesystem used in the tests. I can reproduce the error with:

--- a/pandas/tests/io/excel/test_readers.py
+++ b/pandas/tests/io/excel/test_readers.py
@@ -645,7 +645,7 @@ class TestReaders:
         local_table = pd.read_excel("test1" + read_ext)
         tm.assert_frame_equal(url_table, local_table)
 
-    @td.skip_if_not_us_locale
     def test_read_from_s3_url(self, read_ext, s3_resource, s3so):
         # Bucket "pandas-test" created in tests/io/conftest.py
         with open("test1" + read_ext, "rb") as f:
@@ -657,6 +657,21 @@ class TestReaders:
         local_table = pd.read_excel("test1" + read_ext)
         tm.assert_frame_equal(url_table, local_table)
 
+    def test_read_from_s3fs_object(self, read_ext, s3_resource, s3so):
+        # Bucket "pandas-test" created in tests/io/conftest.py
+        with open("test1" + read_ext, "rb") as f:
+            s3_resource.Bucket("pandas-test").put_object(Key="test1" + read_ext, Body=f)
+
+        import s3fs
+        s3 = s3fs.S3FileSystem(**s3so)
+
+        with s3.open("s3://pandas-test/test1" + read_ext) as f:
+            url_table = pd.read_excel(f)
+
+        local_table = pd.read_excel("test1" + read_ext)
+        tm.assert_frame_equal(url_table, local_table)
+

test_read_from_s3_url passes for me locally, but the new test_read_from_s3fs_object fails.

twoertwein · 2020-12-30T15:38:47Z

The issue is this line:

pandas/pandas/io/excel/_base.py, line 1063 in e85d078

path=str(self._io), storage_options=storage_options

The file object is converted to a string there before being passed on, so this issue should be limited to Excel.
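A minimal, pandas-free sketch of why a `str()` call on a file object leads to the kind of error shown in the traceback. `open_via_str` is a hypothetical stand-in for that code path, not pandas code, and `io.BytesIO` stands in for the s3fs file object.

```python
import io


def open_via_str(path_or_buffer):
    """Hypothetical stand-in: stringify the input, then treat it as a local path."""
    # str() on an open file object yields its repr, not a usable filename.
    return open(str(path_or_buffer), "rb")


buffer = io.BytesIO(b"not a real workbook")
error = None
try:
    open_via_str(buffer)
except OSError as exc:  # FileNotFoundError is a subclass of OSError
    error = exc

print(type(error).__name__, error)
```

The error message contains the object's repr as the "filename", which matches the `'<File-like object S3FileSystem, bucket_name/filename.xlsx>'` text in the original report.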

likealostcause added the Bug and Needs Triage labels Dec 29, 2020

jorisvandenbossche added the IO Data and Regression labels and removed the Bug and Needs Triage labels Dec 30, 2020

jorisvandenbossche added this to the 1.2.1 milestone Dec 30, 2020

twoertwein added the IO Excel label and removed the IO Data label Dec 30, 2020

twoertwein mentioned this issue Dec 30, 2020

REGR: read_excel does not work for most file handles #38819

Merged


jreback closed this as completed in #38819 Dec 30, 2020

ggold7046 mentioned this issue Aug 10, 2023

Modified doc/make.py to run sphinx-build -b linkcheck #54265

Merged
