BUG: read_excel throws FileNotFoundError with s3fs objects #38788

Closed
3 tasks done
likealostcause opened this issue Dec 29, 2020 · 3 comments · Fixed by #38819
Labels: IO Excel (read_excel, to_excel), Regression (functionality that used to work in a prior pandas version)
@likealostcause

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

I get a FileNotFoundError when running the following code:

import pandas as pd
import s3fs
fs = s3fs.S3FileSystem()
with fs.open('s3://bucket_name/filename.xlsx') as f:
    pd.read_excel(f)
    # NOTE: pd.ExcelFile(f) throws same error

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/util/_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 336, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 1062, in __init__
    ext = inspect_excel_format(
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/io/excel/_base.py", line 938, in inspect_excel_format
    with get_handle(
  File "/home/airflow/GVIP/venv/lib/python3.8/site-packages/pandas/io/common.py", line 648, in get_handle
    handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: '<File-like object S3FileSystem, bucket_name/filename.xlsx>'

Problem description

I should be able to pass the file-like object returned by s3fs to pd.read_excel or pd.ExcelFile. pandas 1.1.x allows this, but changes to pd.io.common.get_handle in 1.2 appear to have broken it. A simple workaround is to pass the S3 URI directly instead of opening the file with s3fs first, but to my knowledge, using read_excel with an s3fs object was not meant to be deprecated in 1.2.

My noob guess on what's going wrong

I'm new to contributing to open source projects, so I don't know exactly how to fix this, but it looks like the pd.io.common.get_handle function in 1.2 thinks the s3fs object is a file handle rather than a file-like buffer. To solve this, I would think something similar to the need_text_wrapping boolean from get_handle in 1.1.x needs to be added to 1.2's get_handle, to tell pandas that the s3fs object needs a TextIOWrapper rather than being treated like a local file handle.

If someone could give me a little guidance on how to fix this, I'd be happy to give my first open-source contribution a go, but if that's not really how this works, I understand.

Expected Output

<class 'pandas.core.frame.DataFrame'>

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.8.0.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-197-generic
Version : #229-Ubuntu SMP Wed Nov 25 11:05:42 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.3.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : 0.5.2
scipy : 1.5.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

likealostcause added the Bug and Needs Triage labels Dec 29, 2020
@twoertwein
Member

twoertwein commented Dec 29, 2020

Thank you for your report! Is there a public Excel file on S3 that I can test with quickly (edit: any public S3 file should be sufficient)? I assume this affects most read_*/to_* functions?

get_handle is supposed to work with strings/file objects/buffers. Your handle seems to be converted to a string at some point (probably something wrong in stringify_path?)
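To make the distinction between those input kinds concrete, here is a minimal sketch of duck-typed dispatch between paths and buffers. It is not pandas' actual get_handle logic: classify_input and RemoteFile are hypothetical names, and RemoteFile only stands in for a remote file object like the one in the report.

```python
import io
import os


class RemoteFile(io.IOBase):
    """Hypothetical stand-in for a remote file object (not the real s3fs class)."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    def readable(self):
        return True

    def read(self, size=-1):
        return self._buf.read(size)


def classify_input(obj):
    """Return 'path' for path-like inputs and 'buffer' for objects with read()/write()."""
    if isinstance(obj, (str, os.PathLike)):
        return "path"
    if hasattr(obj, "read") or hasattr(obj, "write"):
        return "buffer"
    raise TypeError(f"unsupported input type: {type(obj).__name__}")


kinds = [
    classify_input("test1.xlsx"),             # a plain string is a path
    classify_input(io.BytesIO(b"payload")),   # an in-memory buffer
    classify_input(RemoteFile(b"payload")),   # a custom file object
]
print(kinds)
```

The point of the sketch is that a file object should be recognized by what it can do (read or write), so it must reach this check as an object and not as its string form.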

jorisvandenbossche added the IO Data and Regression labels and removed the Bug and Needs Triage labels Dec 30, 2020
jorisvandenbossche added this to the 1.2.1 milestone Dec 30, 2020
@jorisvandenbossche
Member

jorisvandenbossche commented Dec 30, 2020

@twoertwein you can also test it with the mocked s3 filesystem used in the tests. I can reproduce the error with:

--- a/pandas/tests/io/excel/test_readers.py
+++ b/pandas/tests/io/excel/test_readers.py
@@ -645,7 +645,7 @@ class TestReaders:
         local_table = pd.read_excel("test1" + read_ext)
         tm.assert_frame_equal(url_table, local_table)
 
-    @td.skip_if_not_us_locale
     def test_read_from_s3_url(self, read_ext, s3_resource, s3so):
         # Bucket "pandas-test" created in tests/io/conftest.py
         with open("test1" + read_ext, "rb") as f:
@@ -657,6 +657,21 @@ class TestReaders:
         local_table = pd.read_excel("test1" + read_ext)
         tm.assert_frame_equal(url_table, local_table)
 
+    def test_read_from_s3fs_object(self, read_ext, s3_resource, s3so):
+        # Bucket "pandas-test" created in tests/io/conftest.py
+        with open("test1" + read_ext, "rb") as f:
+            s3_resource.Bucket("pandas-test").put_object(Key="test1" + read_ext, Body=f)
+
+        import s3fs
+        s3 = s3fs.S3FileSystem(**s3so)
+
+        with s3.open("s3://pandas-test/test1" + read_ext) as f:
+            url_table = pd.read_excel(f)
+
+        local_table = pd.read_excel("test1" + read_ext)
+        tm.assert_frame_equal(url_table, local_table)
+

test_read_from_s3_url passes for me locally, but the new test_read_from_s3fs_object fails.

@twoertwein
Member

The issue is:

path=str(self._io), storage_options=storage_options

This issue should be limited to Excel.
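As a rough illustration of why stringifying a file object leads to the reported error, here is a standard-library-only sketch. FakeS3File and open_via_str are hypothetical: FakeS3File only imitates the string form seen in the traceback, and open_via_str is a simplification of the pattern, not the pandas code path.

```python
import io


class FakeS3File(io.BytesIO):
    """Hypothetical stand-in that mimics the string form shown in the traceback."""

    def __init__(self, data: bytes, path: str):
        super().__init__(data)
        self.path = path

    def __str__(self):
        return f"<File-like object S3FileSystem, {self.path}>"


def open_via_str(obj):
    # Problematic pattern: the object is turned into a string first,
    # so the built-in open() receives a "path" that cannot exist.
    return open(str(obj), "rb")


fake = FakeS3File(b"not a real workbook", "bucket_name/filename.xlsx")

error = None
try:
    open_via_str(fake)
except FileNotFoundError as exc:
    error = exc
    print(exc)

# The object itself is still perfectly readable when used as a buffer.
data = fake.read()
```

Once the object has been converted with str(), the information that it was readable is lost, which is consistent with open() being called on the repr-like string in the traceback.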

twoertwein added the IO Excel label and removed the IO Data label Dec 30, 2020