Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wr.s3.read_excel can't read multi-sheet .xls files #739

Closed
johntelforduk opened this issue Jun 8, 2021 · 4 comments · Fixed by #742
Closed

wr.s3.read_excel can't read multi-sheet .xls files #739

johntelforduk opened this issue Jun 8, 2021 · 4 comments · Fixed by #742
Assignees
Labels
bug Something isn't working minor release Will be addressed in the next minor release ready to release
Milestone

Comments

@johntelforduk
Copy link

Describe the bug

I'm using Data Wrangler to process some Excel spreadsheets stored on S3. The spreadsheets are multi-sheet. I've found that wr.s3.read_excel reads .xlsx objects fine. But it fails when reading .xls objects. My testing indicates that the problem with .xls is for multi-sheet spreadsheets only. BTW, my testing also indicates that reading my test multi-sheet .xls files works OK using the basic Pandas pd.read_excel method.

Environment

asn1crypto==1.4.0
astroid==2.3.0
awswrangler==2.8.0
backcall==0.2.0
beautifulsoup4==4.9.3
boto3==1.17.86
botocore==1.20.45
certifi==2021.5.30
chardet==4.0.0
decorator==5.0.5
Django==2.0.2
et-xmlfile==1.1.0
git-remote-codecommit==1.15.1
idna==2.10
ikp3db==1.4.1
importlib-metadata==3.10.0
ipython==7.22.0
ipython-genutils==0.2.0
isort==4.3.21
jedi==0.18.0
jmespath==0.10.0
lazy-object-proxy==1.6.0
lxml==4.6.3
mccabe==0.6.1
numpy==1.20.3
openpyxl==3.0.7
pandas==1.2.4
parso==0.8.2
pbr==5.5.1
pexpect==4.8.0
pg8000==1.19.5
pickleshare==0.7.5
prompt-toolkit==3.0.18
ptyprocess==0.7.0
pyarrow==4.0.1
Pygments==2.8.1
pylint==2.4.4
pylint-django==2.3.0
pylint-flask==0.6
pylint-plugin-utils==0.6
PyMySQL==1.0.2
python-dateutil==2.8.1
pytz==2021.1
redshift-connector==2.0.881
requests==2.25.1
s3transfer==0.4.2
scramp==1.4.0
six==1.15.0
soupsieve==2.2.1
stevedore==3.3.0
traitlets==5.0.5
typed-ast==1.4.2
typing-extensions==3.7.4.3
urllib3==1.26.4
virtualenv==16.2.0
virtualenv-clone==0.5.4
virtualenvwrapper==4.8.4
wcwidth==0.2.5
wrapt==1.12.1
xlrd==2.0.1
zipp==3.4.1

To Reproduce

  1. In Excel create a new spreadsheet. Add a second tab to it called 'Sheet2'. Type a small table of rows and columns into that sheet.
  2. Save the spreadsheet as demo_excel_multi_sheet.xlsx.
  3. Save the spreadsheet a second time as demo_excel_multi_sheet.xls.
  4. Upload copies of both spreadsheets to a test S3 bucket.
  5. Run the following program using Python 3,
import awswrangler as wr
import pandas as pd
import boto3
import io

data_bucket = <Name of your bucket>
xlsx_object = 'demo_excel_multi_sheet.xlsx'
xls_object = 'demo_excel_multi_sheet.xls'

# XLSX on S3 using Data Wrangler works.
xlsx_s3_df = wr.s3.read_excel('s3://' + data_bucket + '/' + xlsx_object, sheet_name='Sheet2')
print(xlsx_s3_df.head())

# XLS locally also works.
xls_local_df = pd.read_excel(xls_object, sheet_name='Sheet2')
print(xls_local_df.head())

# XLS on S3 using Boto3 works.
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=data_bucket, Key=xls_object)
data = obj['Body'].read()
df_boto = pd.read_excel(io.BytesIO(data), sheet_name='Sheet2')
print(df_boto.head())

# XLS on S3 using Data Wrangler doesn't work.
xls_s3_df = wr.s3.read_excel('s3://' + data_bucket + '/' + xls_object, sheet_name='Sheet2')
print(xls_s3_df.head())

Notice that the first 3 print(xyz.head()) statements succeed, but the 4th one is never reached (error message on previous line of code).

P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.

@johntelforduk johntelforduk added the bug Something isn't working label Jun 8, 2021
@jaidisido
Copy link
Contributor

Thanks for raising this @johntelforduk. The culprit is this guy. Removing that line solves it.

From the pandas docs, it seems that the openpyxl engine cannot handle xls. I believe we should simply rely on Pandas to infer the right engine but will try to understand if there are any caveats first and then will open a PR for a fix.

@johntelforduk
Copy link
Author

Hi @jaidisido,
Thanks for investigating this.
I commented out the pandas_kwargs["engine"] = "openpyxl" line that you mentioned in my dev environment. I'm pleased to say that it fixed the bug for my test spreadsheet.
Let me know if I can assist in any way; testing the completed fix, etc.
Regards, John T.

@jaidisido
Copy link
Contributor

Thanks, it has now been merged to main and can be tested:

pip uninstall awswrangler -y
pip install git+https://github.com/awslabs/aws-data-wrangler.git@main

until it's available in the next release.

@jaidisido jaidisido added minor release Will be addressed in the next minor release ready to release labels Jun 11, 2021
@jaidisido jaidisido self-assigned this Jun 11, 2021
@jaidisido jaidisido added this to the 2.9.0 milestone Jun 11, 2021
@jaidisido jaidisido linked a pull request Jun 11, 2021 that will close this issue
@johntelforduk
Copy link
Author

My testing indicates that this is fixed. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working minor release Will be addressed in the next minor release ready to release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants