wr.s3.read_excel can't read multi-sheet .xls files #739

johntelforduk · 2021-06-08T12:43:43Z

Describe the bug

I'm using Data Wrangler to process some Excel spreadsheets stored on S3. The spreadsheets are multi-sheet. I've found that wr.s3.read_excel reads .xlsx objects fine. But it fails when reading .xls objects. My testing indicates that the problem with .xls is for multi-sheet spreadsheets only. BTW, my testing also indicates that reading my test multi-sheet .xls files works OK using the basic Pandas pd.read_excel method.

Environment

asn1crypto==1.4.0
astroid==2.3.0
awswrangler==2.8.0
backcall==0.2.0
beautifulsoup4==4.9.3
boto3==1.17.86
botocore==1.20.45
certifi==2021.5.30
chardet==4.0.0
decorator==5.0.5
Django==2.0.2
et-xmlfile==1.1.0
git-remote-codecommit==1.15.1
idna==2.10
ikp3db==1.4.1
importlib-metadata==3.10.0
ipython==7.22.0
ipython-genutils==0.2.0
isort==4.3.21
jedi==0.18.0
jmespath==0.10.0
lazy-object-proxy==1.6.0
lxml==4.6.3
mccabe==0.6.1
numpy==1.20.3
openpyxl==3.0.7
pandas==1.2.4
parso==0.8.2
pbr==5.5.1
pexpect==4.8.0
pg8000==1.19.5
pickleshare==0.7.5
prompt-toolkit==3.0.18
ptyprocess==0.7.0
pyarrow==4.0.1
Pygments==2.8.1
pylint==2.4.4
pylint-django==2.3.0
pylint-flask==0.6
pylint-plugin-utils==0.6
PyMySQL==1.0.2
python-dateutil==2.8.1
pytz==2021.1
redshift-connector==2.0.881
requests==2.25.1
s3transfer==0.4.2
scramp==1.4.0
six==1.15.0
soupsieve==2.2.1
stevedore==3.3.0
traitlets==5.0.5
typed-ast==1.4.2
typing-extensions==3.7.4.3
urllib3==1.26.4
virtualenv==16.2.0
virtualenv-clone==0.5.4
virtualenvwrapper==4.8.4
wcwidth==0.2.5
wrapt==1.12.1
xlrd==2.0.1
zipp==3.4.1

To Reproduce

In Excel create a new spreadsheet. Add a second tab to it called 'Sheet2'. Type a small table of rows and columns into that sheet.
Save the spreadsheet as demo_excel_multi_sheet.xlsx.
Save the spreadsheet a second time as demo_excel_multi_sheet.xls.
Upload copies of both spreadsheets to a test S3 bucket.
Run the following program using Python 3,

import awswrangler as wr
import pandas as pd
import boto3
import io

data_bucket = <Name of your bucket>
xlsx_object = 'demo_excel_multi_sheet.xlsx'
xls_object = 'demo_excel_multi_sheet.xls'

# XLSX on S3 using Data Wrangler works.
xlsx_s3_df = wr.s3.read_excel('s3://' + data_bucket + '/' + xlsx_object, sheet_name='Sheet2')
print(xlsx_s3_df.head())

# XLS locally also works.
xls_local_df = pd.read_excel(xls_object, sheet_name='Sheet2')
print(xls_local_df.head())

# XLS on S3 using Boto3 works.
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=data_bucket, Key=xls_object)
data = obj['Body'].read()
df_boto = pd.read_excel(io.BytesIO(data), sheet_name='Sheet2')
print(df_boto.head())

# XLS on S3 using Data Wrangler doesn't work.
xls_s3_df = wr.s3.read_excel('s3://' + data_bucket + '/' + xls_object, sheet_name='Sheet2')
print(xls_s3_df.head())

Notice that the first 3 print(xyz.head()) statements succeed, but the 4th one is never reached (error message on previous line of code).

P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.

The text was updated successfully, but these errors were encountered:

jaidisido · 2021-06-08T16:00:33Z

Thanks for raising this @johntelforduk. The culprit is this guy. Removing that line solves it.

From the pandas docs, it seems that the openpyxl engine cannot handle xls. I believe we should simply rely on Pandas to infer the right engine but will try to understand if there are any caveats first and then will open a PR for a fix.

johntelforduk · 2021-06-10T11:44:54Z

Hi @jaidisido,
Thanks for investigating this.
I commented out the pandas_kwargs["engine"] = "openpyxl" line that you mentioned in my dev environment. I'm pleased to say that it fixed the bug for my test spreadsheet.
Let me know if I can assist in any way; testing the completed fix, etc.
Regards, John T.

jaidisido · 2021-06-11T13:33:55Z

Thanks, it has now been merged to main and can be tested:

pip uninstall awswrangler -y
pip install git+https://github.com/awslabs/aws-data-wrangler.git@main

until it's available in the next release.

johntelforduk · 2021-06-15T12:43:13Z

My testing indicates that this is fixed. Thanks!

johntelforduk added the bug Something isn't working label Jun 8, 2021

jaidisido mentioned this issue Jun 11, 2021

Removing explicit Excel engine configuration #742

Merged

jaidisido added minor release Will be addressed in the next minor release ready to release labels Jun 11, 2021

jaidisido self-assigned this Jun 11, 2021

jaidisido added this to the 2.9.0 milestone Jun 11, 2021

jaidisido linked a pull request Jun 11, 2021 that will close this issue

Removing explicit Excel engine configuration #742

Merged

johntelforduk closed this as completed Jun 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wr.s3.read_excel can't read multi-sheet .xls files #739

wr.s3.read_excel can't read multi-sheet .xls files #739

johntelforduk commented Jun 8, 2021

jaidisido commented Jun 8, 2021

johntelforduk commented Jun 10, 2021

jaidisido commented Jun 11, 2021

johntelforduk commented Jun 15, 2021

wr.s3.read_excel can't read multi-sheet .xls files #739

wr.s3.read_excel can't read multi-sheet .xls files #739

Comments

johntelforduk commented Jun 8, 2021

jaidisido commented Jun 8, 2021

johntelforduk commented Jun 10, 2021

jaidisido commented Jun 11, 2021

johntelforduk commented Jun 15, 2021