Reading files from S3 fails with compressed files #175
Tried this out with different modes:

```python
import pandas as pd
import s3fs

FILE = 's3://bucket/file.csv.gz'

fs = s3fs.S3FileSystem(anon=False)

with fs.open(FILE) as f:  # defaults to mode='rb'
    df1 = pd.read_csv(f, compression='gzip')
df1.info()

with fs.open(FILE, mode='rb') as f:  # same as the default above
    df2 = pd.read_csv(f, compression='gzip')
df2.info()

# Current implementation, fails
with fs.open(FILE, mode='r') as f:
    df3 = pd.read_csv(f, compression='gzip')
df3.info()

# Without compression details, fails
with fs.open(FILE) as f:
    df4 = pd.read_csv(f)
df4.info()
```
@jar-no1 thanks a lot! Very valuable contributions. I'm trying to work on that here and will push a PR in the next few minutes. Your feedback will be very welcome.
Add csv decompression for s3.read_csv #175
Thanks @jar-no1, this issue is resolved on version
File reading has been working since the new version. Thanks!
Describe the bug
If a compressed file is read with s3.read_csv(), awswrangler opens the file through s3fs in text mode ('r') instead of binary mode ('rb'). Python then tries to decode the compressed content as text before Pandas gets a chance to decompress it, and loading fails with e.g.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
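The error is easy to reproduce without S3 at all: gzip output starts with the magic bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 continuation at that position, so any attempt to read the raw compressed bytes as text fails immediately. A minimal local demonstration:

```python
import gzip

# Compress a tiny CSV payload; gzip output always begins with 0x1f 0x8b.
raw = gzip.compress(b"a,b\n1,2\n")
assert raw[:2] == b"\x1f\x8b"

# Decoding the compressed bytes as UTF-8 fails on byte 0x8b at position 1,
# which is exactly the error seen when the file is opened in text mode.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)
```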
To Reproduce
Call s3.read_csv() on a gzip-compressed CSV file in S3.
Used version:
awswrangler==1.0.1
Exactly the same file is readable with pure Pandas read_csv, which also uses s3fs (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-remote-files).
Possible fix
Open the file in binary mode ('rb') instead of text mode ('r'). On s3fs, binary mode even appears to be the internal default and the mode used in the examples (https://s3fs.readthedocs.io/en/latest/#examples).
https://github.com/awslabs/aws-data-wrangler/blob/master/awswrangler/s3.py#L1191-L1193
The same should apply also on the chunksize variant:
https://github.com/awslabs/aws-data-wrangler/blob/master/awswrangler/s3.py#L1178
It would possibly be good to check the same for the write functions as well.
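A minimal sketch of the proposed fix, assuming the file is opened in binary mode and decompression is left to Pandas. An in-memory buffer stands in here for the s3fs file object returned by fs.open(path, mode='rb'):

```python
import gzip
import io

import pandas as pd

# Gzip-compressed CSV bytes, standing in for an object stored in S3.
csv_bytes = gzip.compress(b"col1,col2\n1,2\n3,4\n")

# Binary file object, as fs.open(FILE, mode='rb') would return.
binary_file = io.BytesIO(csv_bytes)

# With a binary handle, Pandas can decompress and parse the content itself.
df = pd.read_csv(binary_file, compression="gzip")
print(df.shape)  # (2, 2)
```

The key point is that the raw bytes never pass through Python's text decoding layer, so the 0x8b byte is never misinterpreted as UTF-8.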