
Reading files from S3 fails with compressed files #175

Closed
jar-no1 opened this issue Apr 14, 2020 · 4 comments
Labels: bug (Something isn't working)

Comments

jar-no1 commented Apr 14, 2020

Describe the bug
When a compressed file is read with s3.read_csv(), the underlying s3fs file is opened in text mode, because the s3fs open mode 'r' is used instead of binary mode 'rb'. As a result, Python tries to decode the raw compressed content as text before Pandas ever parses it.

Loading fails with, e.g.:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

...
  File ".../lib/python3.6/site-packages/awswrangler/s3.py", line 1192, in _read_text_full
    return parser_func(f, **pandas_args)
  File ".../lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File ".../lib/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File ".../lib/python3.6/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File ".../lib/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File ".../lib/python3.6/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 529, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 720, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 916, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2063, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
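
For context, 0x8b is the second byte of the gzip magic header (0x1f 0x8b), so decoding the raw compressed stream as UTF-8 fails on the first read. A minimal local sketch of the same failure, using a hypothetical temporary file instead of S3:

import gzip

# A gzip stream starts with the magic bytes 0x1f 0x8b; 0x8b is not valid UTF-8,
# so reading the raw file in text mode fails immediately.
with gzip.open('/tmp/sample.csv.gz', 'wt') as f:  # hypothetical local path
    f.write('a,b\n1,2\n')

with open('/tmp/sample.csv.gz', 'r') as f:  # text mode, like fs.open(path, "r")
    f.read()  # raises UnicodeDecodeError: ... can't decode byte 0x8b in position 1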

To Reproduce

  • Try to read any compressed (non-plain-text) file with s3.read_csv():
import awswrangler
df = awswrangler.s3.read_csv('s3://bucket/file.csv.gz')

Version used:
awswrangler==1.0.1

Exactly the same file is readable with plain Pandas read_csv, which also uses s3fs under the hood (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-remote-files):

import pandas
df = pandas.read_csv('s3://bucket/file.csv.gz')

Possible fix

Open the file in binary mode 'rb' instead of text mode 'r'. Binary mode appears to be the internal default in s3fs and is what its examples use (https://s3fs.readthedocs.io/en/latest/#examples).
https://github.com/awslabs/aws-data-wrangler/blob/master/awswrangler/s3.py#L1191-L1193

    fs: s3fs.S3FileSystem = _utils.get_fs(session=boto3_session, s3_additional_kwargs=s3_additional_kwargs)
    with fs.open(path, "rb") as f:  # was: fs.open(path, "r")
        return parser_func(f, **pandas_args)

The same change should also be applied to the chunksize variant:
https://github.com/awslabs/aws-data-wrangler/blob/master/awswrangler/s3.py#L1178
It would probably also be worth checking the write functions for the same issue. A sketch of the chunked variant with the change applied follows.
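
A sketch of what the chunked variant could look like with the same change applied (the helper name and signature here are hypothetical, not the library's actual internals):

import pandas as pd
import s3fs

def read_text_chunked(path, parser_func, pandas_args):
    # Hypothetical chunked reader mirroring the proposed fix: open in binary
    # mode and let pandas return an iterator of DataFrames via chunksize.
    fs = s3fs.S3FileSystem()  # the library builds this via _utils.get_fs()
    with fs.open(path, 'rb') as f:  # "rb" instead of "r"
        for chunk in parser_func(f, **pandas_args):
            yield chunk

# Compression has to be passed explicitly, since pandas cannot infer it from a
# file object (see the follow-up comment below).
for chunk in read_text_chunked('s3://bucket/file.csv.gz', pd.read_csv,
                               {'chunksize': 10000, 'compression': 'gzip'}):
    chunk.info()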

@jar-no1 jar-no1 added the bug Something isn't working label Apr 14, 2020

jar-no1 commented Apr 14, 2020

Tried this out with the mode set to 'rb' and with the mode argument left out; binary mode works in both cases. One general caveat is that pandas' native compression detection does not work here, because it is given a file stream instead of a path, and pandas infers the compression from the file extension (https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/io/common.py#L259-L284):

import pandas as pd
import s3fs

FILE='s3://bucket/file.csv.gz'
fs = s3fs.S3FileSystem(anon=False)

with fs.open(FILE) as f: # defaults to mode='rb'
    df1 = pd.read_csv(f, compression='gzip')
    df1.info()

with fs.open(FILE, mode='rb') as f: # same as default above
    df2 = pd.read_csv(f, compression='gzip')
    df2.info()

# Current implementation, fails
with fs.open(FILE, mode='r') as f:
    df3 = pd.read_csv(f, compression='gzip')
    df3.info()

# Without compression details, fails
with fs.open(FILE) as f:
    df4 = pd.read_csv(f)
    df4.info()
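
One way to work around that caveat (a sketch only, not necessarily how the library fix is implemented) is to infer the compression from the key's extension and pass it to pandas explicitly:

import pandas as pd
import s3fs

# Hypothetical helper: map the file extension to a pandas compression argument,
# then read the object in binary mode as in the proposed fix.
EXTENSION_TO_COMPRESSION = {'.gz': 'gzip', '.bz2': 'bz2', '.zip': 'zip', '.xz': 'xz'}

def read_csv_with_inferred_compression(path, **pandas_kwargs):
    compression = next(
        (comp for ext, comp in EXTENSION_TO_COMPRESSION.items() if path.endswith(ext)),
        None,  # fall back to plain text if no known extension matches
    )
    fs = s3fs.S3FileSystem(anon=False)
    with fs.open(path, mode='rb') as f:
        return pd.read_csv(f, compression=compression, **pandas_kwargs)

df = read_csv_with_inferred_compression('s3://bucket/file.csv.gz')
df.info()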

@igorborgest igorborgest added the WIP Work in progress label Apr 14, 2020
igorborgest (Contributor) commented

@jar-no1 thanks a lot! Very valuable contributions.

I'm working on that here and will push a PR in the next few minutes.

Your feedback will be very welcome.

igorborgest added a commit that referenced this issue Apr 14, 2020
Add csv decompression for s3.read_csv #175
igorborgest (Contributor) commented

Thanks @jar-no1, this issue is resolved in version 1.0.2. Please give feedback when possible.

@igorborgest igorborgest removed the WIP Work in progress label Apr 15, 2020

jar-no1 commented Apr 20, 2020

File reading has been working since the new version. Thanks!
