
Reading files from S3 fails with compressed files #175

Closed
jar-no1 opened this issue Apr 14, 2020 · 4 comments
Labels: bug (Something isn't working)

Comments

jar-no1 commented Apr 14, 2020

Describe the bug
When a compressed file is read with s3.read_csv(), the underlying s3fs file is opened in text mode, because the s3fs open mode 'r' is used instead of binary mode 'rb'. As a result, Python tries to decode the raw compressed content as text before Pandas ever parses it.

Loading fails with, e.g.:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

...
  File ".../lib/python3.6/site-packages/awswrangler/s3.py", line 1192, in _read_text_full
    return parser_func(f, **pandas_args)
  File ".../lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File ".../lib/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File ".../lib/python3.6/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File ".../lib/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File ".../lib/python3.6/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 529, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 720, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 916, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2063, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
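
For context, 0x8b is the second byte of the gzip magic header (0x1f 0x8b), so decoding the raw compressed stream as UTF-8 fails on the first read. A minimal local sketch of the same failure, using a hypothetical temporary file instead of S3:

import gzip

# A gzip stream starts with the magic bytes 0x1f 0x8b; 0x8b is not valid UTF-8,
# so reading the raw file in text mode fails immediately.
with gzip.open('/tmp/sample.csv.gz', 'wt') as f:  # hypothetical local path
    f.write('a,b\n1,2\n')

with open('/tmp/sample.csv.gz', 'r') as f:  # text mode, like fs.open(path, "r")
    f.read()  # raises UnicodeDecodeError: ... can't decode byte 0x8b in position 1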

To Reproduce

  • Try to read any compressed (non-plain-text) file with s3.read_csv():
import awswrangler
df = awswrangler.s3.read_csv('s3://bucket/file.csv.gz')

Version used:
awswrangler==1.0.1

Exactly the same file is readable with plain Pandas read_csv, which also uses s3fs under the hood (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-remote-files):

import pandas
df = pandas.read_csv('s3://bucket/file.csv.gz')

Possible fix

Open the file in binary mode 'rb' instead of text mode 'r'. Binary mode appears to be the internal default in s3fs and is what its examples use (https://s3fs.readthedocs.io/en/latest/#examples).
https://github.com/awslabs/aws-data-wrangler/blob/master/awswrangler/s3.py#L1191-L1193

    fs: s3fs.S3FileSystem = _utils.get_fs(session=boto3_session, s3_additional_kwargs=s3_additional_kwargs)
    with fs.open(path, "rb") as f:  # was: fs.open(path, "r")
        return parser_func(f, **pandas_args)

The same change should also be applied to the chunksize variant:
https://github.com/awslabs/aws-data-wrangler/blob/master/awswrangler/s3.py#L1178
It would probably also be worth checking the write functions for the same issue. A sketch of the chunked variant with the change applied follows.
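
A sketch of what the chunked variant could look like with the same change applied (the helper name and signature here are hypothetical, not the library's actual internals):

import pandas as pd
import s3fs

def read_text_chunked(path, parser_func, pandas_args):
    # Hypothetical chunked reader mirroring the proposed fix: open in binary
    # mode and let pandas return an iterator of DataFrames via chunksize.
    fs = s3fs.S3FileSystem()  # the library builds this via _utils.get_fs()
    with fs.open(path, 'rb') as f:  # "rb" instead of "r"
        for chunk in parser_func(f, **pandas_args):
            yield chunk

# Compression has to be passed explicitly, since pandas cannot infer it from a
# file object (see the follow-up comment below).
for chunk in read_text_chunked('s3://bucket/file.csv.gz', pd.read_csv,
                               {'chunksize': 10000, 'compression': 'gzip'}):
    chunk.info()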

@jar-no1 jar-no1 added the bug Something isn't working label Apr 14, 2020

jar-no1 commented Apr 14, 2020

Tried this out with the mode set to 'rb' and with the mode argument left out; binary mode works in both cases. One general caveat is that pandas' native compression detection does not work here, because it is given a file stream instead of a path, and pandas infers the compression from the file extension (https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/io/common.py#L259-L284):

import pandas as pd
import s3fs

FILE='s3://bucket/file.csv.gz'
fs = s3fs.S3FileSystem(anon=False)

with fs.open(FILE) as f: # defaults to mode='rb'
    df1 = pd.read_csv(f, compression='gzip')
    df1.info()

with fs.open(FILE, mode='rb') as f: # same as default above
    df2 = pd.read_csv(f, compression='gzip')
    df2.info()

# Current implementation, fails
with fs.open(FILE, mode='r') as f:
    df3 = pd.read_csv(f, compression='gzip')
    df3.info()

# Without compression details, fails
with fs.open(FILE) as f:
    df4 = pd.read_csv(f)
    df4.info()
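
One way to work around that caveat (a sketch only, not necessarily how the library fix is implemented) is to infer the compression from the key's extension and pass it to pandas explicitly:

import pandas as pd
import s3fs

# Hypothetical helper: map the file extension to a pandas compression argument,
# then read the object in binary mode as in the proposed fix.
EXTENSION_TO_COMPRESSION = {'.gz': 'gzip', '.bz2': 'bz2', '.zip': 'zip', '.xz': 'xz'}

def read_csv_with_inferred_compression(path, **pandas_kwargs):
    compression = next(
        (comp for ext, comp in EXTENSION_TO_COMPRESSION.items() if path.endswith(ext)),
        None,  # fall back to plain text if no known extension matches
    )
    fs = s3fs.S3FileSystem(anon=False)
    with fs.open(path, mode='rb') as f:
        return pd.read_csv(f, compression=compression, **pandas_kwargs)

df = read_csv_with_inferred_compression('s3://bucket/file.csv.gz')
df.info()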

@igorborgest igorborgest added the WIP Work in progress label Apr 14, 2020
igorborgest (Contributor) commented

@jar-no1 thanks a lot! Very valuable contributions.

I'm working on that here and will push a PR in the next few minutes.

Your feedback will be very welcome.

igorborgest added a commit that referenced this issue Apr 14, 2020
Add csv decompression for s3.read_csv #175
igorborgest (Contributor) commented

Thanks @jar-no1, this issue is resolved in version 1.0.2. Please give feedback when possible.

@igorborgest igorborgest removed the WIP Work in progress label Apr 15, 2020

jar-no1 commented Apr 20, 2020

File reading has been working since the new version. Thanks!
