Parse Excel from in-memory file object #1529

nicolashery · 2012-06-25T22:54:25Z

We came across a situation where we had a file object representing Excel data (came from HTTP POST but I'm thinking it could also come from MongoDB for example), and would've liked to pass it directly to Pandas to parse (vs saving it to disk and passing path to Pandas).

Could this be possible?

I saw that xlrd had file_contents as a possible argument of open_workbook:
https://github.com/python-excel/xlrd/blob/master/xlrd/__init__.py#L385

Maybe ExcelFile in Pandas could take path_or_buffer as argument, and pass along the correct one to xlrd.
https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1133

Don't know if that could work for openpyxl also.

Thoughts?

Thanks!
Nicolas

The text was updated successfully, but these errors were encountered:

wesm · 2012-06-25T22:56:43Z

Have you tried passing a file-like object (e.g. StringIO)?

nicolashery · 2012-06-25T23:06:32Z

Thanks for the quick reply!

It seems that open_workbook calls open_workbook_xls which in turn calls biff2_8_load (https://github.com/python-excel/xlrd/blob/master/xlrd/book.py#L549)

They all have the same logic which basically is:

if file_contents:
   filestr = file_contents
else:
   f = open(filename, 'rb')
   filestr = f.read()
   f.close()

So they don't check a single variable to see what type it is (string, ie a path, or StringIO). I guess Pandas would have to do it for them.

wesm · 2012-06-25T23:12:06Z

A fair point. the ExcelFile class doesn't accept file-like objects on closer inspection. I'll have a look (or you can make a pull request if you get energetic)

nicolashery · 2012-06-25T23:26:02Z

Looked at openpyxl, and they do the check "file object vs path" themselves (https://github.com/chronossc/openpyxl/blob/master/openpyxl/reader/excel.py#L43). So would mean only doing it for xlrd.

Tell you what, I feel energetic, so I'll try and look into it :)

wesm · 2012-06-28T22:31:21Z

Beat you to it. xlrd can't accept file-like objects so as a hack I write the data out to a temporary file

nicolashery · 2012-06-28T23:19:48Z

Haha well done! I didn't know we were racing ;)

Yeah didn't get a chance to work on it as much as I wanted. But I did spot:

xlrd accepts path or bytes (coming from file.read())
https://github.com/python-excel/xlrd/blob/master/xlrd/__init__.py#L385
openpyxl accepts path or file object (not file-like, has to be an instance of file, which is kind of limitting..)
https://github.com/chronossc/openpyxl/blob/master/openpyxl/reader/excel.py#L43

I also saw that xlrd was going to support Excel 2007 in future versions, I don't know what you want to do about that, ie keep using both, or switch to only xlrd.

I didn't know about tempfile (nice trick!). I was going to just feed xlrd the bytes from file.read(), but that wouldn't work for openpyxl. I also had put together a function to check what type of Excel file it is, inspired with how the master branch of xlrd does it:

def _excel_type(filepath_or_buffer):
    # Thanks to xlrd for this
    peeksz = 4
    if isinstance(filepath_or_buffer, str):
        f = open(filepath_or_buffer, "rb")
        peek = f.read(peeksz)
        f.close()
    elif hasattr(filepath_or_buffer, 'read') \
            and hasattr(filepath_or_buffer, 'seek'):
        f = filepath_or_buffer
        peek = f.read(peeksz)
        f.seek(0)
    else:
        raise TypeError("You must provide the path to a file "
                        "or a file-like object")
    # Check if ZIP file
    if peek == "PK\x03\x04" \
        or peek == "PK\x03\x04".encode('latin1'): # Python 3
        return 'xlsx'
    else:
        return 'xls'

Then I would've checked the type (and I like your way of checking for file-like, ie needs a read method, possibly a seek too, this way it works with "file objects" coming from Flask HTTP uploads, or MongoDB GridFS..), and done:

wb = xlrd.open_workbook(filename=filename)
# or
f = open(filename, 'rb')
bytes = f.read()
f.close()
wb = xlrd.open_workbook(file_contents=bytes)

I guess the only advantage there is it saves having to use a tempfile and an I/O trip to the disk. But your solution has the advantage that it just works, and also is compatible with openpyxl.

Thanks for taking the time!

wesm · 2012-06-28T23:21:10Z

Ah, well you put in more effort on figuring it out than I did. We should return to this at some point and improve the solution, maybe once xlrd/openpyxl have settled down a bit

nicolashery · 2012-06-28T23:22:02Z

Agreed!

wesm · 2012-06-29T00:15:10Z

OK I refactored to not use a tempfile, a StringIO for openpyxl and passing file_contents to xlrd

nicolashery · 2012-06-29T00:40:25Z

acb5f03
Great! I think that's better :)

Just a couple small things:

you might have 'kind' as a parameter still in your doctring, but you didn't end up using it, might want to take it out to avoid confusion?
I'm guessing the answer is yes, but did you test the BytesIO with openpyxl? Does the ZipFile constructor, which they use, accept a buffer?

Thanks

Version 0.8.0 * tag 'v0.8.0': (21 commits) RLS: version 0.8.0 DOC: release notes BUG: _get_marker_compat insufficient on matplotlib < 1.1.0 BUG: don't use local() in read_* functions, breaks sys.settrace. close pandas-dev#1547 BUG: fix Panel slice setting issue and matplotlib import issues pandas-dev#1548, pandas-dev#1533 ENH: parsers don't use tempfile ENH: implement DataFrameGroupBy.boxplot(), close pandas-dev#1507 BUG: fix MultiIndex indexing issues in pandas-dev#1537, python 2.5 api fix BUG: fix incorrect bin labels from cut when labels=False and NA present. close pandas-dev#1511 ENH: support file-like objects in ExcelFile, close pandas-dev#1529 TST: skip test raising unsortable warning on 32-bit windows, other platforms. pandas-dev#1546 BUG: raise exceptions out of trying to parse iso8601 strings TST: separated test case BUG: custom colors for bar chart pandas-dev#1540 ENH: add 'time' as inferred_type ENH: datetime.time converters for plotting BUG: fix MultiIndex segfault due to internal refactoring. close pandas-dev#1532 BUG: fix MultiIndex compatibility bugs described in pandas-dev#1534 post gutting internal array close pandas-dev#1534 BUG: parser bug when parse_dates is string pandas-dev#1544 BUG: return nameless Series and index from from_csv ...

ludaavics · 2012-08-20T23:16:04Z

sweet

wesm closed this as completed in d8a0b4f Jun 28, 2012

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parse Excel from in-memory file object #1529

Parse Excel from in-memory file object #1529

nicolashery commented Jun 25, 2012

wesm commented Jun 25, 2012

nicolashery commented Jun 25, 2012

wesm commented Jun 25, 2012

nicolashery commented Jun 25, 2012

wesm commented Jun 28, 2012

nicolashery commented Jun 28, 2012

wesm commented Jun 28, 2012

nicolashery commented Jun 28, 2012

wesm commented Jun 29, 2012

nicolashery commented Jun 29, 2012

ludaavics commented Aug 20, 2012

Parse Excel from in-memory file object #1529

Parse Excel from in-memory file object #1529

Comments

nicolashery commented Jun 25, 2012

wesm commented Jun 25, 2012

nicolashery commented Jun 25, 2012

wesm commented Jun 25, 2012

nicolashery commented Jun 25, 2012

wesm commented Jun 28, 2012

nicolashery commented Jun 28, 2012

wesm commented Jun 28, 2012

nicolashery commented Jun 28, 2012

wesm commented Jun 29, 2012

nicolashery commented Jun 29, 2012

ludaavics commented Aug 20, 2012