Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse Excel from in-memory file object #1529

Closed
nicolashery opened this issue Jun 25, 2012 · 11 comments
Closed

Parse Excel from in-memory file object #1529

nicolashery opened this issue Jun 25, 2012 · 11 comments
Milestone

Comments

@nicolashery
Copy link

We came across a situation where we had a file object representing Excel data (came from HTTP POST but I'm thinking it could also come from MongoDB for example), and would've liked to pass it directly to Pandas to parse (vs saving it to disk and passing path to Pandas).

Could this be possible?

I saw that xlrd had file_contents as a possible argument of open_workbook:
https://github.com/python-excel/xlrd/blob/master/xlrd/__init__.py#L385

Maybe ExcelFile in Pandas could take path_or_buffer as argument, and pass along the correct one to xlrd.
https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1133

Don't know if that could work for openpyxl also.

Thoughts?

Thanks!
Nicolas

@wesm
Copy link
Member

wesm commented Jun 25, 2012

Have you tried passing a file-like object (e.g. StringIO)?

@nicolashery
Copy link
Author

Thanks for the quick reply!

It seems that open_workbook calls open_workbook_xls which in turn calls biff2_8_load (https://github.com/python-excel/xlrd/blob/master/xlrd/book.py#L549)

They all have the same logic which basically is:

if file_contents:
   filestr = file_contents
else:
   f = open(filename, 'rb')
   filestr = f.read()
   f.close()

So they don't check a single variable to see what type it is (string, ie a path, or StringIO). I guess Pandas would have to do it for them.

@wesm
Copy link
Member

wesm commented Jun 25, 2012

A fair point. the ExcelFile class doesn't accept file-like objects on closer inspection. I'll have a look (or you can make a pull request if you get energetic)

@nicolashery
Copy link
Author

Looked at openpyxl, and they do the check "file object vs path" themselves (https://github.com/chronossc/openpyxl/blob/master/openpyxl/reader/excel.py#L43). So would mean only doing it for xlrd.

Tell you what, I feel energetic, so I'll try and look into it :)

@wesm wesm closed this as completed in d8a0b4f Jun 28, 2012
@wesm
Copy link
Member

wesm commented Jun 28, 2012

Beat you to it. xlrd can't accept file-like objects so as a hack I write the data out to a temporary file

@nicolashery
Copy link
Author

Haha well done! I didn't know we were racing ;)

Yeah didn't get a chance to work on it as much as I wanted. But I did spot:

I also saw that xlrd was going to support Excel 2007 in future versions, I don't know what you want to do about that, ie keep using both, or switch to only xlrd.

I didn't know about tempfile (nice trick!). I was going to just feed xlrd the bytes from file.read(), but that wouldn't work for openpyxl. I also had put together a function to check what type of Excel file it is, inspired with how the master branch of xlrd does it:

def _excel_type(filepath_or_buffer):
    # Thanks to xlrd for this
    peeksz = 4
    if isinstance(filepath_or_buffer, str):
        f = open(filepath_or_buffer, "rb")
        peek = f.read(peeksz)
        f.close()
    elif hasattr(filepath_or_buffer, 'read') \
            and hasattr(filepath_or_buffer, 'seek'):
        f = filepath_or_buffer
        peek = f.read(peeksz)
        f.seek(0)
    else:
        raise TypeError("You must provide the path to a file "
                        "or a file-like object")
    # Check if ZIP file
    if peek == "PK\x03\x04" \
        or peek == "PK\x03\x04".encode('latin1'): # Python 3
        return 'xlsx'
    else:
        return 'xls'

Then I would've checked the type (and I like your way of checking for file-like, ie needs a read method, possibly a seek too, this way it works with "file objects" coming from Flask HTTP uploads, or MongoDB GridFS..), and done:

wb = xlrd.open_workbook(filename=filename)
# or
f = open(filename, 'rb')
bytes = f.read()
f.close()
wb = xlrd.open_workbook(file_contents=bytes)

I guess the only advantage there is it saves having to use a tempfile and an I/O trip to the disk. But your solution has the advantage that it just works, and also is compatible with openpyxl.

Thanks for taking the time!

@wesm
Copy link
Member

wesm commented Jun 28, 2012

Ah, well you put in more effort on figuring it out than I did. We should return to this at some point and improve the solution, maybe once xlrd/openpyxl have settled down a bit

@nicolashery
Copy link
Author

Agreed!

@wesm
Copy link
Member

wesm commented Jun 29, 2012

OK I refactored to not use a tempfile, a StringIO for openpyxl and passing file_contents to xlrd

@nicolashery
Copy link
Author

acb5f03
Great! I think that's better :)

Just a couple small things:

  • you might have 'kind' as a parameter still in your doctring, but you didn't end up using it, might want to take it out to avoid confusion?
  • I'm guessing the answer is yes, but did you test the BytesIO with openpyxl? Does the ZipFile constructor, which they use, accept a buffer?

Thanks

yarikoptic added a commit to neurodebian/pandas that referenced this issue Jun 30, 2012
Version 0.8.0

* tag 'v0.8.0': (21 commits)
  RLS: version 0.8.0
  DOC: release notes
  BUG: _get_marker_compat insufficient on matplotlib < 1.1.0
  BUG: don't use local() in read_* functions, breaks sys.settrace. close pandas-dev#1547
  BUG: fix Panel slice setting issue and matplotlib import issues pandas-dev#1548, pandas-dev#1533
  ENH: parsers don't use tempfile
  ENH: implement DataFrameGroupBy.boxplot(), close pandas-dev#1507
  BUG: fix MultiIndex indexing issues in pandas-dev#1537, python 2.5 api fix
  BUG: fix incorrect bin labels from cut when labels=False and NA present. close pandas-dev#1511
  ENH: support file-like objects in ExcelFile, close pandas-dev#1529
  TST: skip test raising unsortable warning on 32-bit windows, other platforms. pandas-dev#1546
  BUG: raise exceptions out of trying to parse iso8601 strings
  TST: separated test case
  BUG: custom colors for bar chart pandas-dev#1540
  ENH: add 'time' as inferred_type
  ENH: datetime.time converters for plotting
  BUG: fix MultiIndex segfault due to internal refactoring. close pandas-dev#1532
  BUG: fix MultiIndex compatibility bugs described in pandas-dev#1534 post gutting internal array close pandas-dev#1534
  BUG: parser bug when parse_dates is string pandas-dev#1544
  BUG: return nameless Series and index from from_csv
  ...
@ludaavics
Copy link

sweet

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants