-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse Excel from in-memory file object #1529
Comments
Have you tried passing a file-like object (e.g. |
Thanks for the quick reply! It seems that They all have the same logic which basically is: if file_contents:
filestr = file_contents
else:
f = open(filename, 'rb')
filestr = f.read()
f.close() So they don't check a single variable to see what type it is ( |
A fair point. the |
Looked at openpyxl, and they do the check "file object vs path" themselves (https://github.com/chronossc/openpyxl/blob/master/openpyxl/reader/excel.py#L43). So would mean only doing it for xlrd. Tell you what, I feel energetic, so I'll try and look into it :) |
Beat you to it. xlrd can't accept file-like objects so as a hack I write the data out to a temporary file |
Haha well done! I didn't know we were racing ;) Yeah didn't get a chance to work on it as much as I wanted. But I did spot:
I also saw that xlrd was going to support Excel 2007 in future versions, I don't know what you want to do about that, ie keep using both, or switch to only xlrd. I didn't know about def _excel_type(filepath_or_buffer):
# Thanks to xlrd for this
peeksz = 4
if isinstance(filepath_or_buffer, str):
f = open(filepath_or_buffer, "rb")
peek = f.read(peeksz)
f.close()
elif hasattr(filepath_or_buffer, 'read') \
and hasattr(filepath_or_buffer, 'seek'):
f = filepath_or_buffer
peek = f.read(peeksz)
f.seek(0)
else:
raise TypeError("You must provide the path to a file "
"or a file-like object")
# Check if ZIP file
if peek == "PK\x03\x04" \
or peek == "PK\x03\x04".encode('latin1'): # Python 3
return 'xlsx'
else:
return 'xls' Then I would've checked the type (and I like your way of checking for file-like, ie needs a wb = xlrd.open_workbook(filename=filename)
# or
f = open(filename, 'rb')
bytes = f.read()
f.close()
wb = xlrd.open_workbook(file_contents=bytes) I guess the only advantage there is it saves having to use a tempfile and an I/O trip to the disk. But your solution has the advantage that it just works, and also is compatible with openpyxl. Thanks for taking the time! |
Ah, well you put in more effort on figuring it out than I did. We should return to this at some point and improve the solution, maybe once xlrd/openpyxl have settled down a bit |
Agreed! |
OK I refactored to not use a tempfile, a StringIO for openpyxl and passing file_contents to xlrd |
acb5f03 Just a couple small things:
Thanks |
Version 0.8.0 * tag 'v0.8.0': (21 commits) RLS: version 0.8.0 DOC: release notes BUG: _get_marker_compat insufficient on matplotlib < 1.1.0 BUG: don't use local() in read_* functions, breaks sys.settrace. close pandas-dev#1547 BUG: fix Panel slice setting issue and matplotlib import issues pandas-dev#1548, pandas-dev#1533 ENH: parsers don't use tempfile ENH: implement DataFrameGroupBy.boxplot(), close pandas-dev#1507 BUG: fix MultiIndex indexing issues in pandas-dev#1537, python 2.5 api fix BUG: fix incorrect bin labels from cut when labels=False and NA present. close pandas-dev#1511 ENH: support file-like objects in ExcelFile, close pandas-dev#1529 TST: skip test raising unsortable warning on 32-bit windows, other platforms. pandas-dev#1546 BUG: raise exceptions out of trying to parse iso8601 strings TST: separated test case BUG: custom colors for bar chart pandas-dev#1540 ENH: add 'time' as inferred_type ENH: datetime.time converters for plotting BUG: fix MultiIndex segfault due to internal refactoring. close pandas-dev#1532 BUG: fix MultiIndex compatibility bugs described in pandas-dev#1534 post gutting internal array close pandas-dev#1534 BUG: parser bug when parse_dates is string pandas-dev#1544 BUG: return nameless Series and index from from_csv ...
sweet |
We came across a situation where we had a file object representing Excel data (came from HTTP POST but I'm thinking it could also come from MongoDB for example), and would've liked to pass it directly to Pandas to parse (vs saving it to disk and passing path to Pandas).
Could this be possible?
I saw that xlrd had
file_contents
as a possible argument ofopen_workbook
:https://github.com/python-excel/xlrd/blob/master/xlrd/__init__.py#L385
Maybe
ExcelFile
in Pandas could takepath_or_buffer
as argument, and pass along the correct one to xlrd.https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L1133
Don't know if that could work for openpyxl also.
Thoughts?
Thanks!
Nicolas
The text was updated successfully, but these errors were encountered: