Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_excel in version 0.25.0rc0 treats empty columns differently #27252

Closed
snordhausen opened this issue Jul 5, 2019 · 17 comments · Fixed by #27426
Closed

read_excel in version 0.25.0rc0 treats empty columns differently #27252

snordhausen opened this issue Jul 5, 2019 · 17 comments · Fixed by #27426
Labels
IO Excel read_excel, to_excel
Milestone

Comments

@snordhausen
Copy link

I'm using this code to load an Excel file.

df = pandas.read_excel(
    "data.xlsx",
    sheet_name="sheet1",
    usecols=[0, 1], 
    header=None,
    names=["foo", "bar"]
)

print(df.head())

The Excel file has the cells A7=1, A8=2, A9=3, everything else is empty.

With pandas 0.24.2 I get this:

   foo  bar
0    1  NaN
1    2  NaN
2    3  NaN

With pandas 0.25.0rc0 I get:

Traceback (most recent call last):
  File "tester.py", line 8, in <module>
    names=["foo", "bar"]
  File "/home/me/.env/lib/python3.7/site-packages/pandas/util/_decorators.py", line 196, in wrapper
    return func(*args, **kwargs)
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 334, in read_excel
    **kwds
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 877, in parse
    **kwds
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 507, in parse
    **kwds
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/parsers.py", line 2218, in TextParser
    return TextFileReader(*args, **kwds)
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/parsers.py", line 1147, in _make_engine
    self._engine = klass(self.f, **self.options)
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/parsers.py", line 2305, in __init__
    ) = self._infer_columns()
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/parsers.py", line 2712, in _infer_columns
    _validate_usecols_names(self.usecols, range(ncols))
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/parsers.py", line 1255, in _validate_usecols_names
    "columns expected but not found: {missing}".format(missing=missing)
ValueError: Usecols do not match columns, columns expected but not found: [1]

The problem happens because the bar column does not contain any data. As soon as I put a value into it, both versions do the same thing.
I'm using Python 3.7.3 in Ubuntu 19.04.

@snordhausen snordhausen changed the title read_excel in version 0.25.0rc0 treats empty empty columns differently read_excel in version 0.25.0rc0 treats empty columns differently Jul 5, 2019
@WillAyd
Copy link
Member

WillAyd commented Jul 5, 2019

I think this is intentional ref #25623 so not really a regression. Do you have a particular use case for this?

@snordhausen
Copy link
Author

@WillAyd Our use case is that we have daily reports and one of the columns only contains data when something unusual happened. Consequently, in some files this column is completely empty and "the column is completely empty" is exactly the information that we are looking for.

The change in #25623 that you referenced mentions CSV files. For CSV files I agree that the change is very useful, since the CSV file really does not contain the column. But for Excel files, there is no such thing as a non-existing column.

@WillAyd
Copy link
Member

WillAyd commented Jul 5, 2019

I don't think this is something likely to be reverted as it was a bug in core IO handling before that allowed this not to raise but let's see what others think

@jreback
Copy link
Contributor

jreback commented Jul 5, 2019

shouldn’t just specifying names work?

@WillAyd
Copy link
Member

WillAyd commented Jul 5, 2019

Seems to work for me locally - @snordhausen how about on your end?

@simonjayhawkins simonjayhawkins added the IO Excel read_excel, to_excel label Jul 5, 2019
@snordhausen
Copy link
Author

@WillAyd To make sure that we are both testing the same thing, I extended my test program to also create the data.xlsx file:

import pandas
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws['A7'] = 1
ws['A8'] = 2
ws['A9'] = 3
wb.save("data.xlsx")

df = pandas.read_excel(
    "data.xlsx",
    sheet_name="Sheet",
    usecols=[0, 1],
    header=None,
    names=["foo", "bar"]
)

print(df)

I also tried this out in a fresh Ubuntu 18.04 docker container and could reproduce the issue.

@WillAyd
Copy link
Member

WillAyd commented Jul 8, 2019 via email

@snordhausen
Copy link
Author

Removing usecols makes the program work with 0.25.0rc0.

However, that looks inconsistent to me: why can I implicitly load empty columns, but when I explicitly ask for them I get an error? Also, it means I cannot load (potentially) empty columns in the middle of the table, e.g. if I only wanted column 0 and 20.

@WillAyd
Copy link
Member

WillAyd commented Jul 8, 2019

However, that looks inconsistent to me: why can I implicitly load empty columns, but when I explicitly ask for them I get an error?

The fact that this worked previously is inconsistent with read_csv. usecols is typically validated and missing indexes or labels throws errors. For example:

>>> data = """a,b,c\n1,2,3"""
>>> pd.read_csv(io.StringIO(test), usecols=['x'])
ValueError: Usecols do not match columns, columns expected but not found: ['x']

>>> pd.read_csv(io.StringIO(test), usecols=[10])
ValueError: Usecols do not match columns, columns expected but not found: [10]

So I don't think there is any reason to have Excel be excepted from that validation. You can use names as suggested above or reindex the output on your own

@tabias
Copy link

tabias commented Jul 15, 2019

The biggest issue is using the parser to read multiple sheets from 1 excel file.

Trying to read multiple sheets in 1 IO causes a lot of issues if the column length varies within a range (eg. "AA, AG:BZ") with AA being the index and AG:BZ the potential columns.
This example will throw an error instead of omitting the empty columns, which caused a lot of headaches and let me to revert to 0.24.

@WillAyd
Copy link
Member

WillAyd commented Jul 15, 2019

@pandas-dev/pandas-core would anyone object to reverting #25623 ? It looks like this is causing confusion in the Excel world as described by users above

To support use cases above with that in place we would need to break Excel usecols handling from the CSV one. I'm not sure this is desired but at the same time I don't think the issue we solved to raise for bad usecols is that urgent so could defer that if its a hang up for RC users

@gfyoung
Copy link
Member

gfyoung commented Jul 15, 2019

I have no objections to reverting the original PR.

However, I would meet that issue half-way and issue warnings instead.

@WillAyd
Copy link
Member

WillAyd commented Jul 15, 2019

A FutureWarning or did you have something else in mind?

@gfyoung
Copy link
Member

gfyoung commented Jul 15, 2019

I would go with UserWarning.

FutureWarning to me implies some kind of deprecation, which I don't think will happen at this point (unless we have some really strong feelings about keeping this behavior).

@jorisvandenbossche
Copy link
Member

I am fine with reverting to restore the functionality of excel for 0.25.0.

But I also wanted to mention that from a user perspective, I wouldn't mind that some options behave differently between csv and excel (in the end, they are different formats with different capabilities). Whether this is possible/desirable from a code perspective, don't know the parsing code well enough for that.

@jorisvandenbossche jorisvandenbossche added this to the 0.25.0 milestone Jul 15, 2019
@WillAyd WillAyd mentioned this issue Jul 15, 2019
@gfyoung
Copy link
Member

gfyoung commented Jul 15, 2019

I wouldn't mind that some options behave differently between csv and excel (in the end, they are different formats with different capabilities)

Whether this is possible/desirable from a code perspective, don't know the parsing code well enough for that

It's definitely possible, but I would want more feedback from users, hence why I suggested the warning. That way we can draw people's attention to it (maybe even reference the two issues).

@isavelli
Copy link

This problem seems to be reintroduced, that is the following code will generate an error:

import pandas
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws['A7'] = 1
ws['A8'] = 2
ws['A9'] = 3
wb.save("data.xlsx")

df = pandas.read_excel(
    "data.xlsx",
    sheet_name="Sheet",
    usecols=[0, 1],
    header=None,
    names=["foo", "bar"]
)

print(df)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants