read_excel in version 0.25.0rc0 treats empty columns differently #27252
Comments
I think this is intentional (ref #25623), so not really a regression. Do you have a particular use case for this? |
@WillAyd Our use case is that we have daily reports and one of the columns only contains data when something unusual happened. Consequently, in some files this column is completely empty and "the column is completely empty" is exactly the information that we are looking for. The change in #25623 that you referenced mentions CSV files. For CSV files I agree that the change is very useful, since the CSV file really does not contain the column. But for Excel files, there is no such thing as a non-existing column. |
I don't think this is likely to be reverted, as it was a bug in core IO handling that previously allowed this not to raise, but let's see what others think. |
Shouldn’t just specifying names work? |
Seems to work for me locally - @snordhausen how about on your end? |
@WillAyd To make sure that we are both testing the same thing, I extended my test program to also create the data.xlsx file.
I also tried this out in a fresh Ubuntu 18.04 docker container and could reproduce the issue. |
Try removing usecols from your call
On Jul 8, 2019, at 2:41 AM, Stefan Nordhausen ***@***.***> wrote:
@WillAyd To make sure that we are both testing the same thing, I extended my test program to also create the data.xlsx file:
import pandas
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws['A7'] = 1
ws['A8'] = 2
ws['A9'] = 3
wb.save("data.xlsx")
df = pandas.read_excel(
"data.xlsx",
sheet_name="Sheet",
usecols=[0, 1],
header=None,
names=["foo", "bar"]
)
print(df)
I also tried this out in a fresh Ubuntu 18.04 docker container and could reproduce the issue.
|
Removing usecols avoids the error. However, that looks inconsistent to me: why can I implicitly load empty columns, but when I explicitly ask for them I get an error? Also, it means I cannot load (potentially) empty columns in the middle of the table, e.g. if I only wanted columns 0 and 20. |
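One possible workaround for the "columns 0 and 20" case, sketched under assumptions: read the sheet without usecols and select the wanted positions afterwards with DataFrame.reindex, which creates all-NaN columns for labels that are absent instead of raising. The helper name pick_columns is hypothetical and not part of pandas, and the small DataFrame below only stands in for what read_excel(..., header=None) is assumed to return when just column A holds data.

```python
import pandas as pd


def pick_columns(df, positions, names):
    """Select columns by integer label; absent ones become all-NaN columns.

    Hypothetical helper for illustration, not part of pandas.
    """
    # reindex on the column axis keeps the labels that exist and fills
    # the missing ones with NaN rather than raising an error.
    out = df.reindex(columns=positions)
    out.columns = names
    return out


# Assumption: stand-in for pd.read_excel("data.xlsx", header=None) on a
# sheet where only column A contains data (a single column labelled 0).
df = pd.DataFrame({0: [1, 2, 3]})

result = pick_columns(df, positions=[0, 20], names=["foo", "bar"])
print(result)
```

The empty column survives as NaN values, so "the column is completely empty" can still be checked with result["bar"].isna().all().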
The fact that this worked previously is inconsistent with read_csv.
>>> data = """a,b,c\n1,2,3"""
>>> pd.read_csv(io.StringIO(data), usecols=['x'])
ValueError: Usecols do not match columns, columns expected but not found: ['x']
>>> pd.read_csv(io.StringIO(data), usecols=[10])
ValueError: Usecols do not match columns, columns expected but not found: [10]
So I don't think there is any reason to have Excel be excepted from that validation. You can use |
The biggest issue is using the parser to read multiple sheets from one Excel file. Trying to read multiple sheets in one IO operation causes a lot of issues if the number of columns varies within a range (e.g. "AA, AG:BZ"), with AA being the index and AG:BZ the potential columns. |
@pandas-dev/pandas-core would anyone object to reverting #25623? It looks like this is causing confusion in the Excel world, as described by users above. To support the use cases above with that in place we would need to break Excel |
I have no objections to reverting the original PR. However, I would meet that issue half-way and issue warnings instead. |
A FutureWarning or did you have something else in mind? |
I would go with
|
I am fine with reverting to restore the functionality of Excel for 0.25.0. But I also wanted to mention that, from a user perspective, I wouldn't mind if some options behave differently between CSV and Excel (in the end, they are different formats with different capabilities). Whether this is possible or desirable from a code perspective, I don't know the parsing code well enough to say. |
It's definitely possible, but I would want more feedback from users, hence why I suggested the warning. That way we can draw people's attention to it (maybe even reference the two issues). |
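To make the "warn instead of raise" idea concrete, here is a minimal sketch using only the standard warnings module. It is not pandas's actual implementation: the function resolve_usecols, its parameters, and the message text are hypothetical, and FutureWarning is used only because it is the category named in the question above; the thread does not show which category was finally preferred.

```python
import warnings


def resolve_usecols(available, usecols):
    """Hypothetical check: warn about unmatched usecols instead of raising."""
    missing = [c for c in usecols if c not in available]
    if missing:
        # Assumption: a FutureWarning signals that this may become an error.
        warnings.warn(
            f"usecols {missing} not found in the parsed data; "
            "this may raise in a future version",
            FutureWarning,
            stacklevel=2,
        )
    found = [c for c in usecols if c in available]
    return found, missing


# Usage: only column 0 was parsed, but columns 0 and 1 were requested.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    found, missing = resolve_usecols(available=[0], usecols=[0, 1])

print(found, missing, [str(w.message) for w in caught])
```

A caller that treats an empty column as meaningful could then fill in the entries of missing itself, while still being alerted that the request did not match the data.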
This problem seems to be reintroduced, that is the following code will generate an error:
|
I'm using this code to load an Excel file.
The Excel file has the cells A7=1, A8=2, A9=3; everything else is empty.
With pandas 0.24.2 I get this:
With pandas 0.25.0rc0 I get:
The problem happens because the bar column does not contain any data. As soon as I put a value into it, both versions do the same thing.
I'm using Python 3.7.3 in Ubuntu 19.04.