read_excel in version 0.25.0rc0 treats empty columns differently #27252

snordhausen · 2019-07-05T14:54:31Z

I'm using this code to load an Excel file.

import pandas

df = pandas.read_excel(
    "data.xlsx",
    sheet_name="sheet1",
    usecols=[0, 1],
    header=None,
    names=["foo", "bar"]
)

print(df.head())

The Excel file has the cells A7=1, A8=2, A9=3, everything else is empty.

With pandas 0.24.2 I get this:

   foo  bar
0    1  NaN
1    2  NaN
2    3  NaN

With pandas 0.25.0rc0 I get:

Traceback (most recent call last):
  File "tester.py", line 8, in <module>
    names=["foo", "bar"]
  File "/home/me/.env/lib/python3.7/site-packages/pandas/util/_decorators.py", line 196, in wrapper
    return func(*args, **kwargs)
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 334, in read_excel
    **kwds
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 877, in parse
    **kwds
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 507, in parse
    **kwds
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/parsers.py", line 2218, in TextParser
    return TextFileReader(*args, **kwds)
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/parsers.py", line 1147, in _make_engine
    self._engine = klass(self.f, **self.options)
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/parsers.py", line 2305, in __init__
    ) = self._infer_columns()
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/parsers.py", line 2712, in _infer_columns
    _validate_usecols_names(self.usecols, range(ncols))
  File "/home/me/.env/lib/python3.7/site-packages/pandas/io/parsers.py", line 1255, in _validate_usecols_names
    "columns expected but not found: {missing}".format(missing=missing)
ValueError: Usecols do not match columns, columns expected but not found: [1]

The problem happens because the bar column does not contain any data. As soon as I put a value into it, both versions do the same thing.
I'm using Python 3.7.3 in Ubuntu 19.04.

WillAyd · 2019-07-05T15:12:46Z

I think this is intentional (ref #25623), so not really a regression. Do you have a particular use case for this?

snordhausen · 2019-07-05T16:32:17Z

@WillAyd Our use case is that we have daily reports and one of the columns only contains data when something unusual happened. Consequently, in some files this column is completely empty and "the column is completely empty" is exactly the information that we are looking for.

The change in #25623 that you referenced mentions CSV files. For CSV files I agree that the change is very useful, since the CSV file really does not contain the column. But for Excel files, there is no such thing as a non-existing column.

WillAyd · 2019-07-05T16:40:09Z

I don't think this is likely to be reverted, as it was a bug in core IO handling that previously allowed this not to raise, but let's see what others think.

jreback · 2019-07-05T17:01:17Z

Shouldn’t just specifying names work?

WillAyd · 2019-07-05T19:26:27Z

Seems to work for me locally - @snordhausen how about on your end?

snordhausen · 2019-07-08T09:41:40Z

@WillAyd To make sure that we are both testing the same thing, I extended my test program to also create the data.xlsx file:

import pandas
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws['A7'] = 1
ws['A8'] = 2
ws['A9'] = 3
wb.save("data.xlsx")

df = pandas.read_excel(
    "data.xlsx",
    sheet_name="Sheet",
    usecols=[0, 1],
    header=None,
    names=["foo", "bar"]
)

print(df)

I also tried this out in a fresh Ubuntu 18.04 docker container and could reproduce the issue.

WillAyd · 2019-07-08T13:33:19Z

Try removing usecols from your call

snordhausen · 2019-07-08T14:11:26Z

Removing usecols makes the program work with 0.25.0rc0.

However, that looks inconsistent to me: why can I implicitly load empty columns, but when I explicitly ask for them I get an error? Also, it means I cannot load (potentially) empty columns in the middle of the table, e.g. if I only wanted column 0 and 20.

WillAyd · 2019-07-08T14:46:14Z

> However, that looks inconsistent to me: why can I implicitly load empty columns, but when I explicitly ask for them I get an error?

The fact that this worked previously is inconsistent with read_csv. usecols is typically validated, and missing indexes or labels throw errors. For example:

>>> import io
>>> import pandas as pd
>>> data = """a,b,c\n1,2,3"""
>>> pd.read_csv(io.StringIO(data), usecols=['x'])
ValueError: Usecols do not match columns, columns expected but not found: ['x']

>>> pd.read_csv(io.StringIO(data), usecols=[10])
ValueError: Usecols do not match columns, columns expected but not found: [10]

So I don't think there is any reason to exempt Excel from that validation. You can use names as suggested above or reindex the output on your own.
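A minimal sketch of the "reindex the output on your own" route, for illustration only. It assumes the sheet is read without usecols and with header=None, so the parsed frame has integer column labels; the helper name select_columns_by_position is hypothetical, and a hand-built DataFrame stands in for the result of read_excel so the example does not need an Excel file.

```python
import pandas as pd


def select_columns_by_position(df, positions, names):
    """Hypothetical helper (not part of pandas): keep the requested
    positional columns and create all-NaN columns for any position
    that the parsed frame does not contain."""
    # reindex fills column labels that are absent from df with NaN
    out = df.reindex(columns=positions)
    out.columns = names
    return out


# Stand-in for pandas.read_excel("data.xlsx", header=None):
# only column 0 holds data, as in the report above.
parsed = pd.DataFrame({0: [1, 2, 3]})

df = select_columns_by_position(parsed, [0, 1], ["foo", "bar"])
print(df)
```

The same helper would cover the "column 0 and 20" case by passing [0, 20] as positions, since positions beyond the parsed width simply come back as empty columns.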

tabias · 2019-07-15T11:52:32Z

The biggest issue is using the parser to read multiple sheets from one Excel file.

Trying to read multiple sheets in one IO causes a lot of issues if the number of columns varies within a range (e.g. "AA, AG:BZ"), with AA being the index and AG:BZ the potential columns.
This example will throw an error instead of omitting the empty columns, which caused a lot of headaches and led me to revert to 0.24.

WillAyd · 2019-07-15T15:21:26Z

@pandas-dev/pandas-core would anyone object to reverting #25623? It looks like this is causing confusion in the Excel world, as described by users above.

To support the use cases above with that in place, we would need to split Excel usecols handling from the CSV one. I'm not sure this is desired, but at the same time I don't think the issue we solved by raising for bad usecols is that urgent, so we could defer it if it's a hang-up for RC users.

gfyoung · 2019-07-15T18:29:46Z

I have no objections to reverting the original PR.

However, I would meet that issue half-way and issue warnings instead.

WillAyd · 2019-07-15T18:30:36Z

A FutureWarning or did you have something else in mind?

gfyoung · 2019-07-15T18:57:13Z

I would go with UserWarning.

FutureWarning to me implies some kind of deprecation, which I don't think will happen at this point (unless we have some really strong feelings about keeping this behavior).
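To make the "warn instead of raise" idea concrete, here is a rough sketch. It is not pandas' actual validation code: validate_usecols and its arguments are hypothetical names, and the warning message only mirrors the wording of the error quoted earlier in the thread.

```python
import warnings


def validate_usecols(usecols, ncols):
    """Hypothetical validator (not pandas' implementation): warn about
    positional usecols that fall outside the parsed columns instead of
    raising, and return only the positions that do exist."""
    missing = [col for col in usecols if col >= ncols]
    if missing:
        warnings.warn(
            f"Usecols do not match columns, columns expected but not found: {missing}",
            UserWarning,
            stacklevel=2,
        )
    return [col for col in usecols if col < ncols]


# One parsed column, but positions 0 and 20 were requested:
# a UserWarning is emitted and only position 0 is kept.
kept = validate_usecols([0, 20], ncols=1)
print(kept)
```

A UserWarning keeps the call working while still pointing users at the out-of-bounds request; switching the category to FutureWarning would instead signal a planned behavior change.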

jorisvandenbossche · 2019-07-15T20:56:47Z

I am fine with reverting to restore the functionality of excel for 0.25.0.

But I also wanted to mention that from a user perspective, I wouldn't mind that some options behave differently between csv and excel (in the end, they are different formats with different capabilities). Whether this is possible/desirable from a code perspective, don't know the parsing code well enough for that.

gfyoung · 2019-07-15T21:15:38Z

> I wouldn't mind that some options behave differently between csv and excel (in the end, they are different formats with different capabilities)

> Whether this is possible/desirable from a code perspective, don't know the parsing code well enough for that

It's definitely possible, but I would want more feedback from users, hence why I suggested the warning. That way we can draw people's attention to it (maybe even reference the two issues).

isavelli · 2024-06-26T20:46:55Z

This problem seems to have been reintroduced; the following code generates an error:

import pandas
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws['A7'] = 1
ws['A8'] = 2
ws['A9'] = 3
wb.save("data.xlsx")

df = pandas.read_excel(
    "data.xlsx",
    sheet_name="Sheet",
    usecols=[0, 1],
    header=None,
    names=["foo", "bar"]
)

print(df)

snordhausen changed the title ~~read_excel in version 0.25.0rc0 treats empty empty columns differently~~ read_excel in version 0.25.0rc0 treats empty columns differently Jul 5, 2019

simonjayhawkins added the IO Excel read_excel, to_excel label Jul 5, 2019

jorisvandenbossche added this to the 0.25.0 milestone Jul 15, 2019

WillAyd mentioned this issue Jul 15, 2019

RLS: 0.25.0 #24950

Closed

WillAyd mentioned this issue Jul 16, 2019

Reallow usecols to reference OOB indices - reverts 25623 #27426

Merged

jreback closed this as completed in #27426 Jul 17, 2019
