pandas pd.read_excel() with sheet_name=None returns wrong result, and dtype does not allow nonstring dtypes #449

Dr-Irv · 2020-10-01T21:16:46Z

Environment data

Language Server version: 2020.9.7
OS and version: Windows 10
Python version (& distribution if applicable, e.g. Anaconda): Anaconda Python 3.7.5

Expected behaviour

No error

Actual behaviour

import pandas as pd
from typing import Dict

dfd: Dict[str, pd.DataFrame] = pd.read_excel("simple.xlsx", sheet_name=None)
print(dfd)

dfd2: pd.DataFrame = pd.read_excel("simple.xlsx", dtype=object)
print(dfd2)
print(dfd2.dtypes)

reports

Expression of type "Dict[int | str, DataFrame]" cannot be assigned to declared type "Dict[str, DataFrame]"
  TypeVar "_KT" is invariant
    Type "int | str" cannot be assigned to type "str"

No overloads for "pd.read_excel("simple.xlsx", dtype=object)" match parameters
  Argument types: (Literal['simple.xlsx'], Type[object])

Two issues here:

In the first case, when sheet_name=None pandas will always return a Dict with the name of the sheets as the key, which is a string. This is true even if there is only one sheet. Only time int is returned is when you ask for explicit sheet numbers (I think).
The dtype parameter can be any valid type

Possible fix:

diff _base.pyi.ORIG _base.pyi
2c2
< from pandas._typing import Scalar
---
> from pandas._typing import DType, Scalar
4c4
< from typing import Any, Callable, Dict, Optional, Sequence, Union, overload
---
> from typing import Any, Callable, Dict, Literal, Optional, Sequence, Union, overload
9c9
<     sheet_name: Optional[Sequence[Union[int, str]]],
---
>     sheet_name: Sequence[Union[int, str]],
15c15
<     dtype: Union[str, Dict[str, Any]] = ...,
---
>     dtype: Union[str, Dict[str, Any], DType] = ...,
280c280
<     dtype: Union[str, Dict[str, Any]] = ...,
---
>     dtype: Union[str, Dict[str, Any], Dtype] = ...,
298a299,326
>
> @overload
> def read_excel(
>     filepath: str,
>     sheet_name: Literal[None],
>     header: Optional[Union[int, Sequence[int]]] = ...,
>     names: Optional[Sequence[str]] = ...,
>     index_col: Optional[Union[int, Sequence[int]]] = ...,
>     usecols: Optional[Union[int, str, Sequence[Union[int, str, Callable]]]] = ...,
>     squeeze: bool = ...,
>     dtype: Union[str, Dict[str, Any], DType] = ...,
>     engine: Optional[str] = ...,
>     converters: Optional[Dict[Union[int, str], Callable]] = ...,
>     true_values: Optional[Sequence[Scalar]] = ...,
>     false_values: Optional[Sequence[Scalar]] = ...,
>     skiprows: Optional[Union[Sequence[int], int, Callable]] = ...,
>     nrows: Optional[int] = ...,
>     na_values = ...,
>     keep_default_na: bool = ...,
>     verbose: bool = ...,
>     parse_dates: Union[bool, Sequence, Dict[str, Sequence]] = ...,
>     date_parser: Optional[Callable] = ...,
>     thousands: Optional[str] = ...,
>     comment: Optional[str] = ...,
>     skipfooter: int = ...,
>     convert_float: bool = ...,
>     mangle_dupe_cols: bool = ...,
> ) -> Dict[str, DataFrame]:

The text was updated successfully, but these errors were encountered:

Dr-Irv · 2020-10-05T15:59:18Z

I reported the above with version 2020.9.7, and you put the label "fixed in next version" on it, but now it wasn't fixed in 2020.9.8. In the meantime, all of the use cases below should work correctly when you get this right. The fix I listed above doesn't handle them all.

import pandas as pd
from typing import Dict

dfd: Dict[str, pd.DataFrame] = pd.read_excel("simple.xlsx", sheet_name=None)
print(dfd)

dfd2: pd.DataFrame = pd.read_excel("simple.xlsx", dtype=object)
print(dfd2)
print(dfd2.dtypes)

dfd3: pd.DataFrame = pd.read_excel("simple.xlsx", sheet_name="Sheet1")
print(dfd3)

dfd4: Dict[int, pd.DataFrame] = pd.read_excel("simple.xlsx", sheet_name=[0])
print(dfd4)

dfd5: Dict[str, pd.DataFrame] = pd.read_excel("simple.xlsx", sheet_name=["Sheet1"])
print(dfd5)

erictraut · 2020-10-05T16:04:43Z

This change didn't made it in to last week's release on Wednesday. We also did a hot-fix release on Friday to fix a specific regression, but that was a very targeted fix.

This issues will be addressed in this week's release (typically on Wednesdays). We will mark the issue as closed once it is included in a release. It will also appear in the release notes.

Dr-Irv · 2020-10-05T17:23:08Z

This issues will be addressed in this week's release (typically on Wednesdays). We will mark the issue as closed once it is included in a release. It will also appear in the release notes.

Just make sure that the 5 test cases I just listed are covered, which are a superset of what was originally reported.

jakebailey · 2020-10-05T20:28:19Z

All of those pass in "basic" in our current unreleased state except for the third. Maybe because the Sequence[str] overload for sheet_name is being picked before str as str is a sequence of itself. Tricky.

I'm not really sure how we were supposed to have fixed a superset of the reported issues, though, if by definition that means some issues were not included...

Dr-Irv · 2020-10-05T22:15:32Z

I've got a set of changes that work for everything except the fourth one, which is less likely to be used (IMHO).

In trying to get it to work, I uncovered the other issues. Let me know how you want to manage.

I think the definition of Sequence[str] matches both str and List[str] . See what happens if you change Sequence to List .

jakebailey · 2020-10-05T22:17:47Z

List is a bad parameter type, because it requires that it explicitly be of the list type. It should only be used as a parameter if the implementation only works with the real list class (or subclasses).

Dr-Irv · 2020-10-05T22:23:12Z

Looking at the pandas source for read_excel, if it is a sequence, it MUST be a list. So you're safe to use List there.

jakebailey · 2020-10-05T22:34:22Z

Thanks for checking; we can flip that back in read_excel.

jakebailey · 2020-10-06T00:50:22Z

That should be fixed in the next release; all five examples typecheck.

jakebailey · 2020-10-07T22:44:21Z

This issue has been fixed in version 2020.10.0, which we've just released. You can find the changelog here: https://github.com/microsoft/pylance-release/blob/master/CHANGELOG.md#2020100-7-october-2020

github-actions bot added the triage label Oct 1, 2020

jakebailey added the typestub Issue relating to our bundled type stubs label Oct 1, 2020

jakebailey assigned gramster Oct 1, 2020

github-actions bot removed the triage label Oct 1, 2020

jakebailey added the fixed in next version (main) A fix has been implemented and will appear in an upcoming version label Oct 2, 2020

jakebailey closed this as completed Oct 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pandas pd.read_excel() with sheet_name=None returns wrong result, and dtype does not allow nonstring dtypes #449

pandas pd.read_excel() with sheet_name=None returns wrong result, and dtype does not allow nonstring dtypes #449

Dr-Irv commented Oct 1, 2020

Dr-Irv commented Oct 5, 2020

erictraut commented Oct 5, 2020

Dr-Irv commented Oct 5, 2020

jakebailey commented Oct 5, 2020 •

edited

Loading

Dr-Irv commented Oct 5, 2020

jakebailey commented Oct 5, 2020

Dr-Irv commented Oct 5, 2020

jakebailey commented Oct 5, 2020

jakebailey commented Oct 6, 2020

jakebailey commented Oct 7, 2020

pandas pd.read_excel() with sheet_name=None returns wrong result, and dtype does not allow nonstring dtypes #449

pandas pd.read_excel() with sheet_name=None returns wrong result, and dtype does not allow nonstring dtypes #449

Comments

Dr-Irv commented Oct 1, 2020

Environment data

Expected behaviour

Actual behaviour

Dr-Irv commented Oct 5, 2020

erictraut commented Oct 5, 2020

Dr-Irv commented Oct 5, 2020

jakebailey commented Oct 5, 2020 • edited Loading

Dr-Irv commented Oct 5, 2020

jakebailey commented Oct 5, 2020

Dr-Irv commented Oct 5, 2020

jakebailey commented Oct 5, 2020

jakebailey commented Oct 6, 2020

jakebailey commented Oct 7, 2020

jakebailey commented Oct 5, 2020 •

edited

Loading