Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pandas pd.read_excel() with sheet_name=None returns wrong result, and dtype does not allow nonstring dtypes #449

Closed
Dr-Irv opened this issue Oct 1, 2020 · 10 comments
Assignees
Labels
fixed in next version (main) A fix has been implemented and will appear in an upcoming version typestub Issue relating to our bundled type stubs

Comments

@Dr-Irv
Copy link

Dr-Irv commented Oct 1, 2020

Environment data

  • Language Server version: 2020.9.7
  • OS and version: Windows 10
  • Python version (& distribution if applicable, e.g. Anaconda): Anaconda Python 3.7.5

Expected behaviour

No error

Actual behaviour

import pandas as pd
from typing import Dict

dfd: Dict[str, pd.DataFrame] = pd.read_excel("simple.xlsx", sheet_name=None)
print(dfd)

dfd2: pd.DataFrame = pd.read_excel("simple.xlsx", dtype=object)
print(dfd2)
print(dfd2.dtypes)

reports

Expression of type "Dict[int | str, DataFrame]" cannot be assigned to declared type "Dict[str, DataFrame]"
  TypeVar "_KT" is invariant
    Type "int | str" cannot be assigned to type "str"

No overloads for "pd.read_excel("simple.xlsx", dtype=object)" match parameters
  Argument types: (Literal['simple.xlsx'], Type[object])

Two issues here:

  1. In the first case, when sheet_name=None pandas will always return a Dict with the name of the sheets as the key, which is a string. This is true even if there is only one sheet. Only time int is returned is when you ask for explicit sheet numbers (I think).
  2. The dtype parameter can be any valid type

Possible fix:

diff _base.pyi.ORIG _base.pyi
2c2
< from pandas._typing import Scalar
---
> from pandas._typing import DType, Scalar
4c4
< from typing import Any, Callable, Dict, Optional, Sequence, Union, overload
---
> from typing import Any, Callable, Dict, Literal, Optional, Sequence, Union, overload
9c9
<     sheet_name: Optional[Sequence[Union[int, str]]],
---
>     sheet_name: Sequence[Union[int, str]],
15c15
<     dtype: Union[str, Dict[str, Any]] = ...,
---
>     dtype: Union[str, Dict[str, Any], DType] = ...,
280c280
<     dtype: Union[str, Dict[str, Any]] = ...,
---
>     dtype: Union[str, Dict[str, Any], Dtype] = ...,
298a299,326
>
> @overload
> def read_excel(
>     filepath: str,
>     sheet_name: Literal[None],
>     header: Optional[Union[int, Sequence[int]]] = ...,
>     names: Optional[Sequence[str]] = ...,
>     index_col: Optional[Union[int, Sequence[int]]] = ...,
>     usecols: Optional[Union[int, str, Sequence[Union[int, str, Callable]]]] = ...,
>     squeeze: bool = ...,
>     dtype: Union[str, Dict[str, Any], DType] = ...,
>     engine: Optional[str] = ...,
>     converters: Optional[Dict[Union[int, str], Callable]] = ...,
>     true_values: Optional[Sequence[Scalar]] = ...,
>     false_values: Optional[Sequence[Scalar]] = ...,
>     skiprows: Optional[Union[Sequence[int], int, Callable]] = ...,
>     nrows: Optional[int] = ...,
>     na_values = ...,
>     keep_default_na: bool = ...,
>     verbose: bool = ...,
>     parse_dates: Union[bool, Sequence, Dict[str, Sequence]] = ...,
>     date_parser: Optional[Callable] = ...,
>     thousands: Optional[str] = ...,
>     comment: Optional[str] = ...,
>     skipfooter: int = ...,
>     convert_float: bool = ...,
>     mangle_dupe_cols: bool = ...,
> ) -> Dict[str, DataFrame]:
@github-actions github-actions bot added the triage label Oct 1, 2020
@jakebailey jakebailey added the typestub Issue relating to our bundled type stubs label Oct 1, 2020
@github-actions github-actions bot removed the triage label Oct 1, 2020
@jakebailey jakebailey added the fixed in next version (main) A fix has been implemented and will appear in an upcoming version label Oct 2, 2020
@Dr-Irv
Copy link
Author

Dr-Irv commented Oct 5, 2020

I reported the above with version 2020.9.7, and you put the label "fixed in next version" on it, but now it wasn't fixed in 2020.9.8. In the meantime, all of the use cases below should work correctly when you get this right. The fix I listed above doesn't handle them all.

import pandas as pd
from typing import Dict

dfd: Dict[str, pd.DataFrame] = pd.read_excel("simple.xlsx", sheet_name=None)
print(dfd)

dfd2: pd.DataFrame = pd.read_excel("simple.xlsx", dtype=object)
print(dfd2)
print(dfd2.dtypes)

dfd3: pd.DataFrame = pd.read_excel("simple.xlsx", sheet_name="Sheet1")
print(dfd3)

dfd4: Dict[int, pd.DataFrame] = pd.read_excel("simple.xlsx", sheet_name=[0])
print(dfd4)

dfd5: Dict[str, pd.DataFrame] = pd.read_excel("simple.xlsx", sheet_name=["Sheet1"])
print(dfd5)

@erictraut
Copy link
Contributor

This change didn't made it in to last week's release on Wednesday. We also did a hot-fix release on Friday to fix a specific regression, but that was a very targeted fix.

This issues will be addressed in this week's release (typically on Wednesdays). We will mark the issue as closed once it is included in a release. It will also appear in the release notes.

@Dr-Irv
Copy link
Author

Dr-Irv commented Oct 5, 2020

This issues will be addressed in this week's release (typically on Wednesdays). We will mark the issue as closed once it is included in a release. It will also appear in the release notes.

Just make sure that the 5 test cases I just listed are covered, which are a superset of what was originally reported.

@jakebailey
Copy link
Member

jakebailey commented Oct 5, 2020

All of those pass in "basic" in our current unreleased state except for the third. Maybe because the Sequence[str] overload for sheet_name is being picked before str as str is a sequence of itself. Tricky.

I'm not really sure how we were supposed to have fixed a superset of the reported issues, though, if by definition that means some issues were not included...

@Dr-Irv
Copy link
Author

Dr-Irv commented Oct 5, 2020

I've got a set of changes that work for everything except the fourth one, which is less likely to be used (IMHO).

In trying to get it to work, I uncovered the other issues. Let me know how you want to manage.

I think the definition of Sequence[str] matches both str and List[str] . See what happens if you change Sequence to List .

@jakebailey
Copy link
Member

List is a bad parameter type, because it requires that it explicitly be of the list type. It should only be used as a parameter if the implementation only works with the real list class (or subclasses).

@Dr-Irv
Copy link
Author

Dr-Irv commented Oct 5, 2020

Looking at the pandas source for read_excel, if it is a sequence, it MUST be a list. So you're safe to use List there.

@jakebailey
Copy link
Member

Thanks for checking; we can flip that back in read_excel.

@jakebailey
Copy link
Member

That should be fixed in the next release; all five examples typecheck.

@jakebailey
Copy link
Member

This issue has been fixed in version 2020.10.0, which we've just released. You can find the changelog here: https://github.com/microsoft/pylance-release/blob/master/CHANGELOG.md#2020100-7-october-2020

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fixed in next version (main) A fix has been implemented and will appear in an upcoming version typestub Issue relating to our bundled type stubs
Projects
None yet
Development

No branches or pull requests

4 participants