Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Should DataFrame constructor allow sets for index and columns? #47215

Closed
Dr-Irv opened this issue Jun 3, 2022 · 5 comments · Fixed by #47231
Closed

API: Should DataFrame constructor allow sets for index and columns? #47215

Dr-Irv opened this issue Jun 3, 2022 · 5 comments · Fixed by #47231
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors
Milestone

Comments

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jun 3, 2022

While looking into some typing issues, I found that the following code works:

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=set(["a", "b", "c"]))
print(df)

The result:

   b  a  c
0  1  2  3
1  4  5  6

We document that the columns and index arguments should be Index or "array-like". Certainly a set is not array-like!

Internally, we type those arguments as Axes, which is defined as Collection[Any], but in the pandas-stubs, we are more restrictive, not allowing sets to be passed.

I think we should disallow set as a type in the constructor, since it doesn't follow the documentation, and if you were to use it, the behavior is unclear. We could do one or more of the following:

  1. Add code in the DataFrame constructor to raise an error if index or columns are sets
  2. Change the type declaration for Axes to be more restrictive, i.e., not use Collection

I think we should definitely do (2), but I'm not sure if (1) would be considered a change in the API, since we never documented that sets are acceptable, but they did work (and possibly give funky behavior).

@jreback
Copy link
Contributor

jreback commented Jun 3, 2022

no sets are unordered but definition

we shouldn't allow this (it's possible we convert these though)

@rhshadrach
Copy link
Member

rhshadrach commented Jun 4, 2022

I don't think we should view (1) as an API change, assuming there is no indication in the documentation that sets are allowed. The API is what is documented as being supported, not whether or not it works by chance.

+1 on not allowing sets.

@simonjayhawkins
Copy link
Member

I don't think we should view (1) as an API change, assuming there is no indication in the documentation that sets are allowed. The API is what is documented as being supported, not whether or not it works by chance.

+1 on not allowing sets.

we are not that clear on our type definitions for the parameters section of the docstrings

the only mention of array-like in https://pandas.pydata.org/pandas-docs/stable/development/contributing_docstring.html#pandas-docstring-guide

If the exact type is not relevant, but must be compatible with a NumPy array, array-like can be specified.

Now, if a list is accepted here, then a list does not have say a ndim attribute, so compatible is therefore vague.

If compatible means that can be converted to a numpy array then np.array(set(["a", "b", "c"])) gives a 0-dim array which is not indexable like np.array(list(["a", "b", "c"]))

So for all intents and purposes, set is not array-like

However, I think we still sometimes deprecate things that happen to work by chance first to err on the side of caution.

For the stubs, disallowing sets is fine and the right thing to do.

+1 for deprecating not allowing sets.

@5j9
Copy link
Contributor

5j9 commented Dec 1, 2022

I belive this change is why I'm getting type checking warnings for using a tuple or dict.keys() as columns= in DataFrame constructor. Is this intentionally so, or should I open an issue for it?

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Dec 5, 2022

I belive this change is why I'm getting type checking warnings for using a tuple or dict.keys() as columns= in DataFrame constructor. Is this intentionally so, or should I open an issue for it?

Are you using pandas-stubs ? Those should work, so create an issue at https://github.com/pandas-dev/pandas-stubs/issues/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants