BUG: scatter plot with categorical data raises KeyError #16199

Closed
jorisvandenbossche opened this issue May 2, 2017 · 10 comments · Fixed by #16208
Labels
Bug Error Reporting Incorrect or improved errors from pandas
Milestone

Comments

@jorisvandenbossche
Member

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': pd.Categorical(['a', 'b', 'a', 'c'])})
df.plot(x='x', y='y', kind='scatter')

This raises KeyError: 'y' even though the column certainly exists, which can be very confusing.
Without kind='scatter' (just df.plot(x='x', y='y')), it raises the more informative TypeError: Empty 'DataFrame': no numeric data to plot

@jorisvandenbossche jorisvandenbossche added Bug Error Reporting Incorrect or improved errors from pandas labels May 2, 2017
@stangirala
Contributor

@jorisvandenbossche Seems like a simple fix that would check the types of x and y at https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py#L831?

@TomAugspurger
Contributor

@stangirala it may have to be earlier, somewhere around https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py#L335

The issue is that we drop non-numeric columns fairly early on, and by the time you get to https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py#L831, the data passed to that method no longer has the categorical column.
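
To illustrate why the lookup fails, here is a minimal sketch of the mechanism described above. It uses `DataFrame.select_dtypes` as a stand-in for the internal numeric filtering (an assumption; the plotting code may filter differently), and the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": pd.Categorical(["a", "b", "a", "c"]),
})

# Stand-in for the early filtering step (assumption): keep numeric columns only.
numeric_data = df.select_dtypes(include="number")
remaining = list(numeric_data.columns)  # the categorical column 'y' is gone

# Any later lookup of 'y' on the filtered frame fails with KeyError,
# even though 'y' exists in the frame the user passed in.
try:
    numeric_data["y"]
    lookup_error = None
except KeyError as exc:
    lookup_error = exc

print(remaining)
print(type(lookup_error).__name__)
```

The KeyError is therefore raised against the filtered frame, not the original one, which is why the message looks wrong from the user's side.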

The cleanest way might be to modify MPLPlot._compute_plot_data to check if self.x and self.y are not in numeric_data before setting self.data = numeric_data.
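
A rough sketch of that kind of check, written as a standalone function rather than as a change to `MPLPlot._compute_plot_data`. The helper name `check_plane_columns` and the error wording are hypothetical, and this is not necessarily what the eventual fix does; it only shows the idea of validating `x` and `y` before the non-numeric columns are dropped.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_plane_columns(data, x, y):
    """Hypothetical helper: fail early, with a clear message, when a
    column requested for a scatter-style plot is not numeric."""
    for name in (x, y):
        if name not in data.columns:
            # A genuinely missing column is still a KeyError.
            raise KeyError(name)
        if not is_numeric_dtype(data[name]):
            raise ValueError(
                f"scatter requires column {name!r} to be numeric, "
                f"got dtype {data[name].dtype}"
            )


df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": pd.Categorical(["a", "b", "a", "c"]),
})

try:
    check_plane_columns(df, "x", "y")
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Running the check against the original frame means the error can name the offending column and its dtype instead of reporting it as missing.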

@stangirala
Contributor

@TomAugspurger I see. But it looks like PlanePlot has checks on x and y and not MPLPlot, https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py#L770. The error in the code sample above passes the check in MPLPlot at https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py#L335 because x is numeric. So the check would go in PlanePlot right?

@jorisvandenbossche
Member Author

@stangirala Another interesting area to contribute to, though a much bigger issue, is better support for categorical data in plotting (so that the above example could actually work).

@stangirala
Contributor

@jorisvandenbossche Maybe you can open a new issue for it? :D It seems like we would have a flag for most plots that would wholesale convert categorical data to a label-to-integer mapping?

But I don't think it would make much sense for a scatter plot that requires an inherent ordering. But having equivalent categorical plots is a good idea, for example parallel_sets, #12341

stangirala added a commit to stangirala/pandas that referenced this issue May 3, 2017
Appropriately handles categorical data for dataframe scatter plots which
currently raises KeyError for categorical data
@jreback jreback added this to the 0.20.1 milestone May 3, 2017
@jorisvandenbossche
Member Author

There are already some issues about it, eg #12341

> But I don't think it would make much sense for a scatter plot that requires an inherent ordering.

Do you mean you think it won't make sense for a scatter plot to support categorical data? I think it could make sense, where you basically use the underlying codes of the categorical as the values to plot
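
As a minimal sketch of that idea (not something pandas plotting does in this thread), the integer codes of the categorical can serve as the plotted values, with the categories kept as axis labels. The variable names here are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": pd.Categorical(["a", "b", "a", "c"]),
})

# Integer code of each value: its position in the list of categories.
codes = df["y"].cat.codes
# Category labels in code order, usable as tick labels on the y axis.
labels = list(df["y"].cat.categories)

# The (x, code) pairs are what a scatter plot would actually draw.
points = list(zip(df["x"], codes))

print(points)   # [(1, 0), (2, 1), (3, 0), (4, 2)]
print(labels)   # ['a', 'b', 'c']
```

For an unordered categorical the vertical positions carry no meaning beyond grouping, which is the ordering concern raised earlier in the thread.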

@stangirala
Contributor

Oh, I meant: would someone want a scatter plot for categorical data when most of the time the categories don't have an ordering? In such a case, wouldn't someone rather use a box plot, for example, assuming the y data is numeric and x is categorical?

@TomAugspurger
Contributor

Supporting categoricals in scatter would be nice when you have a single observation per category. More like a dot plot.

@stangirala
Contributor

@TomAugspurger I see, a dot plot would make sense. I don't see an issue for this; should I open one?

@TomAugspurger
Contributor

TomAugspurger commented May 3, 2017 via email

@jreback jreback modified the milestones: 0.20.1, 0.20.2 May 5, 2017
@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.20.2 May 6, 2017
stangirala added a commit to stangirala/pandas that referenced this issue Jun 11, 2017
Appropriately handles categorical data for dataframe scatter plots which
currently raises KeyError for categorical data
@jreback jreback modified the milestones: 0.20.3, Next Major Release Jun 12, 2017
TomAugspurger pushed a commit that referenced this issue Jun 12, 2017
* BUG: Categorical scatter plot has KeyError #16199

Appropriately handles categorical data for dataframe scatter plots which
currently raises KeyError for categorical data

* Add to whatsnew
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue Jul 6, 2017
…ev#16208)

* BUG: Categorical scatter plot has KeyError pandas-dev#16199

Appropriately handles categorical data for dataframe scatter plots which
currently raises KeyError for categorical data

* Add to whatsnew

(cherry picked from commit 11d274f)
TomAugspurger pushed a commit that referenced this issue Jul 7, 2017
* BUG: Categorical scatter plot has KeyError #16199

Appropriately handles categorical data for dataframe scatter plots which
currently raises KeyError for categorical data

* Add to whatsnew

(cherry picked from commit 11d274f)
4 participants