Meaning of color argument in DataFrame.plot.scatter() #16485

Closed
Dr-Irv opened this issue May 24, 2017 · 7 comments · Fixed by #59239

@Dr-Irv
Contributor

Dr-Irv commented May 24, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame.from_records([[1, 1, 'g'], [2, 2, 'r']], columns=['x', 'y', 'Color'])
df.plot.scatter(x='x', y='y', c='Color')

Problem description

The issue here is that it is not clear what the values in the column corresponding to the argument c of scatter should be. The example given at http://pandas.pydata.org/pandas-docs/stable/visualization.html#scatter-plot uses numerical values, but in this example, I just want red and green dots. With matplotlib, you can supply the colors as a vector.

IMHO, the API should be consistent. You should be able to specify the column names corresponding to the value of x, y, and the color. This would be especially useful if you have a pattern such as:

df[df.x>=1].plot.scatter(x='x', y='y', c='Color')

where you produce a scatter plot of selected rows.

The code in the simple example generates an error:

KeyError                                  Traceback (most recent call last)
C:\Anaconda3\envs\py36\lib\site-packages\matplotlib\colors.py in to_rgba(c, alpha)
    140     try:
--> 141         rgba = _colors_full_map.cache[c, alpha]
    142     except (KeyError, TypeError):  # Not in cache, or unhashable.

KeyError: ('o', None)

Expected Output

A plot with 2 points, one red and one green.
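
For comparison, the expected plot can be sketched by calling matplotlib's `Axes.scatter` directly, since matplotlib accepts a sequence of color names for `c`. This is only a workaround sketch under the assumption that bypassing `DataFrame.plot.scatter` is acceptable. The `selected` and `points` names are introduced here for illustration, and the non-interactive `Agg` backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame.from_records([[1, 1, "g"], [2, 2, "r"]], columns=["x", "y", "Color"])

# Same row-selection pattern as in the issue text.
selected = df[df.x >= 1]

fig, ax = plt.subplots()
# Pass the color column as a plain list of color names, one per point.
points = ax.scatter(selected["x"], selected["y"], c=selected["Color"].tolist())
ax.set_xlabel("x")
ax.set_ylabel("y")
```

Because the colors are taken from the same filtered frame as `x` and `y`, the three columns stay aligned after row selection.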

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: None
xarray: None
IPython: 6.0.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@VincentLa
Contributor

Oh interesting, I might try to see if I can pick this up

@TomAugspurger TomAugspurger added this to the Next Major Release milestone May 25, 2017
@amichaut

amichaut commented Jun 26, 2017

Hi, I think my comment is related (tell me if I'm wrong). Until version 0.19.2, the color argument in df.plot could be specified as an RGBA tuple. This is not supported in newer versions. Is that intentional?
In more detail, if I call for instance:

df.plot(color=(0.5,0.5,0.5))
plt.show()

I get the following error in newer versions:

/usr/local/lib/python2.7/dist-packages/matplotlib/colors.pyc in _to_rgba_no_colorcycle(c, alpha)
    192         # float)` and `np.array(...).astype(float)` all convert "0.5" to 0.5.
    193         # Test dimensionality to reject single floats.
--> 194         raise ValueError("Invalid RGBA argument: {!r}".format(orig_c))
    195     # Return a tuple to prevent the cached value from being modified.
    196     c = tuple(c.astype(float))

ValueError: Invalid RGBA argument: 0.5

But it was supported before.
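
The error message is consistent with the tuple being split into its elements before reaching matplotlib, so that each float is treated as a separate color. That reading is an assumption based on the traceback, not a confirmed diagnosis. The sketch below only shows how `matplotlib.colors.to_rgba` treats the whole tuple versus a single element; the variable names are illustrative.

```python
import matplotlib.colors as mcolors

gray = (0.5, 0.5, 0.5)

# The whole tuple is a valid RGB color; alpha defaults to 1.
whole = mcolors.to_rgba(gray)

# A single float taken from the tuple is not a valid color.
try:
    mcolors.to_rgba(gray[0])
    single_float_ok = True
except ValueError as exc:
    single_float_ok = False
    message = str(exc)  # the "Invalid RGBA argument" text seen in the traceback
```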

@jorisvandenbossche
Member

@amichaut I think this is fixed on master, see #16695 (and PR #16701). This will probably be released in 0.20.3

@scfrank

scfrank commented Mar 26, 2018

I just ran into the original bug (color names not being recognised or used, plus a cryptic error message) and would like to resurrect this issue. It's frustrating because this kind of example should work according to a lot of Stack Overflow answers (e.g. https://stackoverflow.com/questions/41069676/make-scatter-plot-and-color-points-with-colors-stored-in-data-frame). For a new pandas user, this is going to cause a lot of confusion.

I've traced the problem (I think) to _compute_plot_data in plotting/_core.py:

def _compute_plot_data(self):

AFAICT this function throws out non-numeric columns - this includes the column containing the string color values, so after this function, the dataframe no longer contains the 'color' column.

The minimal example in the original issue still results in:
  File "/venv3/lib/python3.6/site-packages/matplotlib/colors.py", line 166, in to_rgba
    rgba = _colors_full_map.cache[c, alpha]
KeyError: ('o', None)

which is again confusing, since one hasn't specified an 'o' color at all (I'm not sure where this default value is coming from).
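
The effect described for `_compute_plot_data` can be illustrated with the public `DataFrame.select_dtypes` method. This is a stand-in chosen for the sketch, not the internal code path itself, and the `numeric_only` name is hypothetical.

```python
import pandas as pd

df = pd.DataFrame.from_records([[1, 1, "g"], [2, 2, "r"]], columns=["x", "y", "Color"])

# Keeping only numeric columns drops the string-valued 'Color' column,
# so a later lookup of c='Color' can no longer find the column.
numeric_only = df.select_dtypes(include="number")
remaining = list(numeric_only.columns)
```

After this step only `x` and `y` remain, which matches the observation that the color column is gone by the time the plot is drawn.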

@sorenwacker

sorenwacker commented Sep 26, 2022

I think it would make more sense if the API interpreted either the color or c argument as a category/mappable that should be colored. Instead of adding a column with explicit colors, the column should contain either categories or numeric values that are then used to color the markers, and a legend should be added. This is similar to how Seaborn does it with the hue argument.

Current behaviour:

import pandas as pd
df = pd.DataFrame({'dataX': [3,79,90], 'dataY': [7,9,13], 'color': ['Shoe', 'Star', 'Shoe']})
df.plot.scatter('dataX', 'dataY', c='color')
> ValueError: 'c' argument must be a color, a sequence of colors, or a sequence of numbers, not ['Shoe' 'Star' 'Shoe']

Should generate something like:

sns.relplot(data=df, x='dataX', y='dataY', hue='color')

[image: the resulting seaborn plot]

This would be way more practical than the current behaviour IMO.
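
Until something like this is built in, a similar result can be sketched with plain matplotlib by drawing one scatter call per category. This is a minimal sketch and not the behaviour of any pandas release. It assumes matplotlib's default color cycle is an acceptable palette, and the `Agg` backend is set only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"dataX": [3, 79, 90], "dataY": [7, 9, 13],
                   "color": ["Shoe", "Star", "Shoe"]})

fig, ax = plt.subplots()
# One scatter call per category; each call takes the next color in the cycle.
for label, group in df.groupby("color"):
    ax.scatter(group["dataX"], group["dataY"], label=label)
ax.legend(title="color")

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
```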

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@michaelmannino
Contributor

take

@michaelmannino
Contributor

Hi to all who are still interested in this topic. I have completed the general functionality, and it is up in my PR if you would like to take a look.

My only question is: what is the best way to choose default colors for strings here? Currently, I am pulling the largest list of mpl's colors and choosing randomly, because iterating through the list in order tends to pick colors that are too similar.
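
One possible alternative to random selection, offered only as a sketch and not as what the PR does, is to cycle through a qualitative colormap such as matplotlib's `tab10`, whose entries are designed to be visually distinct. The helper name `categorical_colors` and its `cmap_name` parameter are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt


def categorical_colors(values, cmap_name="tab10"):
    """Map each distinct value to a palette color, in order of first appearance."""
    palette = plt.get_cmap(cmap_name).colors
    categories = list(dict.fromkeys(values))  # de-duplicate, keep order
    # Wrap around if there are more categories than palette entries.
    return {cat: tuple(palette[i % len(palette)]) for i, cat in enumerate(categories)}


mapping = categorical_colors(["Shoe", "Star", "Shoe"])
```

The mapping is deterministic, so repeated plots of the same data get the same colors. The trade-off is that colors repeat once the number of categories exceeds the palette size.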
