Scatter plot with colour_by and size_by variables #16827

nipunbatra · 2017-07-05T01:30:22Z

Problem description

Use case: Say we have a DataFrame df with four columns: a, b, c and d. We want to make a scatter plot with x=a, y=b, color_by=c and size_by=d. If c is categorical, we get a discrete set of colours and a corresponding legend; otherwise we get a continuous colour scale. size_by determines the size of each marker.

Such cases are often needed as evidenced by questions on Stack Overflow.

Image below shows an example.

I wrote a blog post (hand-wavy at times, particularly for the marker size legend) on how to generate such a plot in pandas. The code below shows how to make a similar plot.

Code Sample, a copy-pastable example if possible

import matplotlib.pyplot as plt
import pandas as pd
midwest = pd.read_csv("http://goo.gl/G1K41K")
# Filtering
midwest = midwest[midwest.poptotal < 50000]

fig, ax = plt.subplots()
groups = midwest.groupby('state')

# Tableau 20 Colors
tableau20 = [(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),  
             (44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),  
             (148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),  
             (227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),  
             (188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]
             

# Rescale to values between 0 and 1 
for i in range(len(tableau20)):  
    r, g, b = tableau20[i]  
    tableau20[i] = (r / 255., g / 255., b / 255.)

colors = tableau20[::2]

# Plotting each group 
for i, (name, group) in enumerate(groups):
    group.plot(kind='scatter', x='area', y='poptotal', ylim=(0, 50000), xlim=(0., 0.1),
               s=10 + group['popdensity'] * 0.1,  # hand-wavy :(
               label=name, ax=ax, color=colors[i])

# Legend for State colours
lgd = ax.legend(numpoints=1, loc=1, borderpad=1, 
            frameon=True, framealpha=0.9, title="state")
# Note: Legend.legendHandles was renamed to legend_handles in Matplotlib 3.7
for handle in lgd.legendHandles:
    handle.set_sizes([100.0])

# Make a legend for popdensity. Hand-wavy. Error prone!
pws = (pd.cut(midwest['popdensity'], bins=4, retbins=True)[1]).round(0)
for pw in pws:
    plt.scatter([], [], s=(pw**2)/2e4, c="k",label=str(pw))

h, l = plt.gca().get_legend_handles_labels()
plt.legend(h[5:], l[5:], labelspacing=1.2, title="popdensity", borderpad=1, 
            frameon=True, framealpha=0.9, loc=4, numpoints=1)

plt.gca().add_artist(lgd)

This produces the following plot:

I was wondering whether the use case is important enough to introduce changes in the scatter plot API, so that color_by and size_by arguments can be passed. I understand that the same set of arguments is used across different plots, and that size_by will not make sense for many of them.

If this will not make it into the API, it might still be useful to have a detailed example in the cookbook, or a function that works out of the box for such plots.
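
For discussion, here is a minimal sketch of the decision described in the problem statement: a non-numeric colour_by column is treated as categorical, a numeric one as continuous. The function name encode_color_by and its return shape are hypothetical, not a pandas or matplotlib API, and plain lists stand in for DataFrame columns so the sketch runs without dependencies.

```python
from numbers import Real


def encode_color_by(values):
    """Return (kind, encoded, categories) for a colour_by column.

    Hypothetical helper, not part of pandas or matplotlib.
    """
    numeric = all(isinstance(v, Real) and not isinstance(v, bool) for v in values)
    if numeric:
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid dividing by zero for constant columns
        # Continuous: rescale to [0, 1] so a colormap could be applied.
        return "continuous", [(v - lo) / span for v in values], None
    # Categorical: one integer code per distinct value, in first-seen order.
    categories = list(dict.fromkeys(values))
    codes = {cat: i for i, cat in enumerate(categories)}
    return "categorical", [codes[v] for v in values], categories


kind, encoded, categories = encode_color_by(["IL", "IN", "IL", "WI"])
```

The categorical branch would drive a discrete legend (one entry per category), while the continuous branch would drive a colorbar.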

TomAugspurger · 2017-07-05T12:43:47Z

I think there's some overlap here with

Is that correct? Can you go through those and explain the differences to this one?

IMO, I'd like to see pandas handle the size argument (#8244) and a bit more flexible color (#16485). I don't really think we should expand the grouped / facetted plotting API. Seaborn does a much better job at that.

TomAugspurger · 2017-07-12T19:41:45Z

@nipunbatra do you have any interest in working on either of those two issues? Or in the meantime, submitting that cookbook recipe would be helpful.

VincentAntoine · 2017-09-01T21:39:37Z

Hi,

I would like to help with this. This would be my first contribution to open source, so I might need guidance in the process. I'll give it a try and should hopefully be back with something in the next two weeks.

TomAugspurger · 2017-09-01T21:49:25Z

@VincentAntoine great! Just holler early and often if you get stuck.

VincentAntoine · 2017-09-05T22:16:20Z

Hi,

I have started to look around and fiddle with the ScatterPlot class.
ScatterPlot before any modification:

class ScatterPlot(PlanePlot):
    _kind = 'scatter'

    def __init__(self, data, x, y, s=None, c=None, **kwargs):
        if s is None:
            # hide the matplotlib default for size, in case we want to change
            # the handling of this argument later
            s = 20

I changed this bit to detect whether s is a column name and, if so, to grab and normalize the data in the corresponding column. I think a maximum size of 200 pts is a decent default, but the most appropriate maximum bubble size depends on the number of points to display, so I think a new parameter s_grow = 1 is needed to let users make the bubbles bigger or smaller and find the right scaling for each situation.

So this is what I wrote so far:

class ScatterPlot(PlanePlot):
    _kind = 'scatter'
 
    def __init__(self, data, x, y, s=None, s_grow=1, c=None, **kwargs):
        if s is None:
            # Set default size if no argument is given
            s = 20
        elif is_hashable(s) and s in data.columns:
            # If s is a label of a column of the df, grab and normalize the data to 200 * s_grow
            size_data = data.loc[:, s].values
            if is_numeric_dtype(size_data):
                s = 200 * s_grow * size_data / size_data.max()
            else:
                raise TypeError('s must be of numeric dtype')

So s can be any of the following when creating a scatter plot from a DataFrame:

None --> defaults to s = 20
Column name --> we grab the data in the corresponding column as bubble sizes
Scalar --> fixed bubble size
array --> used as bubble sizes

There is a possible confusion, if for instance "s=50" and 50 is a column name --> should we use a fixed bubble size of 50, or use the data in the column 50? This does not seem like a serious problem to me, and I think it makes more sense to use the data in the column 50 in this case.

I will now make the bubble size scale. Your feedback will be greatly appreciated!

Thanks
Vincent

TomAugspurger · 2017-09-05T23:36:32Z

> There is a possible confusion, if for instance "s=50" and 50 is a column name --> should we use a fixed bubble size of 50,

I think we always default to a scalar 50 in this case. They can specify s=df['50'] if they absolutely want a column.
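
A minimal sketch of that precedence rule, assuming a hypothetical helper resolve_size and a plain dict of columns standing in for the DataFrame: a number is always taken as a fixed size, and only non-numeric hashable labels are looked up as columns.

```python
from collections.abc import Hashable
from numbers import Real

DEFAULT_SIZE = 20  # the default used in the ScatterPlot snippet above


def resolve_size(s, columns):
    """Resolve the `s` argument to a scalar or a list of per-point sizes.

    Hypothetical helper; `columns` maps column labels to lists of values.
    """
    if s is None:
        return DEFAULT_SIZE
    if isinstance(s, Real):
        # A number is always a fixed size, even if a column has that label.
        return s
    if isinstance(s, Hashable) and s in columns:
        return list(columns[s])
    if isinstance(s, str):
        # A string that is not a column label is almost certainly a typo.
        raise KeyError(s)
    # Anything else is assumed to already be a sequence of sizes.
    return list(s)


columns = {"popdensity": [10.0, 20.0], 50: [1.0, 2.0]}
```

With this rule, resolve_size(50, columns) gives the scalar 50 even though a column labelled 50 exists; passing the column's values explicitly is the way to opt in to per-point sizes.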

FWIW, I think a simple / partial solution is great here. If s_grow is too awkward / difficult to implement, feel free to leave it out. A 90% solution is just fine, and if people need to customize further they can use matplotlib directly.

VincentAntoine · 2017-09-16T22:05:35Z

Hey!

I've made progress with the sizes, haven't looked at colors yet. Taking the same data as @nipunbatra in his example above, this is what I have now:

import matplotlib.pyplot as plt
import pandas as pd

# fetching and filtering data
midwest = pd.read_csv("http://goo.gl/G1K41K")
midwest = midwest[midwest['poptotal'] < 50000]

# plotting
midwest.plot(kind='scatter', x='area', y='poptotal', s='popdensity',
             title='Population vs area and density')
plt.show()

And if you want to make the bubbles smaller or bigger, you can use s_grow (default 1) to change that:

midwest.plot(kind='scatter', x='area', y='poptotal', s='popdensity',
             title='Population vs area and density', s_grow=0.2)
plt.show()

Here is what I did so far:

1: grab the data, normalize the data appropriately to get reasonable bubble sizes, pass that to matplotlib to make the bubble plot
2: make the legend

Grabbing & normalizing data

Compared to what I explained in my previous post, I only slightly modified the init method of the ScatterPlot class to turn s_grow, size_title, size_data_max and bubble_points (the default bubble max size of 200 points) into attributes of ScatterPlot instances, as that makes these 4 parameters easily accessible to the other methods when building the legend for the bubble sizes.

class ScatterPlot(PlanePlot):
    _kind = 'scatter'

    def __init__(self, data, x, y, s=None, s_grow=1, c=None, **kwargs):
        if s is None:
            # Set default size if no argument is given
            s = 20
        elif is_hashable(s) and s in data.columns:
            # Handle the case where s is a label of a column of the df
            # The data is normalized to 200 * s_grow
            size_data = data.loc[:, s].values
            if is_numeric_dtype(size_data):
                self.size_title = s
                self.size_data_max = size_data.max()
                self.s_grow = s_grow
                self.bubble_points = 200
                s = self.bubble_points * s_grow * size_data / self.size_data_max
            else:
                raise TypeError('s must be of numeric dtype')
        super(ScatterPlot, self).__init__(data, x, y, s=s, **kwargs)
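
To make the normalization concrete, here is a small dependency-free sketch of the same formula. The function name bubble_sizes is hypothetical, and the guard against a non-positive maximum is an addition that the snippet above does not have. Note that matplotlib interprets s as marker area in points squared, so this scaling makes bubble area, not diameter, proportional to the data.

```python
def bubble_sizes(values, s_grow=1, bubble_points=200):
    """Scale values so that the largest one maps to bubble_points * s_grow.

    Hypothetical stand-alone version of the normalization in __init__.
    """
    data_max = max(values)
    if data_max <= 0:
        # Added guard (not in the original snippet): avoid dividing by zero.
        raise ValueError("sizes need a positive maximum")
    return [bubble_points * s_grow * v / data_max for v in values]


sizes = bubble_sizes([25, 50, 100])
smaller = bubble_sizes([25, 50, 100], s_grow=0.2)
```

Here the largest value maps to 200 points squared by default, and to roughly 40 with s_grow=0.2.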

Building the legend

Before actually building the legend, we must define the sizes and labels of the bubbles to include in it. For instance, if we want 4 bubbles in our legend, a straightforward approach is to use data_max, 0.75 * data_max, 0.5 * data_max and 0.25 * data_max. However, as you can see in the graph built by @nipunbatra, this leads to values like 82, 733, 1382..., which is not as nice as having labels with "round" values like in the graph produced by Altair (see @nipunbatra's blog post).

I have therefore tried to achieve this nice behaviour and to build a legend with round values. In order to make a legend with 4 bubbles, we therefore need to define 4 bubble sizes and the 4 corresponding labels, with 'round' values for the labels, the biggest of which is close to the maximum of the data.

For this I first need a helper function to extract the mantissa (or coefficient) and exponent of a number in decimal base.

import re

# class ScatterPlot(PlanePlot):
    def _sci_notation(self, num):
        scientific_notation = '{:e}'.format(num)
        expnt = float(re.search(r'e([+-]\d*)$', scientific_notation).groups()[0])
        coef = float(re.search(r'^([+-]?\d\.\d)', scientific_notation).groups()[0])
        return coef, expnt

Example: _sci_notation(782489.89247823) returns (7.8, 5.0)
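
A regex-free variant is also possible, sketched here as a module-level function (the name sci_parts is hypothetical). Unlike _sci_notation, it keeps the full six-decimal mantissa produced by the 'e' format instead of truncating to one decimal, and it returns the exponent as an int.

```python
def sci_parts(num):
    """Split a finite number into (mantissa, exponent) in base 10.

    Hypothetical alternative to _sci_notation; relies on the fixed layout
    of the 'e' format, e.g. '7.824899e+05'.
    """
    mantissa, exponent = "{:e}".format(num).split("e")
    return float(mantissa), int(exponent)


coef, expnt = sci_parts(782489.89247823)
```

Because it does not depend on any instance state, a function like this could live outside the ScatterPlot class.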

Then, given a data_max, s_grow and bubble_points, this function finds 4 appropriate sizes and labels for the legend:

import numpy as np

# class ScatterPlot(PlanePlot):
    def _legend_bubbles(self, data_max, s_grow, bubble_points):
        coef, expnt = self._sci_notation(data_max)
        labels_catalog = {
            (9, 10) : [10, 5, 2.5, 1],
            (7, 9) : [8, 4, 2, 0.5],
            (5.5, 7) : [6, 3, 1.5, 0.5],
            (4.5, 5.5) : [5, 2, 1, 0.2],
            (3.5, 4.5) : [4, 2, 1, 0.2],
            (2.5, 3.5) : [3, 1, 0.5, 0.2],
            (1.5, 2.5) : [2, 1, 0.5, 0.2],
            (0, 1.5) : [1, 0.5, 0.25, 0.1]
        }
        for (lower_bound, upper_bound), multipliers in labels_catalog.items():
            if lower_bound <= coef < upper_bound:
                labels = np.array(multipliers) * 10**expnt
                sizes = list(bubble_points * s_grow * labels / data_max)
                labels = ['{:g}'.format(l) for l in labels]
                return (sizes, labels)

Example: _legend_bubbles(data_max = 2678.0588199999, s_grow = 1, bubble_points = 200) returns:
([224.04287595147829, 74.680958650492769, 37.340479325246385, 14.936191730098553],
['3000', '1000', '500', '200'])

The first list gives 4 bubbles sizes (in points) and the second list the 4 corresponding labels.

In our example with population density, the maximum of popdensity is 2678.0588199999. So what happens is:

we compute mantissa (2.6) and exponent (3.0)
2.6 lies between 2.5 and 3.5, so in the labels_catalog we pick [3, 1, 0.5, 0.2]
we compute the labels which are 3e3, 1e3, 0.5e3 and 0.2e3
we compute bubble sizes corresponding to these labels, that is bubble_points * s_grow * 3e3 / 2678.0588199999 etc
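
The steps above can be sketched as a stand-alone function. This is a simplified, dependency-free rewrite for illustration, not the code from the PR: the name legend_bubbles is hypothetical, the catalog is copied from _legend_bubbles, and the mantissa is read from the 'e' format directly rather than through the regex helper.

```python
# (lower, upper) mantissa bounds -> label multipliers, copied from the post.
LABELS_CATALOG = [
    ((9, 10), [10, 5, 2.5, 1]),
    ((7, 9), [8, 4, 2, 0.5]),
    ((5.5, 7), [6, 3, 1.5, 0.5]),
    ((4.5, 5.5), [5, 2, 1, 0.2]),
    ((3.5, 4.5), [4, 2, 1, 0.2]),
    ((2.5, 3.5), [3, 1, 0.5, 0.2]),
    ((1.5, 2.5), [2, 1, 0.5, 0.2]),
    ((0, 1.5), [1, 0.5, 0.25, 0.1]),
]


def legend_bubbles(data_max, s_grow=1, bubble_points=200):
    """Return (sizes, labels) for four legend bubbles with round labels."""
    # Step 1: mantissa and exponent of the data maximum.
    mantissa, exponent = "{:e}".format(data_max).split("e")
    coef, scale = float(mantissa), 10.0 ** int(exponent)
    # Step 2: pick the catalog row whose bounds contain the mantissa.
    for (lower, upper), multipliers in LABELS_CATALOG:
        if lower <= coef < upper:
            # Step 3: round label values; step 4: matching bubble sizes.
            values = [m * scale for m in multipliers]
            sizes = [bubble_points * s_grow * v / data_max for v in values]
            labels = ["{:g}".format(v) for v in values]
            return sizes, labels
    raise ValueError("data_max must be a positive finite number")


sizes, labels = legend_bubbles(2678.0588199999)
```

For the popdensity maximum this reproduces the labels '3000', '1000', '500' and '200' from the example above, with the largest bubble at about 224 points squared.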

Finally, we put all the pieces together in a _make_legend method which is specific to the ScatterPlot class. After building the legend for the bubbles, we call the _make_legend method of the parent.

#class ScatterPlot(PlanePlot):
    def _make_legend(self):
        if hasattr(self, "size_title"):
            ax = self.axes[0]
            size_title = self.size_title
            data_max = self.size_data_max
            s_grow = self.s_grow
            bubble_points = self.bubble_points
            import matplotlib.legend as legend
            sizes, labels = self._legend_bubbles(data_max, s_grow, bubble_points)
            bubbles = []
            for size in sizes:
                bubbles.append(ax.scatter([], [], s=size, color='white', edgecolor='gray'))
            bubble_legend = legend.Legend(ax, handles=bubbles, labels=labels, loc='lower right')
            bubble_legend.set_title(size_title)
            ax.add_artist(bubble_legend)
        super()._make_legend()

I also have a few questions:

My helper function to grab the mantissa and exponent of a number should probably not live in the ScatterPlot class, but I don't really know where to put it. Any ideas?
If we use large values for s_grow and the bubbles become quite large, the bubbles in the legend become so big that they overlap and/or hide the labels and legend title. We can make the legend layout somewhat adaptive with respect to s_grow by scaling labelspacing, borderpad and handletextpad proportionally to sqrt(s_grow), but the result is not always very good. A more flexible approach would be to put the legend in a separate subplot. We could then place the legend outside the main plot axes, like in the graph produced by Altair, which gives many more layout options. Is this an approach you would like me to try, or do you prefer keeping it as it is?
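
The square-root scaling mentioned in the second question follows from s being an area: multiplying s by s_grow grows the bubble diameter by sqrt(s_grow). A minimal sketch, assuming a hypothetical helper legend_spacing and matplotlib's default values for the three legend parameters (0.5, 0.4 and 0.8, in font-size units); never shrinking below the defaults is a design assumption of this sketch, not something proposed in the thread.

```python
import math

# Matplotlib's default legend spacing, in units of the font size.
BASE_SPACING = {"labelspacing": 0.5, "borderpad": 0.4, "handletextpad": 0.8}


def legend_spacing(s_grow=1):
    """Return legend keyword arguments scaled by sqrt(s_grow).

    Hypothetical helper; spacing is never reduced below the defaults.
    """
    factor = math.sqrt(max(s_grow, 1))
    return {name: value * factor for name, value in BASE_SPACING.items()}


spacing = legend_spacing(s_grow=4)
```

The resulting dict could be unpacked into the legend call, for example ax.legend(handles, labels, **legend_spacing(s_grow)).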

How does this look to you?

Thanks!
Vincent

TomAugspurger · 2017-09-17T14:17:29Z

It may be easiest to make a PR at this point so we can review the code.

My only general comment is we shouldn't worry about edge cases, like the values getting too large, with this high-level API. If people need to customize it further, they can just use matplotlib directly.

sorenwacker · 2020-02-20T17:20:16Z

Is anyone still working on this? I miss this functionality. If the column contains strings, the method should use distinct colors, similar to what happens in plotly plots. Same with shapes.

michaelmannino · 2024-07-12T13:09:47Z

take

michaelmannino · 2024-07-12T23:20:03Z

Hi all who are still interested in this topic: I have completed the general functionality, and it is up in my PR if you would like to take a look.

My only question is: what is the best way to choose default colors for strings here? Currently, I am pulling the largest list of mpl's colors and choosing from it randomly, since iterating through it in order tends to pick colors that are too similar.
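
One deterministic alternative to random selection, offered only as a sketch and not as what the PR does: sort the distinct strings and assign colours from a fixed qualitative palette, cycling if there are more categories than colours. The helper name assign_colors is hypothetical; the hex codes are matplotlib's tab10 palette, the same ten colours as tableau20[::2] in the original example.

```python
from itertools import cycle

# Matplotlib's tab10 qualitative palette as hex strings.
TAB10 = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
         "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]


def assign_colors(labels, palette=TAB10):
    """Map each distinct label to a palette colour, reproducibly.

    Hypothetical helper: categories are sorted so the same data always
    gets the same colours; the palette repeats after it is exhausted.
    """
    categories = sorted(set(labels))
    mapping = dict(zip(categories, cycle(palette)))
    return [mapping[label] for label in labels], mapping


colors, mapping = assign_colors(["WI", "IL", "WI", "OH"])
```

A qualitative palette keeps neighbouring categories visually distinct, and sorting makes the plot reproducible across runs, at the cost of repeating colours once there are more than ten categories.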

TomAugspurger added the Docs label Jul 12, 2017

TomAugspurger added this to the Next Major Release milestone Jul 12, 2017

TomAugspurger added Visualization plotting Difficulty Novice labels Jul 12, 2017

VincentAntoine mentioned this issue Sep 18, 2017

Feat/scatter by size #17582

Closed

TomAugspurger added the good first issue label Oct 11, 2017

jreback removed the Difficulty Novice label Dec 15, 2017

VincentAntoine mentioned this issue Apr 1, 2018

Feat/scatter by size #20572

Closed

3 tasks

TomAugspurger mentioned this issue Jul 6, 2018

ENH/VIS: Pass DataFrame column to size argument in df.scatter #8244

Closed

VincentAntoine mentioned this issue Aug 17, 2018

Feat/bubble plot #22403

Closed

5 tasks

jreback modified the milestones: Contributions Welcome, 0.24.0 Sep 18, 2018

jreback modified the milestones: 0.24.0, Contributions Welcome Dec 2, 2018

jbrockmendel removed the Effort Low label Oct 21, 2019

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

github-actions bot assigned michaelmannino Jul 12, 2024

michaelmannino mentioned this issue Jul 12, 2024

ENH: DataFrame.plot.scatter argument c now accepts a column of strings, where rows with the same string are colored identically #59239

Merged

6 tasks

WillAyd closed this as completed in #59239 Sep 3, 2024
