
DOC: develop a set of standard example DataFrames for use in docstring examples #19710

Open
jorisvandenbossche opened this issue Feb 15, 2018 · 13 comments
Labels
Docs Needs Discussion Requires discussion from core team before further action

Comments

@jorisvandenbossche
Member

Related to #19704. I didn't find an existing open issue, only a discussion mentioning this in #16520 (@datapythonista it was actually you then! I didn't realize that :-))

I think it would be good to have a set of standard DataFrames that we reuse throughout our docs (to start with in the docstrings, but we could actually also use a standardized set for the user guide):

  • Some small, more "realistic" dataframes would make examples easier to reason about than dummy random data, and add familiarity when reading multiple docstrings
  • Makes it easier for contributors to add examples to the docstring as they don't have to invent their own data each time

I don't think there will be "one example dataframe to rule them all", but it would be nice to have a set of them that can cover most of the use cases.
So we can post some ideas here and discuss them, trying to get to a list.

A side question is whether we want to always define them with code in the docstring, or want to have some example data loading capabilities (e.g. like seaborn, where examples always start with iris = sns.load_dataset("iris") or another dataset). It can also be a mixture of both, of course.
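As a rough illustration of the seaborn-style loading mentioned above, such a helper could look something like the sketch below. Note that `load_dataset` and the hard-coded `_SAMPLES` dict are hypothetical names for illustration only, not existing pandas API; the country values are the ones used later in this thread.

```python
import pandas as pd

# Hypothetical hard-coded sample data; not part of pandas itself.
_SAMPLES = {
    "countries": {
        "country": ["Belgium", "France", "Germany"],
        "population": [11.3, 64.3, 81.3],
    },
}

def load_dataset(name):
    """Return a fresh DataFrame for a named built-in sample,
    mimicking seaborn's sns.load_dataset() interface."""
    return pd.DataFrame(_SAMPLES[name])

countries = load_dataset("countries")
```

Returning a newly constructed DataFrame on each call (rather than a shared module-level object) avoids examples accidentally mutating each other's data.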

@TomAugspurger
Contributor

or want to have some example data loading capabilities

+1 to this.

@koaning

koaning commented Feb 20, 2018

For the upcoming pandas docstring sprint this seems like a nice one to pick up. Any preference on datasets that should be added?

iris, mtcars, chickweight ... others?

@TomAugspurger
Contributor

TomAugspurger commented Feb 20, 2018 via email

@datapythonista
Member

Personally, for the documentation samples I wouldn't use "real" datasets. We won't be showing more than 5 rows, so even something with 150 rows, like iris, seems too much.

To me it would make sense to have something like pandas.io.samples.Countries or pandas.io.samples.Animals with a mix of types, for example:

  • Country name: object
  • Continent: category
  • Population: int
  • GDP: float

I think everybody will quickly understand data about countries, or animals, or things like that, and I'd avoid something more specialized. I'd hardcode the data into the samples.py, and simply have something like 20 rows.
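A Countries sample with that mix of dtypes could be built from hard-coded data along these lines. This is only a sketch: the GDP figures are placeholders, and the population counts are scaled up from the millions used elsewhere in this thread.

```python
import pandas as pd

# Sketch of a hard-coded sample with the proposed dtype mix.
# GDP values are placeholders, not real data.
countries = pd.DataFrame({
    "country": ["Belgium", "France", "Germany", "Netherlands"],       # object
    "continent": pd.Categorical(["Europe"] * 4),                      # category
    "population": [11_300_000, 64_300_000, 81_300_000, 16_900_000],   # int
    "gdp": [494.7, 2582.5, 3693.2, 826.2],                            # float
})
```

Wrapping the continent column in `pd.Categorical` is what gives the frame a genuine category dtype, so docstrings that need categorical behaviour can reuse the same sample.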

For things like time series, another dataset (possibly with stock market data) will be needed. And probably one with "complex" multi-indices, or other kinds of data needed to illustrate some pandas functions.

In https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html the examples try to follow this approach. We can surely do better, and I think it's a good idea to do something like from pandas.io.samples import Countries instead of defining the dataset each time. But that is what makes sense to me.

Does that make sense, or am I missing a reason for having larger datasets?

@jorisvandenbossche
Member Author

My original question was mainly about the ability to load example dataframes, instead of always constructing them with code inside the docstring.
Whether these are then actual external datasets we include, or smaller 'made-up' ones, doesn't really matter to me. But I agree some smaller ones can be enough for basic functionality (still, if we make a small one on e.g. countries, I would do it with 'real' data). And for other cases we will need some more complex ones.

For really small example datasets, I still see some value in actually constructing them inside the docstring, just to make users familiar with the concept of "creating a small example dataset yourself to show functionality" (which is useful when they submit bug reports :), though I don't know how much effect this would have).

For countries, I use the following in my pandas-tutorial (I currently also use the titanic dataset a lot for small illustrations):

In [15]: data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    ...:         'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    ...:         'area': [30510, 671308, 357050, 41526, 244820],
    ...:         'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
    ...: countries = pd.DataFrame(data)
    ...: countries
Out[15]: 
     area    capital         country  population
0   30510   Brussels         Belgium        11.3
1  671308      Paris          France        64.3
2  357050     Berlin         Germany        81.3
3   41526  Amsterdam     Netherlands        16.9
4  244820     London  United Kingdom        64.9

a variation on this would indeed be nice.

@dukebody
Contributor

My two cents:

  • I agree on that having a set of standard datasets, or at least datasets with similar data, that we reuse across documentation can ease understanding. This is the case with a lot of ML/DS literature where authors reuse the same datasets (iris, mtcars, housing, etc.) to illustrate techniques and models.

  • I have no strong preference over importing vs. defining them directly on the docstring. However for small datasets I think we should always print them so the user understands the data that is being used, as shown in the last example from @jorisvandenbossche and in https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html.

  • I see an added benefit of defining the datasets in the docstring itself instead of importing them: people using older versions of pandas where these examples dataframes are not defined yet will be able to copy-paste the example. So for small datasets I'd favor defining them directly in docstrings, which is not a big effort anyhow and can be copy-pasted from other docs and adapted if necessary. In the case of big datasets I guess it can be OK to import them directly and print only their first lines.

@jorisvandenbossche
Member Author

I was just wondering: do we want to use "real world" data in the docstring examples in all cases? In many cases it certainly makes sense (e.g. pivot), but maybe not in all.

Let's consider the Series.mean docstring (which currently has no examples).
I think simply doing:

>>> s = pd.Series([1, 2, 3, 4])
>>> s.mean()
2.5

can be illustrative enough, while loading a realistic series/dataframe from pandas.io.samples or constructing one, can just give overhead that is not really needed.

@TomAugspurger
Contributor

TomAugspurger commented Mar 2, 2018 via email

@datapythonista
Member

Yes, that makes sense. I still think there is room for standardizing, but if that sounds good to you, I'll change the document to encourage people to use examples ASAP, and I will provide some examples to give an idea of the different cases (with missing values, with dates, ...).

So, probably makes more sense to add these [1, 2, 3, 4] as well as some suggestions for "real world" ones in the document, and forget about including any data in pandas?

@jorisvandenbossche
Member Author

jorisvandenbossche commented Mar 5, 2018

Found the original issue that already reported this (or at least the "realistic examples" part): #16709

Example given there was a groupby one:

In [13]: df2 = pd.DataFrame({'X' : ['B', 'B', 'A', 'A'], 'Y' : [1, 2, 3, 4]})

In [14]: df2.groupby(['X']).sum()
Out[14]: 
   Y
X   
A  7
B  3

might be less useful than this version:

In [13]: pets = pd.DataFrame({'animal' : ['dog', 'dog', 'cat', 'cat'], 'weight' : [10, 20, 8, 9]})

In [14]: pets.groupby(['animal']).mean()
Out[14]: 
        weight
animal        
cat        8.5
dog       15.0

@jorisvandenbossche
Member Author

Yes, that makes sense. I still think there is room for standardizing, but if that sounds good to you, I'll change the document to encourage people to use examples ASAP, and I will provide some examples to give an idea of the different cases (with missing values, with dates, ...).

@datapythonista sorry for the late answer. I agree it would still be good to reuse some typical datasets for those cases where it adds value (more complex examples, eg like pivot).

So the above sounds good! That's probably more realistic than getting the examples into a pandas.io.samples module.

So, probably makes more sense to add these [1, 2, 3, 4] as well as some suggestions for "real world" ones in the document, and forget about including any data in pandas?

I think it is still interesting to have them included in pandas. But let's start with adding some to the guide? (so they can already be used like that in the examples, and discuss including them later)

@colinmorris

As @jorisvandenbossche mentioned at the top, seaborn has a standard set of built-in example datasets that are used throughout the docs. I'd like to point to that as a success story that's worth emulating. Anecdotally, I've found it to be very effective for making examples of unfamiliar functions easier to grasp.

See, e.g. the examples section for the docs on factorplot. If instead they had used randomly generated values and labels like ['A', 'B', 'C'], ['foo', 'bar', 'baz'], those examples would be so much less useful.

@kidrahahjo

This should be closed now!
