
DOC: develop a set of standard example DataFrames for use in docstring examples #19710

Open
jorisvandenbossche opened this issue Feb 15, 2018 · 13 comments
Labels
Docs Needs Discussion Requires discussion from core team before further action

Comments

@jorisvandenbossche
Member

Related to #19704. I didn't find an existing open issue, only a discussion mentioning this in #16520 (@datapythonista it was actually you then! I didn't realize that :-))

I think it would be good to have a set of standard DataFrames that we reuse throughout our docs (to start with in the docstrings, but we could actually also use a standardized set for the user guide):

  • Some small, more "realistic" dataframes would make examples easier to reason about than dummy random data, and add familiarity when reading multiple docstrings
  • Makes it easier for contributors to add examples to the docstring as they don't have to invent their own data each time

I don't think there will be "one example dataframe to rule them all", but it would be nice to have a set of them that can cover most of the use cases.
So we can post some ideas here and discuss them, trying to get to a list.

A side question is whether we want to always define them with code in the docstring, or want to have some example data loading capabilities (e.g. like seaborn, where examples always start with iris = sns.load_dataset("iris") or another dataset). It can also be a mixture of both, of course.
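As a rough illustration of the seaborn-style loading mentioned above, such a helper could look something like the sketch below. Note that `load_dataset` and the hard-coded `_SAMPLES` dict are hypothetical names for illustration only, not existing pandas API; the country values are the ones used later in this thread.

```python
import pandas as pd

# Hypothetical hard-coded sample data; not part of pandas itself.
_SAMPLES = {
    "countries": {
        "country": ["Belgium", "France", "Germany"],
        "population": [11.3, 64.3, 81.3],
    },
}

def load_dataset(name):
    """Return a fresh DataFrame for a named built-in sample,
    mimicking seaborn's sns.load_dataset() interface."""
    return pd.DataFrame(_SAMPLES[name])

countries = load_dataset("countries")
```

Returning a newly constructed DataFrame on each call (rather than a shared module-level object) avoids examples accidentally mutating each other's data.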

@TomAugspurger
Contributor

or want to have some example data loading capabilities

+1 to this.

@koaning

koaning commented Feb 20, 2018

For the upcoming pandas docstring sprint this seems like a nice one to pick up. Any preference on datasets that should be added?

iris, mtcars, chickweight ... others?

@TomAugspurger
Contributor

TomAugspurger commented Feb 20, 2018 via email

@datapythonista
Member

Personally, for the documentation samples I wouldn't use "real" datasets. We won't be showing more than 5 rows, so even something with 150 rows, like iris, seems too much.

To me it would make sense to have something like pandas.io.samples.Countries or pandas.io.samples.Animals with a mix of types, for example:

  • Country name: object
  • Continent: category
  • Population: int
  • GDP: float

I think everybody will quickly understand data about countries, or animals, or things like that, and I'd avoid something more specialized. I'd hardcode the data into the samples.py, and simply have something like 20 rows.
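A Countries sample with that mix of dtypes could be built from hard-coded data along these lines. This is only a sketch: the GDP figures are placeholders, and the population counts are scaled up from the millions used elsewhere in this thread.

```python
import pandas as pd

# Sketch of a hard-coded sample with the proposed dtype mix.
# GDP values are placeholders, not real data.
countries = pd.DataFrame({
    "country": ["Belgium", "France", "Germany", "Netherlands"],       # object
    "continent": pd.Categorical(["Europe"] * 4),                      # category
    "population": [11_300_000, 64_300_000, 81_300_000, 16_900_000],   # int
    "gdp": [494.7, 2582.5, 3693.2, 826.2],                            # float
})
```

Wrapping the continent column in `pd.Categorical` is what gives the frame a genuine category dtype, so docstrings that need categorical behaviour can reuse the same sample.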

For things like time series, another dataset (possibly with stock market data) will be needed. And probably one with "complex" multi-indices, or other kinds of data needed to illustrate some pandas functions.

In https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html the examples try to follow this approach. We can surely do better, and I think it's a good idea to do something like from pandas.io.samples import Countries instead of defining the dataset each time. But that is what makes sense to me.

Does that make sense, or am I missing a reason for having larger datasets?

@jorisvandenbossche
Member Author

My original question was mainly about the ability to load example dataframes, instead of always constructing them with code inside the docstring.
Whether these are then actual external datasets we include, or smaller 'made-up' ones, doesn't really matter to me. But I agree some smaller ones can be enough for basic functionality (still, if we make a small one on e.g. countries, I would do it with 'real' data). And for other cases we will need some more complex ones.

For really small example datasets, I still see some value in actually constructing them inside the docstring, just to make users familiar with the concept of "creating a small example dataset yourself to show functionality" (which is useful when they submit bug reports :), though I don't know how much effect this would have).

For countries, I use the following in my pandas-tutorial (I currently also use the titanic dataset a lot for small illustrations):

In [15]: data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    ...:         'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    ...:         'area': [30510, 671308, 357050, 41526, 244820],
    ...:         'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
    ...: countries = pd.DataFrame(data)
    ...: countries
Out[15]: 
     area    capital         country  population
0   30510   Brussels         Belgium        11.3
1  671308      Paris          France        64.3
2  357050     Berlin         Germany        81.3
3   41526  Amsterdam     Netherlands        16.9
4  244820     London  United Kingdom        64.9

a variation on this would indeed be nice.

@dukebody
Contributor

My two cents:

  • I agree on that having a set of standard datasets, or at least datasets with similar data, that we reuse across documentation can ease understanding. This is the case with a lot of ML/DS literature where authors reuse the same datasets (iris, mtcars, housing, etc.) to illustrate techniques and models.

  • I have no strong preference over importing vs. defining them directly on the docstring. However for small datasets I think we should always print them so the user understands the data that is being used, as shown in the last example from @jorisvandenbossche and in https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html.

  • I see an added benefit of defining the datasets in the docstring itself instead of importing them: people using older versions of pandas where these examples dataframes are not defined yet will be able to copy-paste the example. So for small datasets I'd favor defining them directly in docstrings, which is not a big effort anyhow and can be copy-pasted from other docs and adapted if necessary. In the case of big datasets I guess it can be OK to import them directly and print only their first lines.

@jorisvandenbossche
Member Author

I was just wondering: do we want to use "real world" data in the docstring examples in all cases? In many cases it certainly makes sense (e.g. pivot), but maybe not in all.

Let's consider the Series.mean docstring (which currently has no examples).
I think simply doing:

>>> s = pd.Series([1, 2, 3, 4])
>>> s.mean()
2.5

can be illustrative enough, while loading a realistic series/dataframe from pandas.io.samples or constructing one, can just give overhead that is not really needed.

@TomAugspurger
Contributor

TomAugspurger commented Mar 2, 2018 via email

@datapythonista
Member

Yes, that makes sense. I still think there is room for standardizing, but if that sounds good to you, I'll change the document to encourage people to use examples ASAP, and I will provide some examples to give an idea of the different cases (with missing values, with dates, ...).

So, probably makes more sense to add these [1, 2, 3, 4] as well as some suggestions for "real world" ones in the document, and forget about including any data in pandas?

@jorisvandenbossche
Member Author

jorisvandenbossche commented Mar 5, 2018

Found the original issue that already reported this (or at least the "realistic examples" part): #16709

Example given there was a groupby one:

In [13]: df2 = pd.DataFrame({'X' : ['B', 'B', 'A', 'A'], 'Y' : [1, 2, 3, 4]})

In [14]: df2.groupby(['X']).sum()
Out[14]: 
   Y
X   
A  7
B  3

might be less useful than this version:

In [13]: pets = pd.DataFrame({'animal' : ['dog', 'dog', 'cat', 'cat'], 'weight' : [10, 20, 8, 9]})

In [14]: pets.groupby(['animal']).mean()
Out[14]: 
        weight
animal        
cat        8.5
dog       15.0

@jorisvandenbossche
Member Author

Yes, that makes sense. I still think there is room for standardizing, but if that sounds good to you, I'll change the document to encourage people to use examples ASAP, and I will provide some examples to give an idea of the different cases (with missing values, with dates, ...).

@datapythonista sorry for the late answer. I agree it would still be good to reuse some typical datasets for those cases where it adds value (more complex examples, eg like pivot).

So the above sounds good! That's probably more realistic than getting the examples into a pandas.io.samples module.

So, probably makes more sense to add these [1, 2, 3, 4] as well as some suggestions for "real world" ones in the document, and forget about including any data in pandas?

I think it is still interesting to have them included in pandas. But let's start with adding some to the guide? (so they can already be used like that in the examples, and discuss including them later)

@colinmorris

As @jorisvandenbossche mentioned at the top, seaborn has a standard set of built-in example datasets that are used throughout the docs. I'd like to point to that as a success story that's worth emulating. Anecdotally, I've found it to be very effective for making examples of unfamiliar functions easier to grasp.

See, e.g. the examples section for the docs on factorplot. If instead they had used randomly generated values and labels like ['A', 'B', 'C'], ['foo', 'bar', 'baz'], those examples would be so much less useful.

@kidrahahjo

This should be closed now!
