DOC: develop a set of standard example DataFrames for use in docstring examples #19710
Comments
+1 to this. |
For the upcoming pandas docstring sprint this seems like a nice one to pick up. Any preference on datasets that should be added? iris, mtcars, chickweight ... others? |
Whatever datasets we add, we'll need to check to ensure that they have a
license that allows us to redistribute them.
And preferably they'd be small, or we can download and cache them as needed
if we don't want to include them in the distribution.
|
Personally, for the documentation samples I wouldn't have "real" datasets. We won't be showing more than 5 rows, so even something with 150 samples, like Iris, seems too much. To me it would make sense to have something like
I think everybody will quickly understand data about countries, or animals, or things like that, and I'd avoid something more specialized. I'd hardcode the data into the For things like time series, another dataset (possibly with stock market data) will be needed. And probably one with "complex" multi-indices, or other kinds of data needed to illustrate some pandas functions. In https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html the examples try to follow this approach. We can surely do better, and I think it's a good idea to do something like Does it make sense, or am I missing something for having larger datasets? |
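For illustration only, here is a rough sketch of the kind of small, hard-coded frame this comment describes. The snippet originally posted is not preserved in this thread, so the variable name, labels and values below are assumptions, loosely modelled on the animal data used in the reset_index examples.

```python
import pandas as pd

# Hypothetical example data: names and numbers are illustrative only,
# not the snippet originally posted in this comment.
animals = pd.DataFrame(
    {
        "class": ["bird", "bird", "mammal", "mammal"],
        "max_speed": [389.0, 24.0, 80.5, None],  # one missing value on purpose
    },
    index=["falcon", "parrot", "lion", "monkey"],
)

# Four rows fit comfortably in a docstring and are easy to verify by eye.
print(animals.reset_index())
```

A frame this small can be pasted directly into a docstring, and the single missing value lets the same data illustrate NaN handling.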
My original question was mainly about the ability to load example dataframes, instead of always constructing them with code inside the docstring. For really small example datasets, I still see some value in actually constructing them inside the docstring, just to make users familiar with the concept of "creating a small example dataset yourself to show functionality" (which is useful when they submit bug reports :), but I don't know how much effect this would have ..) For countries, I use the following in my pandas-tutorial (I currently also use the titanic dataset a lot for small illustrations):
a variation on this would indeed be nice. |
My two cents:
|
I was just wondering: do we want to use "real world" data in the docstring examples in all cases? In many cases it certainly makes sense (like e.g. pivot), but maybe not in all. Let's consider the Series.mean docstring (which currently has no examples). I think simply doing:
>>> s = pd.Series([1, 2, 3, 4])
>>> s.mean()
2.5
can be illustrative enough, while loading a realistic series/dataframe from pandas.io.samples or constructing one can just give overhead that is not really needed. |
Agreed with that. Even things like `factorize(['c', 'c', 'a', 'c'])` can be
easier to understand than on a "real world" example. It'll have to be
case-by-case.
|
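To make the factorize case above concrete, here is a minimal sketch. Wrapping the values in a Series instead of passing a bare list is a choice made here (an assumption about what newer pandas versions prefer), not something stated in the thread.

```python
import pandas as pd

# Tiny input: each label is mapped to an integer code in order of first appearance.
codes, uniques = pd.Series(["c", "c", "a", "c"]).factorize()

print(codes.tolist())  # [0, 0, 1, 0]
print(list(uniques))   # ['c', 'a']
```

With only four values, the mapping from labels to codes can be checked at a glance, which is the point made in the comment above.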
Yes, that makes sense. I still think there should be room for standardizing, but if that sounds good to you, I'll change the document to encourage people to use examples asap, and I will provide some examples to give an idea in different cases (with missing values, with dates,...). So, probably makes more sense to add these |
Found the original issue that already reported this (or at least the "realistic examples" part): #16709 Example given there was a groupby one:
might be less useful than this version:
|
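The two groupby snippets from #16709 are not preserved in this thread. As a sketch of the contrast being described, and with entirely made-up data and names, an abstract version and a more readable version of the same operation might look like this:

```python
import numpy as np
import pandas as pd

# Abstract version: placeholder labels and random numbers (seeded for repeatability).
rng = np.random.default_rng(0)
abstract = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "B": rng.standard_normal(4)}
)
print(abstract.groupby("A")["B"].mean())

# Same operation with hypothetical but meaningful labels and round numbers.
animals = pd.DataFrame(
    {
        "animal": ["falcon", "falcon", "parrot", "parrot"],
        "max_speed": [380.0, 370.0, 24.0, 26.0],
    }
)
speed_by_animal = animals.groupby("animal")["max_speed"].mean()
print(speed_by_animal)
```

In the second version a reader can confirm the group means (375.0 for falcon, 25.0 for parrot) without running anything.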
@datapythonista sorry for the late answer. I agree it would still be good to reuse some typical datasets for those cases where it adds value (more complex examples, eg like pivot). So the above sounds good! That's probably more realistic than getting the examples in a
I think it is still interesting to have them included in pandas. But let's start with adding some to the guide? (so they can already be used like that in the examples, and discuss including them later) |
As @jorisvandenbossche mentioned at the top, seaborn has a standard set of built-in example datasets that are used throughout the docs. I'd like to point to that as a success story that's worth emulating. Anecdotally, I've found it to be very effective for making examples of unfamiliar functions easier to grasp. See, e.g. the examples section for the docs on factorplot. If instead they had used randomly generated values and labels like |
This should be closed now! |
Related to #19704. I didn't find an existing open issue, only a discussion mentioning this in #16520 (@datapythonista it was actually you then! I didn't realize that :-))
I think it would be good to have a set of standard DataFrames that we reuse throughout our docs (to start with in the docstrings, but we could actually also use a standardized set for the user guide):
I don't think there will be "one example dataframe to rule them all", but it would be nice to have a set of them that can cover most of the use cases.
So we can post some ideas here and discuss them, trying to get to a list.
Side question is whether we want to always define them with code in the docstring, or want to have some example data loading capabilities (e.g. like seaborn, whose examples always start with
iris = sns.load_dataset("iris")
or another dataset). It can also be a mixture of both of course.
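As a sketch of what the "data loading" option could look like if the frames were hard-coded rather than downloaded: the function name `load_example`, the registry, and the dataset name below are hypothetical and are not part of pandas. The figures are approximate and only for illustration.

```python
import pandas as pd


def _countries() -> pd.DataFrame:
    # Approximate, illustrative figures; population is in millions.
    return pd.DataFrame(
        {
            "country": ["Belgium", "France", "Germany"],
            "capital": ["Brussels", "Paris", "Berlin"],
            "population": [11.3, 64.3, 81.3],
        }
    )


# Hypothetical registry mapping a dataset name to a function that builds it.
_EXAMPLES = {"countries": _countries}


def load_example(name: str) -> pd.DataFrame:
    """Return a fresh copy of a small, hard-coded example DataFrame."""
    try:
        builder = _EXAMPLES[name]
    except KeyError:
        raise ValueError(f"unknown example dataset: {name!r}") from None
    return builder()


countries = load_example("countries")
print(countries)
```

Building the frame on each call means no network access, no licensing question for downloaded files, and no risk of one example mutating data seen by another.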