DOC: Updated aggregate docstring #35042

gurukiran07 · 2020-06-28T14:44:20Z

Link to current Series.agg: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.agg.html

Original question asked in gitter:
Does pd.Series.agg with the func parameter set to 'median' use np.nanmedian? (The same question applies to mean, mode, etc., not only median.)

import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1, 1, 1])
s.agg('median')
# 1.0
s.agg(np.median)
# 1.0
np.median(s.to_numpy())
# nan
np.nanmedian(s.to_numpy())
# 1.0

Whenever NaN is present, does it fall back to using np.nanmedian?

Reply from @MarcoGorelli
if you look at pandas/core/base.py you'll see

    np.median: "median",
    np.nanmedian: "median",

in _cython_table. So, both resolve to the same internal cython function.
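To make the reply above concrete, here is a minimal sketch of how a lookup table like this can send two different NumPy functions to the same pandas method. The names `FUNC_TO_NAME` and `resolve` are hypothetical and only model the two entries quoted above; this is not pandas' actual implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical table modelled on the two quoted entries (not pandas' real code).
FUNC_TO_NAME = {
    np.median: "median",
    np.nanmedian: "median",
}


def resolve(func):
    """Return the method name for a known NumPy function, else the input unchanged."""
    return FUNC_TO_NAME.get(func, func)


s = pd.Series([np.nan, np.nan, 1, 1, 1])

name_a = resolve(np.median)     # both functions map to the same name
name_b = resolve(np.nanmedian)

# Calling the Series method by name skips NaN by default.
result = getattr(s, name_a)()
```

Under this sketch, the NaN handling comes from the pandas method that the name points to, not from the NumPy function that was passed in.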

IMO it would be better to mention this in the Notes section of the docs, for example:
Some NumPy functions such as np.mean, np.nanmean, np.median etc. resolve to their corresponding internal cython function.

Output of `python validate_docstrings.py pandas.Series.agg`

################################################################################
######################## Docstring (pandas.Series.agg)  ########################
################################################################################

Aggregate using one or more operations over the specified axis.

.. versionadded:: 0.20.0

Parameters
----------
func : function, str, list or dict
    Function to use for aggregating the data. If a function, must either
    work when passed a Series or when passed to Series.apply.

    Accepted combinations are:

    - function
    - string function name
    - list of functions and/or function names, e.g. ``[np.sum, 'mean']``
    - dict of axis labels -> functions, function names or list of such.
axis : {0 or 'index'}
        Parameter needed for compatibility with DataFrame.
*args
    Positional arguments to pass to `func`.
**kwargs
    Keyword arguments to pass to `func`.

Returns
-------
scalar, Series or DataFrame

    The return can be:

    * scalar : when Series.agg is called with single function
    * Series : when DataFrame.agg is called with a single function
    * DataFrame : when DataFrame.agg is called with several functions

    Return scalar, Series or DataFrame.

See Also
--------
Series.apply : Invoke function on a Series.
Series.transform : Transform function producing a Series with like indexes.

Notes
-----
`agg` is an alias for `aggregate`. Use the alias.
Some NumPy functions such as `np.mean`, `np.nanmean`, `np.median` etc.
resolve to their corresponding internal cython function.

A passed user-defined-function will be passed a Series for evaluation.

Examples
--------
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.agg('min')
1

>>> s.agg(['min', 'max'])
min   1
max   4
dtype: int64

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.Series.agg" correct. :)

pep8speaks · 2020-06-28T14:44:24Z

Hello @gurukiran07! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-24 13:26:17 UTC

MarcoGorelli · 2020-07-21T10:22:28Z

@gurukiran07 why did you close this, and why do you say "I guess this is fixed now"?

gurukiran07 · 2020-07-21T10:57:25Z

@MarcoGorelli

The aggregation operations are always performed over an axis, either the
index (default) or the column axis. This behavior is different from
numpy aggregation functions (mean, median, prod, sum, std,
var), where the default is to compute the aggregation of the flattened
array, e.g., numpy.mean(arr_2d) as opposed to
numpy.mean(arr_2d, axis=0).

This was mentioned in the docs. I thought it would cover the NaN behaviour, but it doesn't look like it does. Reopening the PR.
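As a small illustration of the axis behaviour described in the docs passage quoted above (and which, as noted, says nothing about NaNs), the sketch below compares NumPy's flattened default with pandas' per-column default. The array values and column names are made up for the example.

```python
import numpy as np
import pandas as pd

arr_2d = np.array([[1.0, 2.0], [3.0, 4.0]])
df = pd.DataFrame(arr_2d, columns=["a", "b"])

# NumPy default: aggregate over the flattened array.
flat_mean = np.mean(arr_2d)

# NumPy with an explicit axis: one value per column.
col_mean_np = np.mean(arr_2d, axis=0)

# pandas default: aggregate over the index, giving one value per column.
col_mean_pd = df.mean()
```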

gurukiran07 · 2020-07-21T14:23:08Z

@datapythonista @MarcoGorelli It's green now, all tests passed. If there are any changes to be made let me know.

MarcoGorelli

Minor comment, other than that looks good to me

pandas/core/generic.py

datapythonista · 2020-07-24T00:48:05Z

pandas/core/generic.py

@@ -5076,6 +5076,9 @@ def pipe(self, func, *args, **kwargs):
        -----
        `agg` is an alias for `aggregate`. Use the alias.

+        Pandas operations generally exclude NaNs. For example, ``agg(np.nanmedian)``,
+        ``agg(np.median)``, and ``agg('median')`` will return the same result.


Thanks @gurukiran07, but I don't find this comment very clear. I don't think we want to talk about what pandas operations generally do in this docstring. I guess you mean that aggregate ignores NaN values, even if the function argument does not. Just say that, with a code example if you want to be clearer. But I don't expect readers of this docstring to know about numpy.nanmedian and numpy.median to use them as an example, after a vague sentence about pandas operations in general.

Can you rephrase please?

@datapythonista what they want to say, I think, is that np.nanmedian and np.median will map to the same internal pandas operation, and, as it says in 10 minutes to pandas:

Operations in general exclude missing data

I don't think the comment in the PR is very clear and useful to users. We can surely provide information on what is being used in the notes section, but after reading the proposed comments I feel more confused than before reading it. So, it would be great to rephrase it.

This was in a previous commit

- Some NumPy functions resolve to their corresponding internal Cython function. As pandas operations generally exclude NaNs, for example `.agg(np.nanmedian)`, `.agg(np.median)`, and `.agg('median')` will return the same result.

do you think it's clearer / more useful?

@gurukiran07 is this still active? Do you want to try rephrasing?

> @gurukiran07 is this still active? Do you want to try rephrasing?

@MarcoGorelli Yes, I'm active. Sorry for the inactivity. I can try rephrasing it, but if you have something in mind, please feel free to take over.

Can we put it like this, @datapythonista @MarcoGorelli?

Pandas operations, in general, exclude missing NaNs. For example, mean of Series [1, 2, np.nan] would be 1.5

I don't think the average pandas user knows anything about Cython, so I would exclude that part.

Also, as I said earlier, in the context of this docstring I don't think it's relevant what pandas operations generally do.

I think the point here is to say that THIS pandas operation (agg) will exclude NaN before computing the reduction. So all the examples listed are equivalent, since the reduction won't be applied to the missing values, and the user doesn't need to worry about them. I think both paragraphs are phrased in a way that makes the point hard to understand in the context of this docstring. "pandas operations generally exclude NaNs" doesn't even say that this operation excludes NaNs. And, as I said before, I wouldn't expect users to know the difference between numpy.nanmedian and numpy.median, so for the example to be useful that should be explained. Something like: in numpy, there are two operations to control the impact of NaNs in the result, nanmedian meaning that [...]. In pandas, agg, as most operations just ignore the missing values, and return the operation only considering the values that are present.

IMHO, something like this will let a user really understand the point being made here, while the current comment is difficult to understand even for an experienced pandas user.
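To spell out the difference this comment asks to have explained, here is a minimal sketch using the `[1, 2, np.nan]` Series proposed earlier in the thread. It uses only string function names with `agg`, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan])
values = s.to_numpy()

# NumPy has two functions: np.median propagates NaN, np.nanmedian ignores it.
np_plain = np.median(values)
np_nan_aware = np.nanmedian(values)

# pandas agg skips the missing value and reduces over the values that are present.
pd_median = s.agg("median")
pd_mean = s.agg("mean")
```

Here `np_plain` is NaN, while the other three results are computed from the two non-missing values only.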

@datapythonista

> In pandas, agg, as most operations just ignore the missing values, and return the operation only considering the values that are present.

IMO, this explains the point very well to both new and experienced users while conveying that agg excludes missing data. I feel mentioning something about NumPy operations might confuse new users.

If we agree, I can commit this line:
In pandas, agg, as most operations just ignore the missing values, and return the operation only considering the values that are present.

Since I did not come up with this originally, please feel free to take over and commit.

OK yeah, that's clearer, thanks for your input

@gurukiran07 go ahead, I'd just make sure plurality matches, so

In pandas, agg, as most operations just ignores the missing values, and returns the operation only considering the values that are present.

Sorry to drag this out but re-reading this the structure of sentence seems strange.

In pandas, agg, as most operations just ignore the missing values, returns the operation only considering the values that are present

@datapythonista does this reflect what you wanted to say?

datapythonista

Thanks @gurukiran07, looks good to me.

gurukiran07 · 2020-08-24T13:26:07Z

> Thanks @gurukiran07, looks good to me.

@datapythonista Guess, I need to retrigger the checks. Will ping once green.

gurukiran07 · 2020-08-24T14:07:30Z

@datapythonista All checks passed, it's green now.

MarcoGorelli · 2020-08-24T15:34:48Z

Cool, merging then following Marc's approval, thanks @gurukiran07

* DOC: Updated aggregate docstring
* Doc: updated aggregate docstring
* Update pandas/core/generic.py

  Co-authored-by: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
* Update generic.py
* Update generic.py
* Revert "Update generic.py"

  This reverts commit 15ecaf7.
* Revert "Revert "Update generic.py""

  This reverts commit cc231c8.
* Updated docstring of agg
* Trailing whitespace removed
* DOC: Updated docstring of agg
* Update generic.py
* Updated Docstring

Co-authored-by: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com>

DOC: Updated aggregate docstring

b7ee7a6

simonjayhawkins added the Docs label Jun 28, 2020

Doc: updated aggregate docstring

beee950

MarcoGorelli reviewed Jun 28, 2020

gurukiran07 and others added 2 commits June 29, 2020 13:48

Update pandas/core/generic.py

fff4b7e

Co-authored-by: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com>

Update generic.py

26add4a

MarcoGorelli reviewed Jun 29, 2020

gurukiran07 and others added 2 commits July 21, 2020 15:07

Merge remote-tracking branch 'upstream/master'

d50bbc9

Update generic.py

15ecaf7

gurukiran07 closed this Jul 21, 2020

gurukiran07 added 2 commits July 21, 2020 16:03

Revert "Update generic.py"

cc231c8

This reverts commit 15ecaf7.

Revert "Revert "Update generic.py""

3a52d99

This reverts commit cc231c8.

gurukiran07 reopened this Jul 21, 2020

gurukiran07 added 3 commits July 21, 2020 16:40

Updated docstring of agg

f769179

Trailing whitespace removed

8813780

DOC: Updated docstring of agg

1fc0949

gurukiran07 requested a review from MarcoGorelli July 21, 2020 12:49

Merge remote-tracking branch 'upstream/master'

f4dc10d

MarcoGorelli requested changes Jul 22, 2020

Update generic.py

0927b61

gurukiran07 closed this Jul 22, 2020

gurukiran07 reopened this Jul 22, 2020

simonjayhawkins added this to the 1.2 milestone Jul 23, 2020

datapythonista reviewed Jul 24, 2020

simonjayhawkins removed this from the 1.2 milestone Jul 29, 2020

Updated Docstring

fcb5496

gurukiran07 requested a review from MarcoGorelli August 22, 2020 06:39

datapythonista approved these changes Aug 24, 2020

gurukiran07 closed this Aug 24, 2020

gurukiran07 reopened this Aug 24, 2020

simonjayhawkins added this to the 1.2 milestone Aug 24, 2020

MarcoGorelli merged commit 0798380 into pandas-dev:master Aug 24, 2020
