Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Updated aggregate docstring #35042

Merged
merged 14 commits into from
Aug 24, 2020
Merged

DOC: Updated aggregate docstring #35042

merged 14 commits into from
Aug 24, 2020

Conversation

gurukiran07
Copy link
Contributor

@gurukiran07 gurukiran07 commented Jun 28, 2020

Link to current Series.agg: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.agg.html

Original question asked in gitter:
Does pd.Series.agg with func parameter set to 'median' uses np.nanmedian(Not only median including mean, mode)?

s= pd.Series([np.nan, np.nan,1,1,1])
s.agg('median') 
# 1
s.agg(np.median)
# 1
np.median(s.to_numpy())
# nan
np.nanmedian(s.to_numpy())
# 1

Whenever NaN is present does it fallback to using np.nanmedian?


Reply from @MarcoGorelli
if you look at pandas/core/base.py you'll see

    np.median: "median",
    np.nanmedian: "median",

in _cython_table. So, both resolve to the same internal cython function.


IMO it's better to mention this in the docs under Note: section.

Under note section saying:
Some NumPy functions such as np.mean, np.nanmean, np.median etc. resolve to their corresponding internal cython function.

Output of python validate_docstrings.py pandas.Series.agg

################################################################################
######################## Docstring (pandas.Series.agg)  ########################
################################################################################

Aggregate using one or more operations over the specified axis.

.. versionadded:: 0.20.0

Parameters
----------
func : function, str, list or dict
    Function to use for aggregating the data. If a function, must either
    work when passed a Series or when passed to Series.apply.

    Accepted combinations are:

    - function
    - string function name
    - list of functions and/or function names, e.g. ``[np.sum, 'mean']``
    - dict of axis labels -> functions, function names or list of such.
axis : {0 or 'index'}
        Parameter needed for compatibility with DataFrame.
*args
    Positional arguments to pass to `func`.
**kwargs
    Keyword arguments to pass to `func`.

Returns
-------
scalar, Series or DataFrame

    The return can be:

    * scalar : when Series.agg is called with single function
    * Series : when DataFrame.agg is called with a single function
    * DataFrame : when DataFrame.agg is called with several functions

    Return scalar, Series or DataFrame.

See Also
--------
Series.apply : Invoke function on a Series.
Series.transform : Transform function producing a Series with like indexes.

Notes
-----
`agg` is an alias for `aggregate`. Use the alias.
Some NumPy functions such as `np.mean`, `np.nanmean`, `np.median` etc.
resolve to their corresponding internal cython function.

A passed user-defined-function will be passed a Series for evaluation.

Examples
--------
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64

>>> s.agg('min')
1

>>> s.agg(['min', 'max'])
min   1
max   4
dtype: int64

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.Series.agg" correct. :)

@pep8speaks
Copy link

pep8speaks commented Jun 28, 2020

Hello @gurukiran07! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-24 13:26:17 UTC

gurukiran07 and others added 2 commits June 29, 2020 13:48
Co-authored-by: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
@MarcoGorelli
Copy link
Member

@gurukiran07 why did you close this, and why do you say "I guess this is fixed now"?

@gurukiran07
Copy link
Contributor Author

@MarcoGorelli

The aggregation operations are always performed over an axis, either the
index (default) or the column axis. This behavior is different from
numpy aggregation functions (mean, median, prod, sum, std,
var), where the default is to compute the aggregation of the flattened
array, e.g., numpy.mean(arr_2d) as opposed to
numpy.mean(arr_2d, axis=0).

This was mentioned in the docs. I thought this would cover NaNs part, but this doesn't look like it. Reopening the PR.

@gurukiran07 gurukiran07 reopened this Jul 21, 2020
@gurukiran07
Copy link
Contributor Author

@datapythonista @MarcoGorelli It's green now, all tests passed. If there are any changes to be made let me know.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment, other than that looks good to me

pandas/core/generic.py Outdated Show resolved Hide resolved
@gurukiran07 gurukiran07 reopened this Jul 22, 2020
@simonjayhawkins simonjayhawkins added this to the 1.2 milestone Jul 23, 2020
@@ -5076,6 +5076,9 @@ def pipe(self, func, *args, **kwargs):
-----
`agg` is an alias for `aggregate`. Use the alias.

Pandas operations generally exclude NaNs. For example, ``agg(np.nanmedian)``,
``agg(np.median)``, and ``agg('median')`` will return the same result.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gurukiran07, but I don't find this comment very clear. I don't think we want to talk about what pandas operations generally do in this docstring. I guess you mean that aggregate ignores NaN values, even if the function argument does not. Just say that, with a code example if you want to be clearer. But I don't expect readers of this docstring to know about numpy.nanmedian and numpy.median to use them as an example, after a vague sentence about pandas operations in general.

Can you rephrase please?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@datapythonista what they want to say, I think, is that np.nanmedian and np.median will map to the same internal pandas operation, and, as it says in 10 minutes to pandas:

Operations in general exclude missing data

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the comment in the PR is very clear and useful to users. We can surely provide information on what is being used in the notes section, but after reading the proposed comments I feel more confused than before reading it. So, it would be great to rephrase it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was in a previous commit

        - Some NumPy functions resolve to their corresponding internal Cython function.
          As pandas operations generally exclude NaNs, for example `.agg(np.nanmedian)`,
          `.agg(np.median)`, and `.agg('median') will return the same result.

do you think it's clearer / more useful?

@gurukiran07 is this still active? Do you want to try rephrasing?

Copy link
Contributor Author

@gurukiran07 gurukiran07 Aug 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gurukiran07 is this still active? Do you want to try rephrasing?

@MarcoGorelli Yes, I'm active. Sorry for my inactiveness. I can try rephrasing it but if you have something in mind, please free to take over.

Can we put it like this @datapythonista @MarcoGorelli

Pandas operations, in general, exclude missing NaNs.
For example, mean of Series [1, 2, np.nan] would be 1.5

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the average pandas user knowns anything about Python, so I would exclude that part.

Also, as I said earlier, in the context of this docstring I don't think it's relevant what pandas operations generally do.

I think the point here is to say that THIS pandas operation (agg) will exclude NaN before computing the reduction. So, all the examples listed are equivalent, since the reduction won't be applied to the missing values, and the user doesn't need to bother about them. I think both paragraphs are phrased in a way that it becomes confusing to understand the point in the context of this docstrings. pandas operations generally exclude NaNs is not even saying that this operation is excluding NaNs. And I wouldn't expect users to know the difference between numpy.nanmedian and numpy.median, as I said before, so for the example to be useful that should be explained. Like in numpy, there are two operations to control the impact of NaNs in the result, nanmedia meaning that [...]. In pandas, agg, as most operations just ignore the missing values, and return the operation only considering the values that are present.

IMHO, something like this will let a user really understand what's the point being made here. While to current comment is even difficult to understand for an experienced pandas user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@datapythonista

In pandas, agg, as most operations just ignore the missing values, and return the operation only considering the values that are present.

IMO, this explains the point very well to both new and experienced users while conveying that agg excludes missing data. I feel mentioning something about NumPy operations might confuse new users.

If we agree, I can commit this line:
In pandas, agg, as most operations just ignore the missing values, and return the operation only considering the values that are present.

Since I did not come up with this originally. Please feel free to take over and commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK yeah, that's clearer, thanks for your input

@gurukiran07 go ahead, I'd just make sure plurality matches, so

In pandas, agg, as most operations just ignores the missing values, and returns the operation only considering the values that are present.

Copy link
Member

@MarcoGorelli MarcoGorelli Aug 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to drag this out but re-reading this the structure of sentence seems strange.

In pandas, agg, as most operations just ignore the missing values, returns the operation only considering the values that are present

@datapythonista does this reflect what you wanted to say?

@simonjayhawkins simonjayhawkins removed this from the 1.2 milestone Jul 29, 2020
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gurukiran07, looks good to me.

@gurukiran07
Copy link
Contributor Author

Thanks @gurukiran07, looks good to me.

@datapythonista Guess, I need to retrigger the checks. Will ping once green.

@gurukiran07 gurukiran07 reopened this Aug 24, 2020
@gurukiran07
Copy link
Contributor Author

@datapythonista All checks passed, it's green now.

@simonjayhawkins simonjayhawkins added this to the 1.2 milestone Aug 24, 2020
@MarcoGorelli
Copy link
Member

Cool, merging then following Marc's approval, thanks @gurukiran07

@MarcoGorelli MarcoGorelli merged commit 0798380 into pandas-dev:master Aug 24, 2020
jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Aug 25, 2020
* DOC: Updated aggregate docstring

* Doc: updated aggregate docstring

* Update pandas/core/generic.py

Co-authored-by: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com>

* Update generic.py

* Update generic.py

* Revert "Update generic.py"

This reverts commit 15ecaf7.

* Revert "Revert "Update generic.py""

This reverts commit cc231c8.

* Updated docstring of agg

* Trailing whitespace removed

* DOC: Updated docstring of agg

* Update generic.py

* Updated Docstring

Co-authored-by: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants