Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Updated aggregate docstring #35042

Merged
merged 14 commits into from
Aug 24, 2020
3 changes: 3 additions & 0 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -5076,6 +5076,9 @@ def pipe(self, func, *args, **kwargs):
-----
`agg` is an alias for `aggregate`. Use the alias.

Pandas operations generally exclude NaNs. For example, ``agg(np.nanmedian)``,
``agg(np.median)``, and ``agg('median')`` will return the same result.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gurukiran07, but I don't find this comment very clear. I don't think we want to talk about what pandas operations generally do in this docstring. I guess you mean that aggregate ignores NaN values, even if the function argument does not. Just say that, with a code example if you want to be clearer. But I don't expect readers of this docstring to know about numpy.nanmedian and numpy.median to use them as an example, after a vague sentence about pandas operations in general.

Can you rephrase please?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@datapythonista what they want to say, I think, is that np.nanmedian and np.median will map to the same internal pandas operation, and, as it says in 10 minutes to pandas:

Operations in general exclude missing data

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the comment in the PR is very clear and useful to users. We can surely provide information on what is being used in the notes section, but after reading the proposed comments I feel more confused than before reading it. So, it would be great to rephrase it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was in a previous commit

        - Some NumPy functions resolve to their corresponding internal Cython function.
          As pandas operations generally exclude NaNs, for example `.agg(np.nanmedian)`,
          `.agg(np.median)`, and `.agg('median') will return the same result.

do you think it's clearer / more useful?

@gurukiran07 is this still active? Do you want to try rephrasing?

Copy link
Contributor Author

@gurukiran07 gurukiran07 Aug 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gurukiran07 is this still active? Do you want to try rephrasing?

@MarcoGorelli Yes, I'm active. Sorry for my inactiveness. I can try rephrasing it but if you have something in mind, please free to take over.

Can we put it like this @datapythonista @MarcoGorelli

Pandas operations, in general, exclude missing NaNs.
For example, mean of Series [1, 2, np.nan] would be 1.5

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the average pandas user knowns anything about Python, so I would exclude that part.

Also, as I said earlier, in the context of this docstring I don't think it's relevant what pandas operations generally do.

I think the point here is to say that THIS pandas operation (agg) will exclude NaN before computing the reduction. So, all the examples listed are equivalent, since the reduction won't be applied to the missing values, and the user doesn't need to bother about them. I think both paragraphs are phrased in a way that it becomes confusing to understand the point in the context of this docstrings. pandas operations generally exclude NaNs is not even saying that this operation is excluding NaNs. And I wouldn't expect users to know the difference between numpy.nanmedian and numpy.median, as I said before, so for the example to be useful that should be explained. Like in numpy, there are two operations to control the impact of NaNs in the result, nanmedia meaning that [...]. In pandas, agg, as most operations just ignore the missing values, and return the operation only considering the values that are present.

IMHO, something like this will let a user really understand what's the point being made here. While to current comment is even difficult to understand for an experienced pandas user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@datapythonista

In pandas, agg, as most operations just ignore the missing values, and return the operation only considering the values that are present.

IMO, this explains the point very well to both new and experienced users while conveying that agg excludes missing data. I feel mentioning something about NumPy operations might confuse new users.

If we agree, I can commit this line:
In pandas, agg, as most operations just ignore the missing values, and return the operation only considering the values that are present.

Since I did not come up with this originally. Please feel free to take over and commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK yeah, that's clearer, thanks for your input

@gurukiran07 go ahead, I'd just make sure plurality matches, so

In pandas, agg, as most operations just ignores the missing values, and returns the operation only considering the values that are present.

Copy link
Member

@MarcoGorelli MarcoGorelli Aug 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to drag this out but re-reading this the structure of sentence seems strange.

In pandas, agg, as most operations just ignore the missing values, returns the operation only considering the values that are present

@datapythonista does this reflect what you wanted to say?


A passed user-defined-function will be passed a Series for evaluation.
{examples}"""
)
Expand Down