-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: Fix pandas.Series.resample docstring #23197
DOC: Fix pandas.Series.resample docstring #23197
Conversation
* First iteration over some validation errors on pandas.Series.resample. * Fix multi-line examples issue.
Not too sure they were deprecated from v0.18.0. Need to point this out in the PR.
Hello @jmrr! Thanks for submitting the PR.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments.
pandas/core/generic.py
Outdated
to the on or level keyword. | ||
series. Object must have a datetime-like index (``DatetimeIndex``, | ||
``PeriodIndex``, or ``TimedeltaIndex``), or pass datetime-like values | ||
to the ``on`` or ``level`` keyword. | ||
|
||
Parameters | ||
---------- | ||
rule : string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use the Python types (str
instead of string
...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done. The module is full of string
by the way. Should I create another issue for that? Do these values str
, int
get picked up by Sphinx?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to fix one whole docstring at a time, than fixing a problem in all docstrings.
Sphinx doesn't care about the types, but we try to be consistent, and will start validating the types soon (not sure if validate_docstrings.py
already conmplains about these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, atomic improvements around each docstring make total sense. Pushing the amendments now.
pandas/core/generic.py
Outdated
|
||
.. deprecated:: 0.18.0 | ||
The new syntax is ``.resample(...).mean()``, or | ||
``.resample(...).apply(<func>)`` | ||
axis : int, optional, default 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check other docstrings to see how the axis
type is usually documented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another example is:
apply to each column (``axis=0`` or ``'index'``)
or to each row (``axis=1`` or ``'columns'``) or
The problem is that for Series
the axis is that along which the operation is computed, which in this case it's always along the rows (aka columnwise).
Maybe a more verbose but informative doc would be
Which axis to use for up- or down-sampling. For ``Series`` this
will default to 0, i.e. `along the rows`. Must be
``DatetimeIndex``, ``TimedeltaIndex`` or ``PeriodIndex``.
I have added this tentatively to the PR but let me know if something less verbose would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I wasn't clear. There is not much standardization of the descriptions, use whatever makes more sense to you. But the used type is somehow standard, that's what I think we should copy from another docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmrr can you make this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @datapythonista not your fault at all. I somewhat skipped the word type in your request and I thought you meant how axis
is usually documented. Regarding the type and values there's no much standardization either. From set_axis()
in pandas/core/generic.py
I like the following:
{0 or 'index', 1 or 'columns'}, default 0
If we extrapolate this to Series
then that line should be:
axis : {0 or 'index'}, default 0
I will make this change now, hope it sounds good.
pandas/core/generic.py
Outdated
@@ -7385,18 +7401,22 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None, | |||
for all frequency offsets except for 'M', 'A', 'Q', 'BM', | |||
'BA', 'BQ', and 'W' which all have a default of 'right'. | |||
convention : {'start', 'end', 's', 'e'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some default values are missing, like in this case.
pandas/core/generic.py
Outdated
kind: {'timestamp', 'period'}, optional | ||
For ``PeriodIndex`` only, controls whether to use the start or | ||
end of ``rule``. | ||
kind : {'timestamp', 'period'} optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing comma before optional
pandas/core/generic.py
Outdated
freq='A', | ||
periods=2)) | ||
>>> s = pd.Series([1, 2], index=pd.period_range( | ||
... '2012-01-01', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd keep the date in the previous line
pandas/core/generic.py
Outdated
@@ -7577,9 +7598,9 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None, | |||
|
|||
>>> time = pd.date_range('1/1/2000', periods=5, freq='T') | |||
>>> df2 = pd.DataFrame(data=10*[range(4)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pep8 issue, missing spaces around operator.
Codecov Report
@@ Coverage Diff @@
## master #23197 +/- ##
==========================================
- Coverage 92.25% 92.25% -0.01%
==========================================
Files 161 161
Lines 51262 51277 +15
==========================================
+ Hits 47292 47305 +13
- Misses 3970 3972 +2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work, added some comments, mainly to the original code, not your changes.
pandas/core/generic.py
Outdated
|
||
.. deprecated:: 0.18.0 | ||
The new syntax is ``.resample(...).mean()``, or | ||
``.resample(...).apply(<func>)`` | ||
axis : int, optional, default 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmrr can you make this change?
freq='A', | ||
periods=2)) | ||
... freq='A', | ||
... periods=2)) | ||
>>> s | ||
2012 1 | ||
2013 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the following examples, when using resample, it once uses .head()
and the other time it shows 12 months. I don't like any of them. I'd use something like resampling a year to quarters, or a quarter to months... So we can show all the data, and it's just few rows that readers can quickly check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created two examples, one sampling a year into quarters using convention='start'
and another sampling quarters into months using the convention='end'
. If it's too much we can remove one of them.
pandas/core/generic.py
Outdated
@@ -7563,7 +7584,8 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None, | |||
For DataFrame objects, the keyword ``on`` can be used to specify the | |||
column instead of the index for resampling. | |||
|
|||
>>> df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd']) | |||
>>> df = pd.DataFrame(data=9 * [range(4)], | |||
... columns=['a', 'b', 'c', 'd']) | |||
>>> df['time'] = pd.date_range('1/1/2000', periods=9, freq='T') | |||
>>> df.resample('3T', on='time').sum() | |||
a b c d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should show the data before resampling, otherwise it's difficult to see what resample
is doing. Also, we can probably have just two columns instead of 4, name them with something meaningful (like price
and volume
instead of a
and b
), and create the DataFrame
in a nicer way (in a single command, with no range
, may be just a dict).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've revamped this example. Pushing now altogether with the rest of the changes. Agree more meaningful examples make sure everyone understands. Just a bit conscious of being too verbose. But looking forward to your feedback.
pandas/core/generic.py
Outdated
>>> df2 = pd.DataFrame(data=10 * [range(4)], | ||
... columns=['a', 'b', 'c', 'd'], | ||
... index=pd.MultiIndex.from_product([time, [1, 2]]) | ||
... ) | ||
>>> df2.resample('3T', level=0).sum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comments as in the previous example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, also with a different, more realistic approach. Please let me know your comments. I also struggled a bit with the formatting of long lines, opted for styling lines 7627-7632 using black (also compatible with PEP8) but let me know it breaks consistency and should be reverted to the old style. Pushing now.
… convention examples. (pandas-dev#22894)
…er and other cosmetical mods. (pandas-dev#22894)
Thanks for the great feedback @datapythonista. I've pushed some changes as per your comments, I've been a bit creative with some examples, let me know if I went too far :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments, and I think we're ready to merge.
pandas/core/generic.py
Outdated
.. deprecated:: 0.18.0 | ||
The new syntax is ``.resample(...).mean()``, or | ||
``.resample(...).apply(<func>)`` | ||
axis : {0 or 'index'}, default 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
methods in generic.py
are reused for both in Series
and DataFrame
. I guess for DataFrame
we can also use axis='columns'
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, I wasn't sure, I'll add it. Thinking about it now perhaps we could've added more examples with DataFrame
.
pandas/core/generic.py
Outdated
axis : {0 or 'index'}, default 0 | ||
Which axis to use for up- or down-sampling. For ``Series`` this | ||
will default to 0, i.e. `along the rows`. Must be | ||
``DatetimeIndex``, ``TimedeltaIndex`` or ``PeriodIndex``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the must be. I don't think axis must be DatetimeIndex...
Also, don't use backticks around the "along the rows". Backticks are to tell that it's a reference (to a variable, a function, a class...). We don't use them for text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the must be. I don't think axis must be DatetimeIndex...
I took it from the main function description. As rule
must be an offset string which are time-related magnitudes I decided to be explicit here. Also I ran some tests and got TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex
.
In fact, following this logic, on
should also have that sentence: Must be ``DatetimeIndex``, ``TimedeltaIndex`` or ``PeriodIndex``.
For instance:
>>> df = pd.DataFrame({'price': [10, 11, 9, 13, 14, 18, 17, 19], 'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df['time'] = pd.date_range('01/01/2018', periods=8, freq='W')
>>> df
price volume time
0 10 50 2018-01-07
1 11 60 2018-01-14
2 9 40 2018-01-21
3 13 100 2018-01-28
4 14 50 2018-02-04
5 18 100 2018-02-11
6 17 40 2018-02-18
7 19 50 2018-02-25
>>> df.resample('M', on='time').sum()
price volume
time
2018-01-31 43 250
2018-02-28 68 240
I hope I'm not missing anything here... resample
is not a function I've used heavily personally.
Also, don't use backticks around the "along the rows". Backticks are to tell that it's a reference (to a variable, a function, a class...). We don't use them for text.
Slip of keyboard, I meant normal textual quotes ("). Changing.
pandas/core/generic.py
Outdated
kind: {'timestamp', 'period'}, optional | ||
convention : {'start', 'end', 's', 'e'}, default 'start' | ||
For ``PeriodIndex`` only, controls whether to use the start or | ||
end of ``rule``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
single backticks, in both cases, double backticks are for code (including possible values of variables like NaN). But for classes, variables, parameters... we use single backticks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, wasn't clear from the docstring docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would None be with double backtick in examples like "default None
"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. I think it probably should, but we don't have it anywhere, so let's leave it without backticks in the default for now
pandas/core/generic.py
Outdated
loffset : timedelta, default None | ||
Adjust the resampled time labels. | ||
limit : int, default None | ||
Maximum size gap when reindexing with ``fill_method``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
pandas/core/generic.py
Outdated
For a MultiIndex, level (name or number) to use for | ||
resampling. Level must be datetime-like. | ||
resampling. ``level`` must be datetime-like. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
@@ -7422,6 +7443,10 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None, | |||
To learn more about the offset strings, please see `this link | |||
<http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases>`__. | |||
|
|||
See Also | |||
-------- | |||
groupby : Group by mapping, function, label, or list of labels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is for Series
and DataFrame
, I'd add both here too. One of the links will be self-referencing, but the other will point to the equivalent of the other class, which is useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
pandas/core/generic.py
Outdated
to the on or level keyword. | ||
series. Object must have a datetime-like index (``DatetimeIndex``, | ||
``PeriodIndex``, or ``TimedeltaIndex``), or pass datetime-like values | ||
to the ``on`` or ``level`` keyword. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Single backticks.
pandas/core/generic.py
Outdated
rule : str | ||
The offset string or object representing target conversion. | ||
how : str | ||
Method for down-/re-sampling, default to ‘mean’ for downsampling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the possible values? If they are a fixed set, can we have {'mean',...}
? If it's a problem, leave it like it is, as this will go away soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible values are any valid string function that does some form of aggregation. Interestingly, this would also work with the example dataframe I shared above:
>>> df
price volume time
0 10 50 2018-01-07
1 11 60 2018-01-14
2 9 40 2018-01-21
3 13 100 2018-01-28
4 14 50 2018-02-04
5 18 100 2018-02-11
6 17 40 2018-02-18
7 19 50 2018-02-25
>>> df.resample('M', on='time', how={'price': min, 'volume': sum})
price volume
time
2018-01-31 9 250
2018-02-28 14 240
>>> df.resample('M', on='time', how=['min', 'sum'])
price volume
min sum min sum
time
2018-01-31 9 43 40 250
2018-02-28 14 68 40 240
Happy to change the description of the parameter to something like:
how : str, dict, list
Method for down-/re-sampling, default to `mean` for downsampling. Accepted combinations are:
* string function name
* list of string function names
* dict of column names -> string function names (or list of string function names)
I got inspiration from pandas.DataFrame.aggregate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f the name needs to be a string with the function name, I assume it can't be any arbitrary value. But as it's deprecated, let's not spend time on this.
Thanks @datapythonista. I've addressed the changes but it's good you highlighted that as these docs are shared between Please see my comments and suggestions. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, @jbrockmendel do you mind taking a look, and merge if you're happy?
@jmrr looks like the error in the CI is in this PR, a non-ascii character somewhere. Can you take a look and leave the CI green, so we can merge? |
Sure @datapythonista let me have a look. Thanks for the review! |
Looks like it's still failing, but it's an issue with CI setup. The branch is encountering a |
That's weird, I think that happened before and it was fixed. Can you merge master into your branch? May be it's based in the old failing version. |
pandas/core/generic.py
Outdated
|
||
>>> days = pd.date_range('1/1/2000', periods=4, freq='D') | ||
>>> d2 = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19], | ||
... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one more space?
Aside from one missing space, LGTM |
pandas/core/generic.py
Outdated
|
||
>>> days = pd.date_range('1/1/2000', periods=4, freq='D') | ||
>>> d2 = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19], | ||
... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]}) | |
... 'volume': [50, 60, 40, 100, 50, 100, 40, 50]}) |
@jmrr can you run Also, can you remove the |
Thanks for the review @jbrockmendel and @datapythonista. Previous to the branch update with latest master, all tests passed when running Now after the update I encountered some errors, but they're
I spent some time looking but I think this validation is trying to lint the example code as if this wasn't REPL code. For instance, the
So there's two lines before a function. Should I spend some time with this and commit the change or this is something that will get improved with time (e.g. by ignoring that error code). |
Not sure what's the best solution here, but the problem is in how we validate flake8, we surely don't want to leave blank lines that way in here. We'll address that in a separate PR in any case. Thanks for the work on this @jmrr CC: @alimcmaster1 |
…fixed * upstream/master: DOC: Fixes to docstring to add validation to CI (pandas-dev#23560) DOC: Remove incorrect periods at the end of parameter types (pandas-dev#23600) MAINT: tm.assert_raises_regex --> pytest.raises (pandas-dev#23592) DOC: Updating Series.resample and DataFrame.resample docstrings (pandas-dev#23197) ENH: Support for partition_cols in to_parquet (pandas-dev#23321) TST: Use intp as expected dtype in IntervalIndex indexing tests (pandas-dev#23609)
* upstream/master: BUG: Casting tz-aware DatetimeIndex to object-dtype ndarray/Index (pandas-dev#23524) BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544) API: DataFrame.__getitem__ returns Series for sparse column (pandas-dev#23561) CLN: use float64_t consistently instead of double, double_t (pandas-dev#23583) DOC: Fix Order of parameters in docstrings (pandas-dev#23611) TST: Unskip some Categorical Tests (pandas-dev#23613) TST: Fix integer ops comparison test (pandas-dev#23619) DOC: Fixes to docstring to add validation to CI (pandas-dev#23560) DOC: Remove incorrect periods at the end of parameter types (pandas-dev#23600) MAINT: tm.assert_raises_regex --> pytest.raises (pandas-dev#23592) DOC: Updating Series.resample and DataFrame.resample docstrings (pandas-dev#23197)
closes DOC: Fix the docstring of resample in pandas/core/generic.py #22894
tests added / passed:
./scripts/validate_docstrings.py pandas.Series.resample
flake8 --doctests pandas/core/generic.py
passes
git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry --> Needed?
Followed pandas docstring conventions and general reST.
Added deprecation directive on some parameters. Had some doubts but after investigation the recommended API changed after v0.18.0 for these parameters.
Please let me know if I can further improve this.
Thanks!