-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: Enhancing pivot / reshape docs #21038
Conversation
…at did not exist. Also added more examples
…index/column pairs
…op of the section
Hello @VincentLa! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on May 16, 2018 at 15:30 Hours UTC |
I get an error when trying to run
|
@VincentLa do you have your upstream pointing to pandas? (pandas_dev) williams-imac:pandas williamayd$ git remote -v show
origin git@github.com:WillAyd/pandas.git (fetch)
origin git@github.com:WillAyd/pandas.git (push)
upstream https://github.com/pandas-dev/pandas.git (fetch)
upstream https://github.com/pandas-dev/pandas.git (push) If not make sure you do that - it's outlined in the contributing guide: https://pandas.pydata.org/pandas-docs/stable/contributing.html#forking |
@WillAyd I believe I do:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Admittedly haven't looked at this in its rendered form but here's some comments on a first pass
doc/source/reshaping.rst
Outdated
@@ -93,6 +92,12 @@ You can then select subsets from the pivoted ``DataFrame``: | |||
Note that this returns a view on the underlying data in the case where the data | |||
are homogeneously-typed. | |||
|
|||
.. note:: | |||
``pandas.pivot`` will error with a ``ValueError: Index contains duplicate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we convert pandas.pivot
into an inline reference, similar to how you have pandas.pivot_table?
doc/source/reshaping.rst
Outdated
Question 1 | ||
~~~~~~~~~~ | ||
|
||
How do I pivot ``df`` such that the ``col`` values are columns, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double backticks are for literals, where single backticks are for inline code / argument refs. Can you change any argument reference (i.e. df
, col
, etc...) to single backticks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd while that makes sense, this seems inconsistent with how the single backticks and double backticks are being used elsewhere in this doc. It also seems like the double backticks look better in the docs itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK - that's a fair point. @TomAugspurger do you know if there's an official stance on this? Worth updating in a separate issue?
doc/source/reshaping.rst
Outdated
~~~~~~~~~~ | ||
|
||
How do I pivot ``df`` such that the ``col`` values are columns, | ||
``row`` values are the index, and mean of ``val0`` are the values? In |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
", and the mean"
doc/source/reshaping.rst
Outdated
|
||
.. ipython:: python | ||
|
||
np.random.seed([3,1415]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to choose a list as a seed here? Not saying there's anything wrong with it per se, just have always seen an int literal like 12345
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No real reason. I'm basing these examples off of the StackOverflow post originally linked in the issue: #19089.
doc/source/reshaping.rst
Outdated
Question 4 | ||
~~~~~~~~~~ | ||
|
||
How can I Group By over multiple columns? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure "Group By" is the right terminology to use here - any reason in particular you went with that?
Wondering if in general it wouldn't be more concise and easier to word if we did away with the "Q/A" format and just preceded each example with something like "Multiple values can be used at once" and leaving it to the example to highlight the effect of that
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -1375,6 +1375,7 @@ Reshaping | |||
- Bug in :func:`isna`, which cannot handle ambiguous typed lists (:issue:`20675`) | |||
- Bug in :func:`concat` which raises an error when concatenating TZ-aware dataframes and all-NaT dataframes (:issue:`12396`) | |||
- Bug in :func:`concat` which raises an error when concatenating empty TZ-aware series (:issue:`18447`) | |||
- Updated :func:`~pandas.pivot_table` with more comprehensive examples. Also updated Reshaping and Pivot Tables documentation with a Frequenty Asked Questions example (:issue:`19089`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't typically add a whatsnew note for documentation-only updates so you can remove this
foo one 4 1 | ||
two NaN 6 | ||
|
||
We can also fill missing values using the `fill_value` parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth calling out in this example that providing the fill_value has preserved the int
dtype, instead of casting to float as np.nan
would
pandas/core/frame.py
Outdated
foo one 4 1 | ||
two 0 6 | ||
|
||
The next example aggregates by taking the mean using values for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"mean across multiple columns" reads a little easier than "mean using values for multiple columns" IMO
Hmm have you fetched or pulled anything from upstream yet then? Perhaps |
Codecov Report
@@ Coverage Diff @@
## master #21038 +/- ##
=======================================
Coverage 92.25% 92.25%
=======================================
Files 161 161
Lines 51200 51200
=======================================
Hits 47232 47232
Misses 3968 3968
Continue to review full report at Codecov.
|
doc/source/reshaping.rst
Outdated
.. note:: | ||
If you just want to handle one column as a categorical variable (like R's factor), | ||
you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or | ||
``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`, | ||
see the :ref:`Categorical introduction <categorical>` and the | ||
:ref:`API documentation <api.categorical>`. | ||
|
||
Frequently Asked Questions (and Examples) | ||
------------------ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs to be the same length as the title (how about just make this title Examples)?
) | ||
|
||
df | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to have these as Question, rather just make an informative title.
np.random.seed([3,1415]) | ||
n = 20 | ||
|
||
cols = np.array(['key', 'row', 'item', 'col']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just do
In [12]: cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
Out[12]:
0 1 2 3
0 key1 row3 item2 col0
1 key0 row2 item1 col4
2 key1 row1 item0 col2
3 key1 row1 item0 col1
4 key0 row3 item1 col2
5 key1 row0 item2 col4
6 key2 row2 item0 col3
7 key2 row0 item2 col2
8 key1 row1 item0 col1
9 key0 row4 item0 col4
10 key0 row0 item1 col2
11 key0 row4 item1 col4
12 key0 row4 item2 col1
13 key1 row1 item1 col1
14 key1 row0 item2 col4
15 key2 row2 item1 col0
16 key2 row2 item2 col0
17 key0 row3 item0 col2
18 key1 row0 item1 col4
19 key0 row3 item1 col2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Refactored a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates. Couple more comments - may have more whenever this gets pushed to the nightly doc build
@@ -93,6 +92,12 @@ You can then select subsets from the pivoted ``DataFrame``: | |||
Note that this returns a view on the underlying data in the case where the data | |||
are homogeneously-typed. | |||
|
|||
.. note:: | |||
:func:`~pandas.pivot` will error with a ``ValueError: Index contains duplicate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this render? Might need a space after directive
doc/source/reshaping.rst
Outdated
Examples | ||
-------- | ||
|
||
In this section, we will review frequently asked questions and examples. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we got rid of the Q and A format don't need this intro
n = 20 | ||
|
||
cols = np.array(['key', 'row', 'item', 'col']) | ||
df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit but can add the columns to the constructor and get rid of the line below
cols = np.array(['key', 'row', 'item', 'col']) | ||
df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)) | ||
df.columns = cols | ||
df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stylistic nit but I think it would be better to use pd.concat
instead of join
here
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Suppose we wanted to pivot ``df`` such that the ``col`` values are columns, | ||
``row`` values are the index, and the mean of ``val0`` are the values? In |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a question, so replace ?
with .
df.pivot_table( | ||
values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean']) | ||
|
||
Note to subdivide over multiple columns we can pass in a list to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for readability we don't need to start each of these with "Note"
can you update |
can we rebase this @VincentLa (or @datapythonista ) |
@pandas-dev/pandas-core if someone has a chance to rebase this |
Updated. Didn't make any changes other than removing whitespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed a pep8 issue, lgtm.
Updated again, hopefully CI will pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VincentLa can you merge master and address the outstanding questions.
|
||
.. code-block:: ipython | ||
|
||
col col0 col1 col2 col3 col4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be a ipython block
@jreback this PR is discontinued. Tom and I made changes to it, so it can be merged (some improvements are left as later work, but I think the current version is correct and an improvement to what we have). Otherwise we'll have to close, or keep making the improvements ourselves. |
thanks @datapythonista |
…fixed * upstream/master: DOC: Enhancing pivot / reshape docs (pandas-dev#21038) TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620) BUILD: Simplifying contributor dependencies (pandas-dev#23522) BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539) BUG: Casting tz-aware DatetimeIndex to object-dtype ndarray/Index (pandas-dev#23524) BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544) API: DataFrame.__getitem__ returns Series for sparse column (pandas-dev#23561) CLN: use float64_t consistently instead of double, double_t (pandas-dev#23583) DOC: Fix Order of parameters in docstrings (pandas-dev#23611) TST: Unskip some Categorical Tests (pandas-dev#23613) TST: Fix integer ops comparison test (pandas-dev#23619)
* upstream/master: BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527) DEPR: Deprecate usecols as int in read_excel (pandas-dev#23635) More helpful Stata string length error. (pandas-dev#23629) BUG: astype fill_value for SparseArray.astype (pandas-dev#23547) CLN: datetimelike arrays: isort, small reorg (pandas-dev#23587) CI: Check in the CI that assert_raises_regex is not being used (pandas-dev#23627) CLN:Remove unused **kwargs from user facing methods (pandas-dev#23249) DOC: Enhancing pivot / reshape docs (pandas-dev#21038) TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)
git diff upstream/master -u -- "*.py" | flake8 --diff
Enhancing pivot / reshape docs
Added more examples and added Q + A section.