Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Enhancing pivot / reshape docs #21038

Merged
merged 15 commits into from
Nov 12, 2018
Merged

Conversation

VincentLa
Copy link
Contributor

@VincentLa VincentLa commented May 14, 2018

Enhancing pivot / reshape docs

Added more examples and added Q + A section.

@pep8speaks
Copy link

pep8speaks commented May 14, 2018

Hello @VincentLa! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 16, 2018 at 15:30 Hours UTC

@VincentLa
Copy link
Contributor Author

I get an error when trying to run

(pandas-dev) VincentLa@ch-C02Q6HPBG8WN pandas-vincentla (master) $ git diff upstream/master -u -- "*.py" | flake8 --diff
fatal: bad revision 'upstream/master'

@WillAyd
Copy link
Member

WillAyd commented May 14, 2018

@VincentLa do you have your upstream pointing to pandas?

(pandas_dev) williams-imac:pandas williamayd$ git remote -v show
origin	git@github.com:WillAyd/pandas.git (fetch)
origin	git@github.com:WillAyd/pandas.git (push)
upstream	https://github.com/pandas-dev/pandas.git (fetch)
upstream	https://github.com/pandas-dev/pandas.git (push)

If not make sure you do that - it's outlined in the contributing guide:

https://pandas.pydata.org/pandas-docs/stable/contributing.html#forking

@VincentLa
Copy link
Contributor Author

@WillAyd I believe I do:

VincentLa@ch-C02Q6HPBG8WN pandas-vincentla (master) $ git remote -v show
origin	https://github.com/VincentLa/pandas.git (fetch)
origin	https://github.com/VincentLa/pandas.git (push)
upstream	https://github.com/pandas-dev/pandas.git (fetch)
upstream	https://github.com/pandas-dev/pandas.git (push)

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Admittedly haven't looked at this in its rendered form but here's some comments on a first pass

@@ -93,6 +92,12 @@ You can then select subsets from the pivoted ``DataFrame``:
Note that this returns a view on the underlying data in the case where the data
are homogeneously-typed.

.. note::
``pandas.pivot`` will error with a ``ValueError: Index contains duplicate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we convert pandas.pivot into an inline reference, similar to how you have pandas.pivot_table?

Question 1
~~~~~~~~~~

How do I pivot ``df`` such that the ``col`` values are columns,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double backticks are for literals, where single backticks are for inline code / argument refs. Can you change any argument reference (i.e. df, col, etc...) to single backticks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd while that makes sense, this seems inconsistent with how the single backticks and double backticks are being used elsewhere in this doc. It also seems like the double backticks look better in the docs itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm OK - that's a fair point. @TomAugspurger do you know if there's an official stance on this? Worth updating in a separate issue?

~~~~~~~~~~

How do I pivot ``df`` such that the ``col`` values are columns,
``row`` values are the index, and mean of ``val0`` are the values? In
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

", and the mean"


.. ipython:: python

np.random.seed([3,1415])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to choose a list as a seed here? Not saying there's anything wrong with it per se, just have always seen an int literal like 12345

Copy link
Contributor Author

@VincentLa VincentLa May 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No real reason. I'm basing these examples off of the StackOverflow post originally linked in the issue: #19089.

Question 4
~~~~~~~~~~

How can I Group By over multiple columns?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure "Group By" is the right terminology to use here - any reason in particular you went with that?

Wondering if in general it wouldn't be more concise and easier to word if we did away with the "Q/A" format and just preceded each example with something like "Multiple values can be used at once" and leaving it to the example to highlight the effect of that

@@ -1375,6 +1375,7 @@ Reshaping
- Bug in :func:`isna`, which cannot handle ambiguous typed lists (:issue:`20675`)
- Bug in :func:`concat` which raises an error when concatenating TZ-aware dataframes and all-NaT dataframes (:issue:`12396`)
- Bug in :func:`concat` which raises an error when concatenating empty TZ-aware series (:issue:`18447`)
- Updated :func:`~pandas.pivot_table` with more comprehensive examples. Also updated Reshaping and Pivot Tables documentation with a Frequenty Asked Questions example (:issue:`19089`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't typically add a whatsnew note for documentation-only updates so you can remove this

foo one 4 1
two NaN 6

We can also fill missing values using the `fill_value` parameter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth calling out in this example that providing the fill_value has preserved the int dtype, instead of casting to float as np.nan would

foo one 4 1
two 0 6

The next example aggregates by taking the mean using values for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"mean across multiple columns" reads a little easier than "mean using values for multiple columns" IMO

@WillAyd
Copy link
Member

WillAyd commented May 14, 2018

Hmm have you fetched or pulled anything from upstream yet then? Perhaps git fetch upstream will resolve that issue locally?

@codecov
Copy link

codecov bot commented May 14, 2018

Codecov Report

Merging #21038 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21038   +/-   ##
=======================================
  Coverage   92.25%   92.25%           
=======================================
  Files         161      161           
  Lines       51200    51200           
=======================================
  Hits        47232    47232           
  Misses       3968     3968
Flag Coverage Δ
#multiple 90.63% <ø> (ø) ⬆️
#single 42.29% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/frame.py 97.03% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf4c0b6...67112b8. Read the comment docs.

.. note::
If you just want to handle one column as a categorical variable (like R's factor),
you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
see the :ref:`Categorical introduction <categorical>` and the
:ref:`API documentation <api.categorical>`.

Frequently Asked Questions (and Examples)
------------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to be the same length as the title (how about just make this title Examples)?

)

df

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to have these as Question, rather just make an informative title.

np.random.seed([3,1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
Copy link
Contributor

@jreback jreback May 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just do

In [12]: cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
Out[12]: 
       0     1      2     3
0   key1  row3  item2  col0
1   key0  row2  item1  col4
2   key1  row1  item0  col2
3   key1  row1  item0  col1
4   key0  row3  item1  col2
5   key1  row0  item2  col4
6   key2  row2  item0  col3
7   key2  row0  item2  col2
8   key1  row1  item0  col1
9   key0  row4  item0  col4
10  key0  row0  item1  col2
11  key0  row4  item1  col4
12  key0  row4  item2  col1
13  key1  row1  item1  col1
14  key1  row0  item2  col4
15  key2  row2  item1  col0
16  key2  row2  item2  col0
17  key0  row3  item0  col2
18  key1  row0  item1  col4
19  key0  row3  item1  col2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Refactored a bit.

@jreback jreback added Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels May 15, 2018
@VincentLa
Copy link
Contributor Author

@jreback @WillAyd made changes based on the feedback! Would love additional review.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. Couple more comments - may have more whenever this gets pushed to the nightly doc build

@@ -93,6 +92,12 @@ You can then select subsets from the pivoted ``DataFrame``:
Note that this returns a view on the underlying data in the case where the data
are homogeneously-typed.

.. note::
:func:`~pandas.pivot` will error with a ``ValueError: Index contains duplicate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this render? Might need a space after directive

Examples
--------

In this section, we will review frequently asked questions and examples. The
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we got rid of the Q and A format don't need this intro

n = 20

cols = np.array(['key', 'row', 'item', 'col'])
df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit but can add the columns to the constructor and get rid of the line below

cols = np.array(['key', 'row', 'item', 'col'])
df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
df.columns = cols
df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistic nit but I think it would be better to use pd.concat instead of join here

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Suppose we wanted to pivot ``df`` such that the ``col`` values are columns,
``row`` values are the index, and the mean of ``val0`` are the values? In
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a question, so replace ? with .

df.pivot_table(
values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])

Note to subdivide over multiple columns we can pass in a list to the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for readability we don't need to start each of these with "Note"

@jreback
Copy link
Contributor

jreback commented Jun 19, 2018

can you update

@datapythonista datapythonista self-assigned this Jul 22, 2018
@jreback
Copy link
Contributor

jreback commented Oct 11, 2018

can we rebase this @VincentLa (or @datapythonista )

@jreback
Copy link
Contributor

jreback commented Nov 1, 2018

@pandas-dev/pandas-core if someone has a chance to rebase this

@TomAugspurger
Copy link
Contributor

Updated. Didn't make any changes other than removing whitespace.

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed a pep8 issue, lgtm.

@TomAugspurger
Copy link
Contributor

Updated again, hopefully CI will pass.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VincentLa can you merge master and address the outstanding questions.


.. code-block:: ipython

col col0 col1 col2 col3 col4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a ipython block

@datapythonista
Copy link
Member

@jreback this PR is discontinued. Tom and I made changes to it, so it can be merged (some improvements are left as later work, but I think the current version is correct and an improvement to what we have). Otherwise we'll have to close, or keep making the improvements ourselves.

@jreback jreback merged commit dcb8b6a into pandas-dev:master Nov 12, 2018
@jreback
Copy link
Contributor

jreback commented Nov 12, 2018

thanks @datapythonista

thoo added a commit to thoo/pandas that referenced this pull request Nov 12, 2018
…fixed

* upstream/master:
  DOC: Enhancing pivot / reshape docs (pandas-dev#21038)
  TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)
  BUILD: Simplifying contributor dependencies (pandas-dev#23522)
  BUG/REF: TimedeltaIndex.__new__ (pandas-dev#23539)
  BUG: Casting tz-aware DatetimeIndex to object-dtype ndarray/Index (pandas-dev#23524)
  BUG: Delegate more of Excel parsing to CSV (pandas-dev#23544)
  API: DataFrame.__getitem__ returns Series for sparse column (pandas-dev#23561)
  CLN: use float64_t consistently instead of double, double_t (pandas-dev#23583)
  DOC: Fix Order of parameters in docstrings (pandas-dev#23611)
  TST: Unskip some Categorical Tests (pandas-dev#23613)
  TST: Fix integer ops comparison test (pandas-dev#23619)
thoo added a commit to thoo/pandas that referenced this pull request Nov 12, 2018
* upstream/master:
  BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527)
  DEPR: Deprecate usecols as int in read_excel (pandas-dev#23635)
  More helpful Stata string length error. (pandas-dev#23629)
  BUG: astype fill_value for SparseArray.astype (pandas-dev#23547)
  CLN: datetimelike arrays: isort, small reorg (pandas-dev#23587)
  CI: Check in the CI that assert_raises_regex is not being used (pandas-dev#23627)
  CLN:Remove unused **kwargs from user facing methods (pandas-dev#23249)
  DOC: Enhancing pivot / reshape docs (pandas-dev#21038)
  TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC: enhance pivot / reshape docs
7 participants