docs(python): Rewrite pivot and melt docstrings #13693

Open
wants to merge 13 commits into
base: main
from
28 changes: 26 additions & 2 deletions docs/user-guide/transformations/melt.md
@@ -1,6 +1,28 @@
# Melts

Melt operations unpivot a DataFrame from wide format to long format
`melt` is the opposite of `pivot`: it transforms a "wide-format" `DataFrame`,
where each element represents an observation, into a "long-format" one, where
each row represents an observation.

To perform a melt, specify one or more columns as identifier variables (via the
`id_vars` argument) and other columns as value variables (via the `value_vars`
argument), either by name or via selectors. Typically, the columns in `id_vars`
and `value_vars` are mutually exclusive; specifying overlapping columns will
not give an error, but is rarely useful. If `value_vars` is `None`, all
remaining columns not in `id_vars` will be treated as `value_vars`.

Each element in each of the `value_vars` columns of the input `DataFrame`
(including `null` elements) will become its own row in the output `DataFrame`.
The row for that element will contain `len(id_vars) + 2` columns:

- One column for each of the `id_vars`, containing the values of the `id_vars`
columns that were on the same row as that element in the input `DataFrame`. You
can think of these as the element's row names.
- One column called `'variable'` containing the name of the column in which
that element appeared, i.e. the element's column name. You can change the
name of this column by specifying the `variable_name` argument.
- One column called `'value'` containing the element itself. You can change the
name of this column by specifying the `value_name` argument.
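
The row-by-row rule above can be modelled in plain Python. This is only a sketch of the described behaviour on a dict of column lists, not Polars' implementation; the helper name `melt_model` and the output row order are assumptions of the sketch.

```python
def melt_model(columns, id_vars, value_vars=None,
               variable_name="variable", value_name="value"):
    """Plain-Python model of melt on a dict of equal-length column lists."""
    if value_vars is None:
        # Every column not used as an identifier becomes a value column.
        value_vars = [name for name in columns if name not in id_vars]
    n_rows = len(next(iter(columns.values()), []))
    out = {name: [] for name in [*id_vars, variable_name, value_name]}
    for var in value_vars:
        for i in range(n_rows):
            # One output row per element, null (None) elements included.
            for key in id_vars:
                out[key].append(columns[key][i])
            out[variable_name].append(var)
            out[value_name].append(columns[var][i])
    return out


wide = {"id": ["a", "b"], "x": [1, None], "y": [3, 4]}
long = melt_model(wide, id_vars=["id"])
print(long)
```

The two value columns with two elements each produce four output rows, and the output has `len(id_vars) + 2 == 3` columns.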

## Dataset

@@ -12,7 +34,9 @@ Melt operations unpivot a DataFrame from wide format to long format

## Eager + lazy

`Eager` and `lazy` have the same API.
Unlike `pivot`, `melt` works in both eager and lazy mode, with the same API.
This is because all the column names in the output `DataFrame` are known in
advance, and do not depend on the data.
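
To make the schema argument concrete, here is a small sketch (the helper `melt_output_columns` is hypothetical, not part of Polars) showing that the output column names of a melt follow from the arguments alone, without looking at any data:

```python
def melt_output_columns(id_vars, variable_name="variable", value_name="value"):
    """Output column names of a melt, derived from the arguments only."""
    return [*id_vars, variable_name, value_name]


default_names = melt_output_columns(["id"])
renamed = melt_output_columns(["id", "date"], variable_name="metric", value_name="reading")
print(default_names)
print(renamed)
```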

{{code_block('user-guide/transformations/melt','melt',['melt'])}}

86 changes: 70 additions & 16 deletions docs/user-guide/transformations/pivot.md
@@ -1,17 +1,71 @@
# Pivots

Pivot a column in a `DataFrame` and perform one of the following aggregations:
`pivot` transforms a "long-format" `DataFrame`, where each row represents an
observation, into a "wide-format" one, where each element represents an
observation.

- first
- sum
- min
- max
- mean
- median
To perform a pivot, specify one or more columns for each of `values`, `index`,
and `columns`, either by name or via selectors. Typically, the columns in
`values`, `index`, and `columns` are mutually exclusive; specifying overlapping
columns will not give an error, but is rarely useful.

The pivot operation consists of a group by one, or multiple columns (these will be the
new y-axis), the column that will be pivoted (this will be the new x-axis) and an
aggregation.
In the simplest case where `values`, `index` and `columns` are each a single
column:

- Each unique value of the `index` column will become the name of a row in
the pivoted `DataFrame`. The first column of the pivoted `DataFrame` will
contain these row names.
- Each unique value of the `columns` column will become the name of a column
in the pivoted `DataFrame`.
- Each value of the `values` column will become a value in the pivoted
`DataFrame`. For instance, if the nth row of the input `DataFrame` is
`("values_n", "index_n", "columns_n")`, then the value `"values_n"` will
be placed at row `"index_n"` (i.e. the row where the `index` column has
the value `index_n`) and column `"columns_n"`.

Thus, if there are `N` unique values in the `columns` column, there will be
`N + 1` columns in the pivoted `DataFrame`: one for the row names, the
remaining `N` for the values.
Collaborator

I think this is only true if there's a single index column? Which is fine, you explain what happens with multiple index columns below, I just think it needs caveating here


If there are multiple `index` columns instead of one, each unique _combination_
of their values will become a row in the pivoted `DataFrame`, and there will be
`len(index)` columns of row names instead of one.
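
The placement rule can be sketched in plain Python for a single `values` column, a single `columns` column, and one or more `index` columns. This models the behaviour described above on a list of row dicts and is not Polars' implementation; the helper name `pivot_model` and the sample data are hypothetical.

```python
def pivot_model(rows, values, index, columns):
    """Plain-Python model of pivot on a list of row dicts (no aggregation)."""
    index = [index] if isinstance(index, str) else list(index)
    cells = {}        # index combination -> {new column name: value}
    new_columns = []  # unique values of the `columns` column, in order seen
    for row in rows:
        key = tuple(row[name] for name in index)
        col = row[columns]
        if col not in new_columns:
            new_columns.append(col)
        group = cells.setdefault(key, {})
        if col in group:
            raise ValueError(f"duplicate entry for index {key} and column {col!r}")
        group[col] = row[values]
    # Combinations that never occur in the input are left as None in this sketch.
    return [
        {**dict(zip(index, key)), **{c: group.get(c) for c in new_columns}}
        for key, group in cells.items()
    ]


long_rows = [
    {"city": "Oslo", "year": 2023, "metric": "temp", "reading": 5},
    {"city": "Oslo", "year": 2023, "metric": "rain", "reading": 80},
    {"city": "Oslo", "year": 2024, "metric": "temp", "reading": 6},
    {"city": "Rome", "year": 2023, "metric": "temp", "reading": 16},
]
wide_rows = pivot_model(long_rows, values="reading", index=["city", "year"], columns="metric")
for row in wide_rows:
    print(row)
```

With two `index` columns and two unique `metric` values, each output row has `len(index) + N == 4` entries, and there is one row per unique `(city, year)` combination.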

If there are multiple `columns` columns instead of one, the result will be the
same as if you had combined them into a single `struct` column beforehand. In
other words, `df.pivot(..., columns=['a', 'b', 'c'])` is equivalent to
`df.with_columns(foo=pl.struct(['a', 'b', 'c'])).pivot(..., columns='foo')`,
assuming `foo` is not already a column in `df`.

If there are multiple `values` columns instead of one, the pivot will be done
independently for each of the columns in `values`, and the results will be
concatenated horizontally. To avoid having duplicate column names, the names
of the non-index columns will be prefixed with `f'{value}_{columns}_'`, where
`value` is the column name in `values` from which the column's values are
taken. The `'_'` can be changed to a different string using the `separator`
argument.
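
As an illustration of the naming rule just described, the sketch below builds the non-index column names from the stated prefix. The helper `pivoted_value_names` is hypothetical, and the exact names Polars produces may differ between versions.

```python
def pivoted_value_names(values, columns, unique_values, separator="_"):
    """Names of the non-index columns when pivoting several `values` columns."""
    return [
        f"{value}{separator}{columns}{separator}{unique}"
        for value in values          # one block of columns per values column
        for unique in unique_values  # one column per unique `columns` value
    ]


names = pivoted_value_names(["price", "qty"], "shop", ["a", "b"])
dashed = pivoted_value_names(["price"], "shop", ["a"], separator="-")
print(names)
print(dashed)
```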

When multiple rows of the input `DataFrame` have the same values for all the
columns in `index` and `columns`, `pivot` will raise an error unless these
multiple values are aggregated into a single value before pivoting. This can be
done prior to pivoting with a `group_by`, but `pivot` also provides a
convenient way to do this aggregation internally, by specifying the
`aggregate_function` argument. You can specify one of 8 predefined aggregation
functions as strings:

- `'first'`
- `'last'`
- `'sum'`
- `'max'`
- `'min'`
- `'mean'`
- `'median'`
- `'len'`

or provide an expression that performs a custom aggregation, where
`pl.element()` represents the multiple `values` in each "group" with the same
`index` and `columns`. For example, `aggregate_function='mean'` is short for
`aggregate_function=pl.element().mean()`.
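
The grouping behind `aggregate_function` can be modelled in plain Python as follows. This is a sketch of the described semantics for single `index` and `columns` columns, not Polars' implementation; `pivot_agg_model` and the sample data are hypothetical, and ordinary Python callables stand in for the aggregation.

```python
from statistics import mean


def pivot_agg_model(rows, values, index, columns, aggregate=None):
    """Group values by (index, columns) and reduce each group to one cell."""
    groups = {}
    for row in rows:
        groups.setdefault((row[index], row[columns]), []).append(row[values])
    out = {}  # index value -> {new column name: cell}
    for (idx, col), group in groups.items():
        if aggregate is None:
            if len(group) > 1:
                # Without an aggregation, duplicates cannot fit in one cell.
                raise ValueError(f"multiple values for ({idx!r}, {col!r})")
            cell = group[0]
        else:
            cell = aggregate(group)
        out.setdefault(idx, {})[col] = cell
    return out


sales = [
    {"shop": "a", "day": "mon", "sales": 10},
    {"shop": "a", "day": "mon", "sales": 30},
    {"shop": "a", "day": "tue", "sales": 5},
]
by_mean = pivot_agg_model(sales, "sales", "shop", "day", aggregate=mean)
by_len = pivot_agg_model(sales, "sales", "shop", "day", aggregate=len)
print(by_mean)
print(by_len)
```

Calling `pivot_agg_model(sales, "sales", "shop", "day")` without an aggregation raises a `ValueError`, because two rows share the same `("a", "mon")` group.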

## Dataset

@@ -32,12 +86,12 @@ aggregation.

## Lazy

A Polars `LazyFrame` always need to know the schema of a computation statically (before collecting the query).
As a pivot's output schema depends on the data, and it is therefore impossible to determine the schema without
running the query.

Polars could have abstracted this fact for you just like Spark does, but we don't want you to shoot yourself in the foot
with a shotgun. The cost should be clear upfront.
A Polars `LazyFrame` always needs to know the schema of a computation statically
(before collecting the query). Since the schema of a pivoted `DataFrame` depends
on the data, it is impossible to determine the schema without running the
query. As a result, `pivot` is not available in lazy mode. To use `pivot`
in a `LazyFrame` pipe chain, you must include a `collect()` before pivoting and
a `lazy()` after pivoting:

{{code_block('user-guide/transformations/pivot','lazy',['pivot'])}}
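
To see why the pivoted schema depends on the data, consider this plain-Python sketch. The helper `pivot_output_columns` and the sample data are hypothetical; it only models which column names a pivot would produce for a single `index` column.

```python
def pivot_output_columns(rows, index, columns):
    """Column names of a pivot: the index column plus one per unique value."""
    seen = []
    for row in rows:
        if row[columns] not in seen:
            seen.append(row[columns])
    return [index, *seen]


# Two inputs with identical input schemas (shop, day, sales) ...
january = [
    {"shop": "a", "day": "mon", "sales": 1},
    {"shop": "a", "day": "tue", "sales": 2},
]
february = [{"shop": "a", "day": "wed", "sales": 3}]

# ... give different output schemas, which can only be found by reading the rows.
jan_columns = pivot_output_columns(january, "shop", "day")
feb_columns = pivot_output_columns(february, "shop", "day")
print(jan_columns)
print(feb_columns)
```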
