Additional DOC and BUG fix related to merging with mix of columns and… (
Dr-Irv authored and jreback committed Dec 4, 2018
1 parent 1573340 commit d7e96d8
Showing 5 changed files with 125 additions and 67 deletions.
143 changes: 87 additions & 56 deletions doc/source/merging.rst
@@ -31,10 +31,10 @@ operations.
Concatenating objects
---------------------

-The :func:`~pandas.concat` function (in the main pandas namespace) does all of
-the heavy lifting of performing concatenation operations along an axis while
-performing optional set logic (union or intersection) of the indexes (if any) on
-the other axes. Note that I say "if any" because there is only a single possible
+The :func:`~pandas.concat` function (in the main pandas namespace) does all of
+the heavy lifting of performing concatenation operations along an axis while
+performing optional set logic (union or intersection) of the indexes (if any) on
+the other axes. Note that I say "if any" because there is only a single possible
axis of concatenation for Series.

Before diving into all of the details of ``concat`` and what it can do, here is
@@ -109,9 +109,9 @@ some configurable handling of "what to do with the other axes":
to the actual data concatenation.
* ``copy`` : boolean, default True. If False, do not copy data unnecessarily.

-Without a little bit of context many of these arguments don't make much sense.
-Let's revisit the above example. Suppose we wanted to associate specific keys
-with each of the pieces of the chopped up DataFrame. We can do this using the
+Without a little bit of context many of these arguments don't make much sense.
+Let's revisit the above example. Suppose we wanted to associate specific keys
+with each of the pieces of the chopped up DataFrame. We can do this using the
``keys`` argument:

.. ipython:: python
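As a rough illustration of the ``keys`` argument discussed in this hunk, here is a minimal sketch. The frames ``df1``, ``df2`` and ``df3`` are made up for the example and are not the ones used on the documentation page.

```python
import pandas as pd

# Three small, hypothetical pieces of a larger frame.
df1 = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A2", "A3"]}, index=[2, 3])
df3 = pd.DataFrame({"A": ["A4", "A5"]}, index=[4, 5])

# keys= adds an outer index level that labels each piece.
result = pd.concat([df1, df2, df3], keys=["x", "y", "z"])

# Selecting on the outer level recovers the original piece.
piece_y = result.loc["y"]
```

The result carries a two-level index, so each piece stays addressable by its key after concatenation.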
@@ -138,9 +138,9 @@ It's not a stretch to see how this can be very useful. More detail on this
functionality below.

.. note::
-It is worth noting that :func:`~pandas.concat` (and therefore
-:func:`~pandas.append`) makes a full copy of the data, and that constantly
-reusing this function can create a significant performance hit. If you need
+It is worth noting that :func:`~pandas.concat` (and therefore
+:func:`~pandas.append`) makes a full copy of the data, and that constantly
+reusing this function can create a significant performance hit. If you need
to use the operation over several datasets, use a list comprehension.

::
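The advice in the note above (collect the pieces first, then concatenate once) can be sketched as follows. ``load_chunk`` is a hypothetical stand-in for whatever produces each dataset; it is not part of pandas.

```python
import pandas as pd


def load_chunk(i):
    """Hypothetical loader; stands in for reading one file or query result."""
    return pd.DataFrame({"chunk": [i, i], "value": [i * 10, i * 10 + 1]})


# Build every piece with a list comprehension, then concatenate a single time,
# instead of growing a frame inside a loop (which copies the data on each step).
frames = [load_chunk(i) for i in range(3)]
combined = pd.concat(frames, ignore_index=True)
```

Concatenating once means the full copy the note warns about happens a single time rather than once per dataset.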
@@ -224,8 +224,8 @@ DataFrame:
Concatenating using ``append``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-A useful shortcut to :func:`~pandas.concat` are the :meth:`~DataFrame.append`
-instance methods on ``Series`` and ``DataFrame``. These methods actually predated
+A useful shortcut to :func:`~pandas.concat` are the :meth:`~DataFrame.append`
+instance methods on ``Series`` and ``DataFrame``. These methods actually predated
``concat``. They concatenate along ``axis=0``, namely the index:

.. ipython:: python
@@ -271,8 +271,8 @@ need to be:
.. note::

-Unlike the :py:meth:`~list.append` method, which appends to the original list
-and returns ``None``, :meth:`~DataFrame.append` here **does not** modify
+Unlike the :py:meth:`~list.append` method, which appends to the original list
+and returns ``None``, :meth:`~DataFrame.append` here **does not** modify
``df1`` and returns its copy with ``df2`` appended.

.. _merging.ignore_index:
@@ -370,9 +370,9 @@ Passing ``ignore_index=True`` will drop all name references.
More concatenating with group keys
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-A fairly common use of the ``keys`` argument is to override the column names
+A fairly common use of the ``keys`` argument is to override the column names
when creating a new ``DataFrame`` based on existing ``Series``.
-Notice how the default behaviour consists on letting the resulting ``DataFrame``
+Notice how the default behaviour consists on letting the resulting ``DataFrame``
inherit the parent ``Series``' name, when these existed.

.. ipython:: python
@@ -468,7 +468,7 @@ Appending rows to a DataFrame
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

While not especially efficient (since a new object must be created), you can
-append a single row to a ``DataFrame`` by passing a ``Series`` or dict to
+append a single row to a ``DataFrame`` by passing a ``Series`` or dict to
``append``, which returns a new ``DataFrame`` as above.

.. ipython:: python
@@ -513,15 +513,15 @@ pandas has full-featured, **high performance** in-memory join operations
idiomatically very similar to relational databases like SQL. These methods
perform significantly better (in some cases well over an order of magnitude
better) than other open source implementations (like ``base::merge.data.frame``
-in R). The reason for this is careful algorithmic design and the internal layout
+in R). The reason for this is careful algorithmic design and the internal layout
of the data in ``DataFrame``.

See the :ref:`cookbook<cookbook.merge>` for some advanced strategies.

Users who are familiar with SQL but new to pandas might be interested in a
:ref:`comparison with SQL<compare_with_sql.join>`.

-pandas provides a single function, :func:`~pandas.merge`, as the entry point for
+pandas provides a single function, :func:`~pandas.merge`, as the entry point for
all standard database join operations between ``DataFrame`` or named ``Series`` objects:

::
@@ -590,7 +590,7 @@ The return type will be the same as ``left``. If ``left`` is a ``DataFrame`` or
and ``right`` is a subclass of ``DataFrame``, the return type will still be ``DataFrame``.

``merge`` is a function in the pandas namespace, and it is also available as a
-``DataFrame`` instance method :meth:`~DataFrame.merge`, with the calling
+``DataFrame`` instance method :meth:`~DataFrame.merge`, with the calling
``DataFrame`` being implicitly considered the left object in the join.

The related :meth:`~DataFrame.join` method, uses ``merge`` internally for the
@@ -602,7 +602,7 @@ Brief primer on merge methods (relational algebra)

Experienced users of relational databases like SQL will be familiar with the
terminology used to describe join operations between two SQL-table like
-structures (``DataFrame`` objects). There are several cases to consider which
+structures (``DataFrame`` objects). There are several cases to consider which
are very important to understand:

* **one-to-one** joins: for example when joining two ``DataFrame`` objects on
@@ -642,8 +642,8 @@ key combination:
labels=['left', 'right'], vertical=False);
plt.close('all');
-Here is a more complicated example with multiple join keys. Only the keys
-appearing in ``left`` and ``right`` are present (the intersection), since
+Here is a more complicated example with multiple join keys. Only the keys
+appearing in ``left`` and ``right`` are present (the intersection), since
``how='inner'`` by default.

.. ipython:: python
@@ -759,13 +759,13 @@ Checking for duplicate keys

.. versionadded:: 0.21.0

-Users can use the ``validate`` argument to automatically check whether there
-are unexpected duplicates in their merge keys. Key uniqueness is checked before
-merge operations and so should protect against memory overflows. Checking key
-uniqueness is also a good way to ensure user data structures are as expected.
+Users can use the ``validate`` argument to automatically check whether there
+are unexpected duplicates in their merge keys. Key uniqueness is checked before
+merge operations and so should protect against memory overflows. Checking key
+uniqueness is also a good way to ensure user data structures are as expected.

-In the following example, there are duplicate values of ``B`` in the right
-``DataFrame``. As this is not a one-to-one merge -- as specified in the
+In the following example, there are duplicate values of ``B`` in the right
+``DataFrame``. As this is not a one-to-one merge -- as specified in the
``validate`` argument -- an exception will be raised.


@@ -778,11 +778,11 @@ In the following example, there are duplicate values of ``B`` in the right
In [53]: result = pd.merge(left, right, on='B', how='outer', validate="one_to_one")
...
-MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
+MergeError: Merge keys are not unique in right dataset; not a one-to-one merge

-If the user is aware of the duplicates in the right ``DataFrame`` but wants to
-ensure there are no duplicates in the left DataFrame, one can use the
-``validate='one_to_many'`` argument instead, which will not raise an exception.
+If the user is aware of the duplicates in the right ``DataFrame`` but wants to
+ensure there are no duplicates in the left DataFrame, one can use the
+``validate='one_to_many'`` argument instead, which will not raise an exception.

.. ipython:: python
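The two ``validate`` modes described in this hunk can be sketched as below. The frames are illustrative assumptions (the documentation's own ``left`` and ``right`` sit in the collapsed part of the diff); the only requirement is that ``B`` is unique on the left and duplicated on the right.

```python
import pandas as pd
from pandas.errors import MergeError

# Assumed example data: B is unique in `left` but repeated in `right`.
left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

# one_to_one requires unique keys on both sides, so this merge is rejected.
try:
    pd.merge(left, right, on="B", how="outer", validate="one_to_one")
    raised = False
except MergeError:
    raised = True

# one_to_many only requires unique keys on the left, so this merge succeeds.
result = pd.merge(left, right, on="B", how="outer", validate="one_to_many")
```

Because the check runs before the join itself, an unexpected duplicate key is reported without first building a large many-to-many result.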
@@ -794,8 +794,8 @@ ensure there are no duplicates in the left DataFrame, one can use the
The merge indicator
~~~~~~~~~~~~~~~~~~~

-:func:`~pandas.merge` accepts the argument ``indicator``. If ``True``, a
-Categorical-type column called ``_merge`` will be added to the output object
+:func:`~pandas.merge` accepts the argument ``indicator``. If ``True``, a
+Categorical-type column called ``_merge`` will be added to the output object
that takes on values:

=================================== ================
@@ -903,7 +903,7 @@ Joining on index
~~~~~~~~~~~~~~~~

:meth:`DataFrame.join` is a convenient method for combining the columns of two
-potentially differently-indexed ``DataFrames`` into a single result
+potentially differently-indexed ``DataFrames`` into a single result
``DataFrame``. Here is a very basic example:

.. ipython:: python
@@ -983,9 +983,9 @@ indexes:
Joining key columns on an index
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-:meth:`~DataFrame.join` takes an optional ``on`` argument which may be a column
+:meth:`~DataFrame.join` takes an optional ``on`` argument which may be a column
or multiple column names, which specifies that the passed ``DataFrame`` is to be
-aligned on that column in the ``DataFrame``. These two function calls are
+aligned on that column in the ``DataFrame``. These two function calls are
completely equivalent:

::
@@ -995,7 +995,7 @@ completely equivalent:
how='left', sort=False)

Obviously you can choose whichever form you find more convenient. For
-many-to-one joins (where one of the ``DataFrame``'s is already indexed by the
+many-to-one joins (where one of the ``DataFrame``'s is already indexed by the
join key), using ``join`` may be more convenient. Here is a simple example:

.. ipython:: python
@@ -1133,17 +1133,42 @@ This is equivalent but less verbose and more memory efficient / faster than this
Joining with two MultiIndexes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-This is not implemented via ``join`` at-the-moment, however it can be done using
-the following code.
+This is supported in a limited way, provided that the index for the right
+argument is completely used in the join, and is a subset of the indices in
+the left argument, as in this example:

 .. ipython:: python

-   index = pd.MultiIndex.from_tuples([('K0', 'X0'), ('K0', 'X1'),
-                                      ('K1', 'X2')],
-                                     names=['key', 'X'])
+   leftindex = pd.MultiIndex.from_product([list('abc'), list('xy'), [1, 2]],
+                                          names=['abc', 'xy', 'num'])
+   left = pd.DataFrame({'v1' : range(12)}, index=leftindex)
+   left
+
+   rightindex = pd.MultiIndex.from_product([list('abc'), list('xy')],
+                                           names=['abc', 'xy'])
+   right = pd.DataFrame({'v2': [100*i for i in range(1, 7)]}, index=rightindex)
+   right
+
+   left.join(right, on=['abc', 'xy'], how='inner')
+
+If that condition is not satisfied, a join with two multi-indexes can be
+done using the following code.
+
+.. ipython:: python
+
+   leftindex = pd.MultiIndex.from_tuples([('K0', 'X0'), ('K0', 'X1'),
+                                          ('K1', 'X2')],
+                                         names=['key', 'X'])
    left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                         'B': ['B0', 'B1', 'B2']},
-                       index=index)
+                       index=leftindex)

    rightindex = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'),
                                            ('K2', 'Y2'), ('K2', 'Y3')],
                                           names=['key', 'Y'])
    right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']},
                         index=rightindex)

    result = pd.merge(left.reset_index(), right.reset_index(),
                      on=['key'], how='inner').set_index(['key','X','Y'])
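The ``reset_index`` / ``merge`` / ``set_index`` pattern in this hunk can also be applied to the ``from_product`` frames introduced earlier in it. The sketch below mirrors how the test added by this commit builds its ``expected`` frame, with ``how='inner'`` chosen here as an assumption for the example.

```python
import pandas as pd

# Left frame indexed by three levels; right frame by the first two of them.
leftindex = pd.MultiIndex.from_product(
    [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
)
left = pd.DataFrame({"v1": range(12)}, index=leftindex)

rightindex = pd.MultiIndex.from_product(
    [list("abc"), list("xy")], names=["abc", "xy"]
)
right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)

# Move the index levels to columns, merge on the shared names, then restore
# the full three-level index so the extra "num" level is kept.
merged = (
    left.reset_index()
    .merge(right.reset_index(), on=["abc", "xy"], how="inner")
    .set_index(["abc", "xy", "num"])
)
```

Every ``(abc, xy)`` pair on the left matches exactly one row on the right, so the merged frame keeps all twelve left rows and gains the ``v2`` column.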
@@ -1161,7 +1186,7 @@ the following code.
Merging on a combination of columns and index levels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. versionadded:: 0.22
+.. versionadded:: 0.23

Strings passed as the ``on``, ``left_on``, and ``right_on`` parameters
may refer to either column names or index level names. This enables merging
@@ -1200,6 +1225,12 @@ resetting indexes.
frames, the index level is preserved as an index level in the resulting
DataFrame.

+.. note::
+   When DataFrames are merged using only some of the levels of a `MultiIndex`,
+   the extra levels will be dropped from the resulting merge. In order to
+   preserve those levels, use ``reset_index`` on those level names to move
+   those levels to columns prior to doing the merge.

.. note::

If a string matches both a column name and an index level name, then a
@@ -1262,7 +1293,7 @@ similarly.
Joining multiple DataFrame or Panel objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-A list or tuple of ``DataFrames`` can also be passed to :meth:`~DataFrame.join`
+A list or tuple of ``DataFrames`` can also be passed to :meth:`~DataFrame.join`
to join them together on their indexes.

.. ipython:: python
@@ -1284,7 +1315,7 @@ Merging together values within Series or DataFrame columns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Another fairly common situation is to have two like-indexed (or similarly
-indexed) ``Series`` or ``DataFrame`` objects and wanting to "patch" values in
+indexed) ``Series`` or ``DataFrame`` objects and wanting to "patch" values in
one object from values for matching indices in the other. Here is an example:

.. ipython:: python
@@ -1309,7 +1340,7 @@ For this, use the :meth:`~DataFrame.combine_first` method:
plt.close('all');
Note that this method only takes values from the right ``DataFrame`` if they are
-missing in the left ``DataFrame``. A related method, :meth:`~DataFrame.update`,
+missing in the left ``DataFrame``. A related method, :meth:`~DataFrame.update`,
alters non-NA values in place:

.. ipython:: python
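To make the contrast between ``combine_first`` and ``update`` concrete, here is a small sketch. The values are invented for illustration and are not taken from the collapsed part of the page.

```python
import numpy as np
import pandas as pd

# Assumed example frames: df2 overlaps df1 on rows 1 and 2.
df1 = pd.DataFrame([[np.nan, 3.0, 5.0],
                    [-4.6, np.nan, np.nan],
                    [np.nan, 7.0, np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2],
                    [-5.0, 1.6, 4.0]], index=[1, 2])

# combine_first keeps df1's values and only fills its missing cells from df2.
patched = df1.combine_first(df2)

# update works in place and overwrites df1's cells wherever df2 is non-NA.
updated = df1.copy()
updated.update(df2)
```

In ``patched`` the existing ``-4.6`` survives, while in ``updated`` it is replaced by ``-42.6``; cells where ``df2`` is NA are left alone in both cases.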
@@ -1361,15 +1392,15 @@ Merging AsOf

.. versionadded:: 0.19.0

-A :func:`merge_asof` is similar to an ordered left-join except that we match on
-nearest key rather than equal keys. For each row in the ``left`` ``DataFrame``,
-we select the last row in the ``right`` ``DataFrame`` whose ``on`` key is less
+A :func:`merge_asof` is similar to an ordered left-join except that we match on
+nearest key rather than equal keys. For each row in the ``left`` ``DataFrame``,
+we select the last row in the ``right`` ``DataFrame`` whose ``on`` key is less
than the left's key. Both DataFrames must be sorted by the key.

-Optionally an asof merge can perform a group-wise merge. This matches the
+Optionally an asof merge can perform a group-wise merge. This matches the
``by`` key equally, in addition to the nearest match on the ``on`` key.

-For example; we might have ``trades`` and ``quotes`` and we want to ``asof``
+For example; we might have ``trades`` and ``quotes`` and we want to ``asof``
merge them.

.. ipython:: python
@@ -1428,8 +1459,8 @@ We only asof within ``2ms`` between the quote time and the trade time.
by='ticker',
tolerance=pd.Timedelta('2ms'))
-We only asof within ``10ms`` between the quote time and the trade time and we
-exclude exact matches on time. Note that though we exclude the exact matches
+We only asof within ``10ms`` between the quote time and the trade time and we
+exclude exact matches on time. Note that though we exclude the exact matches
(of the quotes), prior quotes **do** propagate to that point in time.

.. ipython:: python
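The effect of ``tolerance`` together with ``allow_exact_matches=False`` can be sketched on a tiny, made-up pair of ``trades`` and ``quotes`` frames. These are much smaller than the ones on the documentation page and are assumptions for the example only.

```python
import pandas as pd

# Assumed quotes for a single ticker, already sorted by time.
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2016-05-25 13:30:00.023",
                            "2016-05-25 13:30:00.030",
                            "2016-05-25 13:30:00.041",
                            "2016-05-25 13:30:00.048"]),
    "ticker": ["MSFT"] * 4,
    "bid": [51.95, 51.97, 51.98, 51.99],
})

# Assumed trades; the first and last share a timestamp with a quote.
trades = pd.DataFrame({
    "time": pd.to_datetime(["2016-05-25 13:30:00.023",
                            "2016-05-25 13:30:00.038",
                            "2016-05-25 13:30:00.048"]),
    "ticker": ["MSFT"] * 3,
    "price": [51.95, 51.95, 51.98],
})

# Default behaviour: backward search, and exact time matches are allowed.
default = pd.merge_asof(trades, quotes, on="time", by="ticker")

# Only look back 10ms and skip quotes with exactly the trade's timestamp.
strict = pd.merge_asof(trades, quotes, on="time", by="ticker",
                       tolerance=pd.Timedelta("10ms"),
                       allow_exact_matches=False)
```

In ``strict`` the first trade has no earlier quote and gets ``NaN``, while the last trade skips the simultaneous quote and picks up the one from 7ms before, which is the propagation of prior quotes that the text describes.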
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.rst
@@ -1545,6 +1545,7 @@ Reshaping
- Bug in :meth:`DataFrame.append` with a :class:`Series` with a dateutil timezone would raise a ``TypeError`` (:issue:`23682`)
- Bug in ``Series`` construction when passing no data and ``dtype=str`` (:issue:`22477`)
- Bug in :func:`cut` with ``bins`` as an overlapping ``IntervalIndex`` where multiple bins were returned per item instead of raising a ``ValueError`` (:issue:`23980`)
+- Bug in :meth:`DataFrame.join` when joining on partial MultiIndex would drop names (:issue:`20452`).

.. _whatsnew_0240.bug_fixes.sparse:

1 change: 1 addition & 0 deletions pandas/core/reshape/merge.py
@@ -715,6 +715,7 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
     result[name] = key_col
 elif result._is_level_reference(name):
     if isinstance(result.index, MultiIndex):
+        key_col.name = name
         idx_list = [result.index.get_level_values(level_name)
                     if level_name != name else key_col
                     for level_name in result.index.names]
25 changes: 25 additions & 0 deletions pandas/tests/reshape/merge/test_join.py
@@ -730,6 +730,31 @@ def test_panel_join_many(self):
pytest.raises(ValueError, panels[0].join, panels[1:],
how='right')

+    def test_join_multi_to_multi(self, join_type):
+        # GH 20475
+        leftindex = MultiIndex.from_product([list('abc'), list('xy'), [1, 2]],
+                                            names=['abc', 'xy', 'num'])
+        left = DataFrame({'v1': range(12)}, index=leftindex)
+
+        rightindex = MultiIndex.from_product([list('abc'), list('xy')],
+                                             names=['abc', 'xy'])
+        right = DataFrame({'v2': [100 * i for i in range(1, 7)]},
+                          index=rightindex)
+
+        result = left.join(right, on=['abc', 'xy'], how=join_type)
+        expected = (left.reset_index()
+                        .merge(right.reset_index(),
+                               on=['abc', 'xy'], how=join_type)
+                        .set_index(['abc', 'xy', 'num'])
+                    )
+        assert_frame_equal(expected, result)
+
+        with pytest.raises(ValueError):
+            left.join(right, on='xy', how=join_type)
+
+        with pytest.raises(ValueError):
+            right.join(left, on=['abc', 'xy'], how=join_type)


def _check_join(left, right, result, join_col, how='left',
lsuffix='_x', rsuffix='_y'):