ENH: str.extractall for several matches
Author: Toby Dylan Hocking <tdhock5@gmail.com>

Closes pandas-dev#11386 from tdhock/extractall and squashes the following commits:

0c1c3d1 [Toby Dylan Hocking] ENH: extract(expand), extractall
tdhock authored and jreback committed Feb 9, 2016
1 parent 517c559 commit 67730dd
Showing 6 changed files with 913 additions and 99 deletions.
1 change: 1 addition & 0 deletions doc/source/api.rst
@@ -526,6 +526,7 @@ strings and apply several methods to it. These can be accessed like
Series.str.encode
Series.str.endswith
Series.str.extract
Series.str.extractall
Series.str.find
Series.str.findall
Series.str.get
144 changes: 127 additions & 17 deletions doc/source/text.rst
@@ -168,28 +168,37 @@ Extracting Substrings

.. _text.extract:

The method ``extract`` (introduced in version 0.13) accepts `regular expressions
<https://docs.python.org/2/library/re.html>`__ with match groups. Extracting a
regular expression with one group returns a Series of strings.
Extract first match in each subject (extract)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. ipython:: python
.. versionadded:: 0.13.0

.. warning::

   In version 0.18.0, ``extract`` gained the ``expand`` argument. When
   ``expand=False`` it returns a ``Series``, ``Index``, or
   ``DataFrame``, depending on the subject and regular expression
   pattern (same behavior as pre-0.18.0). When ``expand=True`` it
   always returns a ``DataFrame``, which is more consistent and less
   confusing from the perspective of a user.

pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
The ``extract`` method accepts a `regular expression
<https://docs.python.org/2/library/re.html>`__ with at least one
capture group.

Elements that do not match return ``NaN``. Extracting a regular expression
with more than one group returns a DataFrame with one column per group.
Extracting a regular expression with more than one group returns a
DataFrame with one column per group.

.. ipython:: python

   pd.Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')

Elements that do not match return a row filled with ``NaN``.
Thus, a Series of messy strings can be "converted" into a
like-indexed Series or DataFrame of cleaned-up or more useful strings,
without necessitating ``get()`` to access tuples or ``re.match`` objects.

The dtype of the result is always object, even if no match is found and the
result only contains ``NaN``.
Elements that do not match return a row filled with ``NaN``. Thus, a
Series of messy strings can be "converted" into a like-indexed Series
or DataFrame of cleaned-up or more useful strings, without
necessitating ``get()`` to access tuples or ``re.match`` objects. The
dtype of the result is always object, even if no match is found and
the result only contains ``NaN``.

Named groups like

@@ -201,9 +210,109 @@ and optional groups like

.. ipython:: python

   pd.Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)')
   pd.Series(['a1', 'b2', '3']).str.extract('([ab])?(\d)')

can also be used. Note that any capture group names in the regular
expression will be used for column names; otherwise capture group
numbers will be used.
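
The column-naming rule can be sketched with the standard ``re`` module alone. This is a plain-Python illustration, not the pandas implementation: the helper name ``group_column_names`` is hypothetical, and the zero-based numbering of unnamed groups is an assumption based on the column labels that pandas displays.

```python
import re


def group_column_names(pattern):
    """Return one label per capture group: its name, or its zero-based position."""
    regex = re.compile(pattern)
    # groupindex maps group name -> one-based group number; invert it.
    names = {number: name for name, number in regex.groupindex.items()}
    # Assumption: unnamed groups are labeled by zero-based position.
    return [names.get(i + 1, i) for i in range(regex.groups)]


named = group_column_names(r"(?P<letter>[ab])?(?P<digit>\d)")
unnamed = group_column_names(r"([ab])?(\d)")
mixed = group_column_names(r"(?P<letter>[a-zA-Z])([0-9]+)")
```

Under these assumptions, a pattern that mixes named and unnamed groups yields a mix of string and integer labels.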

Extracting a regular expression with one group returns a ``DataFrame``
with one column if ``expand=True``.

.. ipython:: python

   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)

It returns a Series if ``expand=False``.

.. ipython:: python

   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)

Calling on an ``Index`` with a regex with exactly one capture group
returns a ``DataFrame`` with one column if ``expand=True``,

.. ipython:: python

   s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
   s
   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)

It returns an ``Index`` if ``expand=False``.

.. ipython:: python

   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)

Calling on an ``Index`` with a regex with more than one capture group
returns a ``DataFrame`` if ``expand=True``.

.. ipython:: python

   s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)

It raises ``ValueError`` if ``expand=False``.

.. code-block:: python

    >>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
    ValueError: only one regex group is supported with Index

The table below summarizes the behavior of ``extract(expand=False)``
(input subject in first column, number of groups in regex in
first row).

+--------+---------+------------+
|        | 1 group | >1 group   |
+--------+---------+------------+
| Index  | Index   | ValueError |
+--------+---------+------------+
| Series | Series  | DataFrame  |
+--------+---------+------------+
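
The table, together with the ``expand=True`` rule, can be restated as a small decision function. This is only a sketch of the documented behavior: ``extract_return_type`` is a hypothetical helper that returns type names as strings, and the message for a pattern without groups is illustrative rather than the text pandas emits.

```python
def extract_return_type(subject, n_groups, expand):
    """Name the container that extract returns.

    subject  -- "Series" or "Index"
    n_groups -- number of capture groups in the pattern
    expand   -- value of the expand argument
    """
    if n_groups < 1:
        # extract requires at least one capture group (message is illustrative).
        raise ValueError("pattern contains no capture groups")
    if expand:
        # expand=True always yields a DataFrame.
        return "DataFrame"
    if n_groups == 1:
        # One group: the result has the same kind as the subject.
        return subject
    if subject == "Index":
        raise ValueError("only one regex group is supported with Index")
    return "DataFrame"


series_one = extract_return_type("Series", 1, expand=False)
series_two = extract_return_type("Series", 2, expand=False)
index_one = extract_return_type("Index", 1, expand=False)
index_two_expanded = extract_return_type("Index", 2, expand=True)
```

The only combination that fails for a valid pattern is an ``Index`` subject with more than one group and ``expand=False``.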

Extract all matches in each subject (extractall)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. _text.extractall:

Unlike ``extract`` (which returns only the first match),

.. ipython:: python

   s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
   s
   s.str.extract("[ab](?P<digit>\d)")

.. versionadded:: 0.18.0

the ``extractall`` method returns every match. The result of
``extractall`` is always a ``DataFrame`` with a ``MultiIndex`` on its
rows. The last level of the ``MultiIndex`` is named ``match`` and
indicates the order in the subject.
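
The row layout can be approximated in plain Python with ``re.finditer``. The sketch below is not the pandas implementation: ``extractall_rows`` is a hypothetical helper, and a dict keyed by ``(label, match)`` tuples stands in for the ``MultiIndex``.

```python
import re


def extractall_rows(subjects, pattern):
    """Map (subject label, match number) to the captured groups of that match."""
    regex = re.compile(pattern)
    rows = {}
    for label, text in subjects.items():
        # Number the matches within each subject, starting from 0.
        for match_number, match in enumerate(regex.finditer(text)):
            rows[(label, match_number)] = match.groups()
    return rows


rows = extractall_rows({"A": "a1a2", "B": "b1", "C": "c1"}, r"[ab](?P<digit>\d)")
```

In this sketch, subject ``"A"`` contributes two rows, ``"B"`` one, and ``"C"`` none, because ``"c1"`` does not match the pattern.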

.. ipython:: python

   s.str.extractall("[ab](?P<digit>\d)")

When each subject string in the Series has exactly one match,

.. ipython:: python

   s = pd.Series(['a3', 'b3', 'c2'])
   s
   two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'

then ``extractall(pat).xs(0, level='match')`` gives the same result as
``extract(pat)``.

.. ipython:: python

   extract_result = s.str.extract(two_groups)
   extract_result
   extractall_result = s.str.extractall(two_groups)
   extractall_result
   extractall_result.xs(0, level="match")

can also be used.
Testing for Strings that Match or Contain a Pattern
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -288,7 +397,8 @@ Method Summary
:meth:`~Series.str.endswith`,Equivalent to ``str.endswith(pat)`` for each element
:meth:`~Series.str.findall`,Compute list of all occurrences of pattern/regex for each string
:meth:`~Series.str.match`,"Call ``re.match`` on each element, returning matched groups as list"
:meth:`~Series.str.extract`,"Call ``re.match`` on each element, as ``match`` does, but return matched groups as strings for convenience."
:meth:`~Series.str.extract`,"Call ``re.search`` on each element, returning DataFrame with one row for each element and one column for each regex capture group"
:meth:`~Series.str.extractall`,"Call ``re.findall`` on each element, returning DataFrame with one row for each match and one column for each regex capture group"
:meth:`~Series.str.len`,Compute string lengths
:meth:`~Series.str.strip`,Equivalent to ``str.strip``
:meth:`~Series.str.rstrip`,Equivalent to ``str.rstrip``
86 changes: 86 additions & 0 deletions doc/source/whatsnew/v0.18.0.txt
@@ -137,6 +137,92 @@ New Behavior:
s.index
s.index.nbytes

.. _whatsnew_0180.enhancements.extract:

Changes to str.extract
^^^^^^^^^^^^^^^^^^^^^^

The :ref:`.str.extract <text.extract>` method takes a regular
expression with capture groups, finds the first match in each subject
string, and returns the contents of the capture groups
(:issue:`11386`). In v0.18.0, the ``expand`` argument was added to
``extract``. When ``expand=False`` it returns a ``Series``, ``Index``,
or ``DataFrame``, depending on the subject and regular expression
pattern (same behavior as pre-0.18.0). When ``expand=True`` it always
returns a ``DataFrame``, which is more consistent and less confusing
from the perspective of a user. Currently the default is
``expand=None`` which gives a ``FutureWarning`` and uses
``expand=False``. To avoid this warning, please explicitly specify
``expand``.

.. ipython:: python

   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')

Extracting a regular expression with one group returns a ``DataFrame``
with one column if ``expand=True``.

.. ipython:: python

   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=True)

It returns a Series if ``expand=False``.

.. ipython:: python

   pd.Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)', expand=False)

Calling on an ``Index`` with a regex with exactly one capture group
returns a ``DataFrame`` with one column if ``expand=True``,

.. ipython:: python

   s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])
   s
   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)

It returns an ``Index`` if ``expand=False``.

.. ipython:: python

   s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)

Calling on an ``Index`` with a regex with more than one capture group
returns a ``DataFrame`` if ``expand=True``.

.. ipython:: python

   s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)

It raises ``ValueError`` if ``expand=False``.

.. code-block:: python

    >>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
    ValueError: only one regex group is supported with Index

In summary, ``extract(expand=True)`` always returns a ``DataFrame``
with a row for every subject string, and a column for every capture
group.
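
The "one row per subject, one column per group" shape can be illustrated with ``re.search`` from the standard library. This is a sketch rather than the pandas implementation: ``extract_rows`` is a hypothetical helper, a list of tuples stands in for the ``DataFrame``, and ``None`` stands in for ``NaN``.

```python
import re


def extract_rows(subjects, pattern):
    """Return one tuple per subject holding the groups of its first match."""
    regex = re.compile(pattern)
    rows = []
    for text in subjects:
        match = regex.search(text)
        if match is None:
            # No match: keep the row, filled with placeholders.
            rows.append((None,) * regex.groups)
        else:
            rows.append(match.groups())
    return rows


two_columns = extract_rows(["a1", "b2", "c3"], r"([ab])(\d)")
optional_group = extract_rows(["a1", "b2", "3"], r"([ab])?(\d)")
```

A subject that does not match still produces a row, and an optional group that does not participate in a match leaves a placeholder in its column only.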

.. _whatsnew_0180.enhancements.extractall:

The :ref:`.str.extractall <text.extractall>` method was added
(:issue:`11386`). Unlike ``extract`` (which returns only the first
match),

.. ipython:: python

   s = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
   s
   s.str.extract("(?P<letter>[ab])(?P<digit>\d)")

the ``extractall`` method returns all matches.

.. ipython:: python

   s.str.extractall("(?P<letter>[ab])(?P<digit>\d)")

.. _whatsnew_0180.enhancements.rounding:

Datetimelike rounding