Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Deprecate regex=True default in Series.str.replace #36695

Merged
merged 45 commits into from
Oct 10, 2020
Merged
Show file tree
Hide file tree
Changes from 39 commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
52f6d78
API: Deprecate regex=True default in Series.str.replace
dsaxton Sep 28, 2020
3b9bcb2
Format
dsaxton Sep 28, 2020
c51e837
Stacklevel
dsaxton Sep 28, 2020
3761625
Test
dsaxton Sep 28, 2020
7583b7f
Test name
dsaxton Sep 28, 2020
8e0d1a5
Link discussion
dsaxton Sep 28, 2020
79e4400
Set regex
dsaxton Sep 28, 2020
ecc5786
Update
dsaxton Sep 28, 2020
cf659b0
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Sep 28, 2020
628cfa3
Update
dsaxton Sep 28, 2020
f216022
Fix
dsaxton Sep 28, 2020
e6c79a3
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Sep 28, 2020
e6b3e0f
Check for str
dsaxton Sep 28, 2020
e3ce080
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Sep 28, 2020
752d322
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Sep 29, 2020
ee67284
Doc updates
dsaxton Sep 29, 2020
c832a21
Nit
dsaxton Sep 29, 2020
09b07e6
Silence some warnings
dsaxton Sep 29, 2020
c8541da
Edit
dsaxton Sep 29, 2020
bd31857
Edit
dsaxton Sep 29, 2020
377a2ba
Question mark
dsaxton Sep 29, 2020
d6b4fbe
Update doc
dsaxton Sep 29, 2020
84dfe71
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Sep 30, 2020
5cfbc04
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Sep 30, 2020
1b53051
Add back
dsaxton Sep 30, 2020
1931c31
Oops
dsaxton Sep 30, 2020
20bfe16
Note
dsaxton Sep 30, 2020
6ff5955
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Oct 1, 2020
cea44a7
Fix
dsaxton Oct 1, 2020
4376bca
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Oct 2, 2020
482d5c3
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Oct 3, 2020
cd18347
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Oct 7, 2020
f872011
Update warning
dsaxton Oct 7, 2020
c0a473e
rst lint
dsaxton Oct 7, 2020
727986e
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Oct 8, 2020
f49f778
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Oct 8, 2020
d3d155a
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Oct 9, 2020
89013db
Nit + emphasis
dsaxton Oct 9, 2020
e78017c
regex=False
dsaxton Oct 9, 2020
0150b2b
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Oct 10, 2020
e396ce5
Update
dsaxton Oct 10, 2020
e799b12
fix
dsaxton Oct 10, 2020
9f7545f
rst fix
dsaxton Oct 10, 2020
6be90a4
Merge remote-tracking branch 'upstream/master' into replace-regex-def…
dsaxton Oct 10, 2020
8a4a833
Make a warning
dsaxton Oct 10, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 24 additions & 14 deletions doc/source/user_guide/text.rst
Original file line number Diff line number Diff line change
Expand Up @@ -255,7 +255,7 @@ i.e., from the end of the string to the beginning of the string:

s2.str.rsplit("_", expand=True, n=1)

``replace`` by default replaces `regular expressions
``replace`` optionally uses `regular expressions
<https://docs.python.org/3/library/re.html>`__:

.. ipython:: python
Expand All @@ -265,25 +265,30 @@ i.e., from the end of the string to the beginning of the string:
dtype="string",
)
s3
s3.str.replace("^.a|dog", "XX-XX ", case=False)
s3.str.replace("^.a|dog", "XX-XX ", case=False, regex=True)

Some caution must be taken to keep regular expressions in mind! For example, the
following code will cause trouble because of the regular expression meaning of
``$``:
Some caution must be taken when dealing with regular expressions! The current behavior
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can make this a note / warning

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add a versionchanged tag

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a versionchanged tag. Do we also need one in the docstring?

I'm not sure what's meant by note / warning since I've added a whatsnew note and warning. Do you mean put the text itself in the whatsnew?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no i mean a ..note:: a sphinx-note (which puts a highlite box around this)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could also do a ..warning:: which is a red-box and more prominent, but either way want to call out this

Copy link
Member Author

@dsaxton dsaxton Oct 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this a .. warning
image

is to treat single character patterns as literal strings, even when ``regex`` is set
to ``True``. (This behavior is deprecated and will be removed in a future version so
that the ``regex`` keyword is always respected.) For example, the following code will
cause trouble because of the regular expression meaning of ``$``:

.. ipython:: python

# Consider the following badly formatted financial data
dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")

# This does what you'd naively expect:
dollars.str.replace("$", "")
# Here $ is treated as a literal character
dollars.str.replace("$", "", regex=True)
dsaxton marked this conversation as resolved.
Show resolved Hide resolved

# But this doesn't:
dollars.str.replace("-$", "-")
# But here it is not
dollars.str.replace("-$", "-", regex=True)

# We need to escape the special character (for >1 len patterns)
dollars.str.replace(r"-\$", "-")
dollars.str.replace(r"-\$", "-", regex=True)

# Or set regex equal to False
dollars.str.replace("-$", "-", regex=False)

If you do want literal replacement of a string (equivalent to
:meth:`str.replace`), you can set the optional ``regex`` parameter to
Expand All @@ -293,7 +298,7 @@ and ``repl`` must be strings:
.. ipython:: python

# These lines are equivalent
dollars.str.replace(r"-\$", "-")
dollars.str.replace(r"-\$", "-", regex=True)
jreback marked this conversation as resolved.
Show resolved Hide resolved
dollars.str.replace("-$", "-", regex=False)

The ``replace`` method can also take a callable as replacement. It is called
Expand All @@ -310,7 +315,10 @@ positional argument (a regex object) and return a string.
return m.group(0)[::-1]


pd.Series(["foo 123", "bar baz", np.nan], dtype="string").str.replace(pat, repl)
pd.Series(
["foo 123", "bar baz", np.nan],
dtype="string"
).str.replace(pat, repl, regex=True)

# Using regex groups
pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
Expand All @@ -320,7 +328,9 @@ positional argument (a regex object) and return a string.
return m.group("two").swapcase()


pd.Series(["Foo Bar Baz", np.nan], dtype="string").str.replace(pat, repl)
pd.Series(["Foo Bar Baz", np.nan], dtype="string").str.replace(
pat, repl, regex=True
)

The ``replace`` method also accepts a compiled regular expression object
from :func:`re.compile` as a pattern. All flags should be included in the
Expand All @@ -331,7 +341,7 @@ compiled regular expression object.
import re

regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)
s3.str.replace(regex_pat, "XX-XX ")
s3.str.replace(regex_pat, "XX-XX ", regex=True)

Including a ``flags`` argument when calling ``replace`` with a compiled
regular expression object will raise a ``ValueError``.
Expand Down
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -287,6 +287,7 @@ Deprecations
- Deprecated indexing :class:`DataFrame` rows with datetime-like strings ``df[string]``, use ``df.loc[string]`` instead (:issue:`36179`)
- Deprecated casting an object-dtype index of ``datetime`` objects to :class:`DatetimeIndex` in the :class:`Series` constructor (:issue:`23598`)
- Deprecated :meth:`Index.is_all_dates` (:issue:`27744`)
- The default value of ``regex`` for :meth:`Series.str.replace` will change from ``True`` to ``False`` in a future release. In addition, single character regular expressions will *not* be treated as literal strings when ``regex=True`` is set. (:issue:`24804`)
- Deprecated automatic alignment on comparison operations between :class:`DataFrame` and :class:`Series`, do ``frame, ser = frame.align(ser, axis=1, copy=False)`` before e.g. ``frame == ser`` (:issue:`28759`)
- :meth:`Rolling.count` with ``min_periods=None`` will default to the size of the window in a future version (:issue:`31302`)
- Deprecated slice-indexing on timezone-aware :class:`DatetimeIndex` with naive ``datetime`` objects, to match scalar indexing behavior (:issue:`36148`)
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/reshape/melt.py
Original file line number Diff line number Diff line change
Expand Up @@ -483,7 +483,7 @@ def melt_stub(df, stub: str, i, j, value_vars, sep: str):
var_name=j,
)
newdf[j] = Categorical(newdf[j])
newdf[j] = newdf[j].str.replace(re.escape(stub + sep), "")
newdf[j] = newdf[j].str.replace(re.escape(stub + sep), "", regex=True)

# GH17627 Cast numerics suffixes to int/float
newdf[j] = to_numeric(newdf[j], errors="ignore")
Expand Down
16 changes: 15 additions & 1 deletion pandas/core/strings/accessor.py
Original file line number Diff line number Diff line change
Expand Up @@ -1170,7 +1170,7 @@ def fullmatch(self, pat, case=True, flags=0, na=None):
return self._wrap_result(result, fill_value=na, returns_string=False)

@forbid_nonstring_types(["bytes"])
def replace(self, pat, repl, n=-1, case=None, flags=0, regex=True):
def replace(self, pat, repl, n=-1, case=None, flags=0, regex=None):
r"""
Replace each occurrence of pattern/regex in the Series/Index.

Expand Down Expand Up @@ -1288,6 +1288,20 @@ def replace(self, pat, repl, n=-1, case=None, flags=0, regex=True):
2 NaN
dtype: object
"""
if regex is None:
if isinstance(pat, str) and any(c in pat for c in ".+*|^$?[](){}\\"):
# warn only in cases where regex behavior would differ from literal
msg = (
"The default value of regex will change from True to False "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an you make sure that we are testing both of these warnings (as i see you only added a single test)

"in a future version."
)
if len(pat) == 1:
msg += (
" In addition, single character regular expressions will"
"*not* be treated as literal strings when regex=True."
)
warnings.warn(msg, FutureWarning, stacklevel=3)
regex = True
result = self._array._str_replace(
pat, repl, n=n, case=case, flags=flags, regex=regex
)
Expand Down
6 changes: 6 additions & 0 deletions pandas/tests/series/methods/test_replace.py
Original file line number Diff line number Diff line change
Expand Up @@ -449,3 +449,9 @@ def test_replace_with_compiled_regex(self):
result = s.replace({regex: "z"}, regex=True)
expected = pd.Series(["z", "b", "c"])
tm.assert_series_equal(result, expected)

def test_str_replace_regex_default_raises_warning(self):
# https://github.com/pandas-dev/pandas/pull/24809
s = pd.Series(["a", "b", "c"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you check the messages on this warning

with tm.assert_produces_warning(FutureWarning, check_stacklevel=False):
s.str.replace("^.$", "")
26 changes: 14 additions & 12 deletions pandas/tests/test_strings.py
Original file line number Diff line number Diff line change
Expand Up @@ -984,11 +984,11 @@ def test_casemethods(self):
def test_replace(self):
values = Series(["fooBAD__barBAD", np.nan])

result = values.str.replace("BAD[_]*", "")
result = values.str.replace("BAD[_]*", "", regex=True)
exp = Series(["foobar", np.nan])
tm.assert_series_equal(result, exp)

result = values.str.replace("BAD[_]*", "", n=1)
result = values.str.replace("BAD[_]*", "", n=1, regex=True)
exp = Series(["foobarBAD", np.nan])
tm.assert_series_equal(result, exp)

Expand All @@ -997,15 +997,17 @@ def test_replace(self):
["aBAD", np.nan, "bBAD", True, datetime.today(), "fooBAD", None, 1, 2.0]
)

rs = Series(mixed).str.replace("BAD[_]*", "")
rs = Series(mixed).str.replace("BAD[_]*", "", regex=True)
xp = Series(["a", np.nan, "b", np.nan, np.nan, "foo", np.nan, np.nan, np.nan])
assert isinstance(rs, Series)
tm.assert_almost_equal(rs, xp)

# flags + unicode
values = Series([b"abcd,\xc3\xa0".decode("utf-8")])
exp = Series([b"abcd, \xc3\xa0".decode("utf-8")])
result = values.str.replace(r"(?<=\w),(?=\w)", ", ", flags=re.UNICODE)
result = values.str.replace(
r"(?<=\w),(?=\w)", ", ", flags=re.UNICODE, regex=True
)
tm.assert_series_equal(result, exp)

# GH 13438
Expand All @@ -1023,7 +1025,7 @@ def test_replace_callable(self):

# test with callable
repl = lambda m: m.group(0).swapcase()
result = values.str.replace("[a-z][A-Z]{2}", repl, n=2)
result = values.str.replace("[a-z][A-Z]{2}", repl, n=2, regex=True)
exp = Series(["foObaD__baRbaD", np.nan])
tm.assert_series_equal(result, exp)

Expand All @@ -1049,7 +1051,7 @@ def test_replace_callable(self):
values = Series(["Foo Bar Baz", np.nan])
pat = r"(?P<first>\w+) (?P<middle>\w+) (?P<last>\w+)"
repl = lambda m: m.group("middle").swapcase()
result = values.str.replace(pat, repl)
result = values.str.replace(pat, repl, regex=True)
exp = Series(["bAR", np.nan])
tm.assert_series_equal(result, exp)

Expand All @@ -1059,11 +1061,11 @@ def test_replace_compiled_regex(self):

# test with compiled regex
pat = re.compile(r"BAD[_]*")
result = values.str.replace(pat, "")
result = values.str.replace(pat, "", regex=True)
exp = Series(["foobar", np.nan])
tm.assert_series_equal(result, exp)

result = values.str.replace(pat, "", n=1)
result = values.str.replace(pat, "", n=1, regex=True)
exp = Series(["foobarBAD", np.nan])
tm.assert_series_equal(result, exp)

Expand All @@ -1072,7 +1074,7 @@ def test_replace_compiled_regex(self):
["aBAD", np.nan, "bBAD", True, datetime.today(), "fooBAD", None, 1, 2.0]
)

rs = Series(mixed).str.replace(pat, "")
rs = Series(mixed).str.replace(pat, "", regex=True)
xp = Series(["a", np.nan, "b", np.nan, np.nan, "foo", np.nan, np.nan, np.nan])
assert isinstance(rs, Series)
tm.assert_almost_equal(rs, xp)
Expand Down Expand Up @@ -1110,7 +1112,7 @@ def test_replace_literal(self):
# GH16808 literal replace (regex=False vs regex=True)
values = Series(["f.o", "foo", np.nan])
exp = Series(["bao", "bao", np.nan])
result = values.str.replace("f.", "ba")
result = values.str.replace("f.", "ba", regex=True)
tm.assert_series_equal(result, exp)

exp = Series(["bao", "foo", np.nan])
Expand Down Expand Up @@ -3044,7 +3046,7 @@ def test_pipe_failures(self):

tm.assert_series_equal(result, exp)

result = s.str.replace("|", " ")
result = s.str.replace("|", " ", regex=False)
exp = Series(["A B C"])

tm.assert_series_equal(result, exp)
Expand Down Expand Up @@ -3345,7 +3347,7 @@ def test_replace_moar(self):
)
tm.assert_series_equal(result, expected)

result = s.str.replace("^.a|dog", "XX-XX ", case=False)
result = s.str.replace("^.a|dog", "XX-XX ", case=False, regex=True)
expected = Series(
[
"A",
Expand Down