str.replace('.','') should replace every character? (fix) #24809
Hello @alarangeiras! Thanks for updating the PR.
Comment last updated on January 17, 2019 at 17:34 Hours UTC
Looks generally good. Can you add a whatsnew note as well?
pandas/tests/test_strings.py
Outdated
values = Series(['abc','123'])

result = values.str.replace('.', 'foo')
exp = Series(['foofoofoo', 'foofoofoo'])
Can you name this expected?
Sure, should I make another PR?
No, just add it as a commit and push to the same branch.
Looks like CI failed too. I haven't checked, but make sure tests pass locally before pushing.
Yes, I've seen that.
Actually, I know what the problem is.
The problem is the test test_pipe_failures. It was built to test a character replacement: pipe to whitespace.
But the pipe is a regex metacharacter too.
When I fixed the replace behavior, this test broke.
My proposal is to change this test to pass the regex=False parameter, like below:
def test_pipe_failures(self):
    # #2119
    s = Series(['A|B|C'])

    result = s.str.split('|')
    exp = Series([['A', 'B', 'C']])
    tm.assert_series_equal(result, exp)

    result = s.str.replace('|', ' ', regex=False)
    exp = Series(['A B C'])
    tm.assert_series_equal(result, exp)
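To illustrate why this test broke once single-character patterns went through the regex engine: in Python's re module a bare | is an alternation between two empty patterns, so it matches the empty string at every position rather than the pipe character. The sketch below uses only the standard library on a plain string; it is not the pandas implementation.

```python
import re

s = "A|B|C"

# As a regex, '|' means "empty OR empty": it matches the empty string at
# every position, so the replacement is inserted around every character.
as_regex = re.sub("|", " ", s)

# As a literal, '|' is just the pipe character.
as_literal = s.replace("|", " ")

# Escaping the pattern recovers the literal meaning inside the regex engine.
as_escaped = re.sub(re.escape("|"), " ", s)

print(repr(as_regex))    # ' A | B | C '
print(repr(as_literal))  # 'A B C'
print(repr(as_escaped))  # 'A B C'
```

Only the literal and escaped forms produce the 'A B C' that the test expects, which is why regex=False is needed there.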
What do you think about that?
pandas/tests/test_strings.py
Outdated
@@ -1008,6 +1008,13 @@ def test_replace(self):
             values = klass(data)
             pytest.raises(TypeError, values.str.replace, 'a', repl)

+    def test_replace_single_pattern(self):
+        values = Series(['abc','123'])
Can you add a comment for the issue (# GH 24804)?
Yes
pandas/core/strings.py
Outdated
@@ -564,7 +564,7 @@ def str_replace(arr, pat, repl, n=-1, case=None, flags=0, regex=True):
         # add case flag, if provided
         if case is False:
             flags |= re.IGNORECASE
-        if is_compiled_re or len(pat) > 1 or flags or callable(repl):
+        if is_compiled_re or len(pat) > 0 or flags or callable(repl):
Do we even need the len(pat) condition? Can it just be pat instead?
It can be just pat; the only issue is the case where pat is empty.
Wouldn't that be False in either case? If so, we shouldn't need the len expression.
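A quick check of the point being made here, as a sketch on plain Python strings: for a str pattern, truthiness and len(pat) > 0 always agree, including for the empty string. The names below are illustrative only.

```python
# Compare the two candidate conditions for a few string patterns.
patterns = ["", ".", "|", "ab"]

by_len = {pat: len(pat) > 0 for pat in patterns}
by_truthiness = {pat: bool(pat) for pat in patterns}

for pat in patterns:
    print(repr(pat), by_len[pat], by_truthiness[pat])
```

Both mappings come out identical, so for string patterns the shorter `pat` test selects the same branch as the `len(pat) > 0` test.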
- fixing test_pipe_failures (it's not a regex test, it's a char test)
Codecov Report
@@ Coverage Diff @@
## master #24809 +/- ##
===========================================
- Coverage 92.38% 42.92% -49.47%
===========================================
Files 166 166
Lines 52382 52382
===========================================
- Hits 48395 22485 -25910
- Misses 3987 29897 +25910
Codecov Report
@@ Coverage Diff @@
## master #24809 +/- ##
==========================================
- Coverage 92.38% 92.38% -0.01%
==========================================
Files 166 166
Lines 52382 52382
==========================================
- Hits 48395 48394 -1
- Misses 3987 3988 +1
Can you add a whatsnew entry? Think this is ok for 0.24 but cc @jreback
pandas/tests/test_strings.py
Outdated
values = Series(['abc', '123'])

result = values.str.replace('.', 'foo')
chars_replaced_expected = Series(['foofoofoo', 'foofoofoo'])
Can just call this expected
@@ -2924,7 +2932,7 @@ def test_pipe_failures(self):

         tm.assert_series_equal(result, exp)

-        result = s.str.replace('|', ' ')
+        result = s.str.replace('|', ' ', regex=False)
Hmm, OK. I think this is correct but arguably an API-breaking change, so make sure we make note of that in the whatsnew.
how is this an api change? regex=True is the default
Was just thinking of this particular instance and anywhere a user was passing in a single character that may have special meaning in a regex. This previously would directly replace a pipe but now requires regex=False in user code, so it could cause some breakage.
Being extra conservative, but I'm not tied to the request if you feel it's over-communicating.
Exactly, but if you think this documentation is not necessary, let me know and I can change it.
Hmm. This is a potentially disruptive change...
Can we:
- Change the default regex=None for .str.replace
- Detect when a length-1 character is a regex symbol
- Warn that it'll change in the future to interpret that character as a regex, not a literal
- Set regex=False for now to preserve the old (buggy) behavior?
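The four steps proposed above could look roughly like the sketch below. It operates on a single Python string rather than a Series, and the function name replace_with_deprecation, the REGEX_METACHARS set, and the warning text are all assumptions made for illustration; none of them is pandas API.

```python
import re
import warnings

# Characters with special meaning in a regular expression (assumed list).
REGEX_METACHARS = set(".^$*+?{}[]\\|()")


def replace_with_deprecation(text, pat, repl, regex=None):
    """Hypothetical stand-in for .str.replace on one Python string."""
    if regex is None:
        # The caller did not say whether pat is a regex or a literal.
        if len(pat) == 1 and pat in REGEX_METACHARS:
            warnings.warn(
                f"Interpreting {pat!r} as a literal, not a regex. "
                "The default will change in the future.",
                FutureWarning,
                stacklevel=2,
            )
            regex = False  # keep the old literal behavior for now
        else:
            regex = True
    if regex:
        return re.sub(pat, repl, text)
    return text.replace(pat, repl)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_result = replace_with_deprecation("a.c", ".", "b")
    explicit_regex = replace_with_deprecation("a.c", ".", "b", regex=True)
    explicit_literal = replace_with_deprecation("a.c", ".", "b", regex=False)

print(default_result, explicit_regex, explicit_literal, len(caught))
```

Only the call that relies on the default emits a FutureWarning; the two explicit calls stay silent, which is the reason for using None as the default.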
OK I agree - that's probably the best go-forward path, save the first point which I don't understand.
@alarangeiras can you raise a FutureWarning here instead?
> save the first point which I don't understand.
Changing the default regex=None? That's so we can detect if we need a warning or not.
If the user passes .str.replace('.', 'b', regex=True), we know to interpret the . as re.compile('.'), so the output would be 'bbb'.
If the user passes .str.replace('.', 'b', regex=False), we know that they want a literal ., so the output is 'abc'.
We'll use regex=None to see if the user is explicit or not.
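The two explicit cases can be reproduced with the standard library alone. This sketch assumes the input string is 'a.c', the value used in the examples elsewhere in this thread, and uses re.sub and str.replace as stand-ins for the regex and literal code paths.

```python
import re

text = "a.c"

# regex=True analogue: '.' matches any character, so all three are replaced.
regex_result = re.sub(".", "b", text)

# regex=False analogue: only the literal period is replaced.
literal_result = text.replace(".", "b")

print(regex_result)    # bbb
print(literal_result)  # abc
```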
Why not just warn when the length of the pattern is 1 and regex=True? Whether the user explicitly typed that or relied on the default argument, they'd hit the same bug at the end of the day. Don't see value in introducing a None value into a True/False field currently.
> Why not just warn when the length of the pattern is 1 and regex=True?
Then I think there would be no way to have
In [5]: pd.Series(['a.c']).str.replace('.', 'b')
# Warning: Interpreting '.' as a literal, not a regex... The default will change in the future.
Out[5]:
0    abc
dtype: object
# no warning
In [5]: pd.Series(['a.c']).str.replace('.', 'b', regex=True)
Out[5]:
0    bbb
dtype: object
unless I'm missing something.
Following this line of reasoning, from what I understand, every bug found should issue a warning of future adjustment?
- adding whatsnew entry and a note for API breaking change
I think now it's ok.
doc/source/whatsnew/v0.24.0.rst
Outdated
@@ -795,6 +795,9 @@ Now, the return type is consistently a :class:`DataFrame`.
 and a :class:`DataFrame` with sparse values. The memory usage will
 be the same as in the previous version of pandas.

+Be sure to perform a replace of literal strings by passing the
Can you move this to the section on breaking changes and show a before / after of the behavior?
- adding before and after example

+Be sure to perform a replace of literal strings by passing the
+regex=False parameter to func:`str.replace`. Mainly when the
+pattern is 1 size string (:issue:`24809`)
this is not needed, this is a simple bug fix
@WillAyd, is there a consensus about how to document this issue?
doc/source/whatsnew/v0.24.0.rst
Outdated
@@ -1645,6 +1669,7 @@ Strings
 - Bug in :meth:`Index.str.split` was not nan-safe (:issue:`23677`).
 - Bug :func:`Series.str.contains` not respecting the ``na`` argument for a ``Categorical`` dtype ``Series`` (:issue:`22158`)
 - Bug in :meth:`Index.str.cat` when the result contained only ``NaN`` (:issue:`24044`)
+- Bug in :func:`Series.str.replace` not applying regex in patterns of len size = 1 (:issue:`24809`)
"len size = 1" -> "length 1".
Ah I see your point now - you’d essentially be doing that on top of the change here. I was assuming we would hold off on this change in lieu of the warning.
On Jan 17, 2019, at 10:42 AM, Tom Augspurger wrote:
@TomAugspurger commented on this pull request.
In pandas/tests/test_strings.py:
> @@ -2924,7 +2932,7 @@ def test_pipe_failures(self):
tm.assert_series_equal(result, exp)
- result = s.str.replace('|', ' ')
+ result = s.str.replace('|', ' ', regex=False)
Why not just warn when the length of the pattern is 1 and regex=True?
Then I think there would be no way to have
In [5]: pd.Series(['a.c']).str.replace('.', 'b')
# Warning: Interpreting '.' as a literal, not a regex...
Out[5]:
0    abc
dtype: object
# no warning
In [5]: pd.Series(['a.c']).str.replace('.', 'b', regex=True)
Out[5]:
0    bbb
dtype: object
unless I'm missing something.
No. Just ones as potentially disruptive as this.
On Thu, Jan 17, 2019 at 9:57 AM Allan Larangeiras wrote:
In pandas/tests/test_strings.py:
> @@ -2924,7 +2932,7 @@ def test_pipe_failures(self):
tm.assert_series_equal(result, exp)
- result = s.str.replace('|', ' ')
+ result = s.str.replace('|', ' ', regex=False)
Following this line of reasoning, from what I understand, every bug found should issue a warning of future adjustment?
Sorry, I don't agree with this solution. I think it's best if someone else makes this fix.
Just noting the potential for this confusion was brought up when we added the At the time I noted that changing this behavior would break back-compat (since the undocumented behavior that had been there since the beginning was literal replacement for 1-character strings and regex replacement for >1 character strings). I'm totally on board with changing either the documentation or the behavior to be more consistent, but it definitely needs a deprecation cycle as suggested by @TomAugspurger. The behavior of .str.replace('.', '') without regex specified to replace periods, rather than every character, has been constant since at least <=0.16.
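The long-standing behavior described here (literal replacement for 1-character patterns, regex replacement for longer ones) follows from the len(pat) > 1 condition in the diff under review. Below is a simplified model of that dispatch on a plain string; legacy_replace is a hypothetical name, and the compiled-pattern, flags, and callable branches of the real condition are left out.

```python
import re


def legacy_replace(text, pat, repl):
    """Hypothetical model of the pre-fix dispatch (string patterns only)."""
    if len(pat) > 1:
        # Longer patterns went through the regex engine.
        return re.sub(pat, repl, text)
    # Patterns of length 0 or 1 were replaced literally.
    return text.replace(pat, repl)


one_char = legacy_replace("a.c", ".", "")   # period treated as a literal
two_char = legacy_replace("a.c", "..", "")  # '..' treated as a regex

print(repr(one_char))  # 'ac'
print(repr(two_char))  # 'c'
```

The same metacharacter therefore changes meaning depending only on pattern length, which is the inconsistency the thread is weighing against backward compatibility.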
Thanks for the context @Liam3851, that's valuable. @alarangeiras, does that make sense? Or are you done working on this?
git diff upstream/master -u -- "*.py" | flake8 --diff