Redefine match #5224

danielballan · 2013-10-14T21:24:20Z

How does this look? If the docstring and the tests reflect our consensus, I'll take a stab at the docs. This is the gist of it:

Default behavior is unchanged, but issues a warning.

In [1]: Series(['aa', 'bb']).str.match('(a)(a)')
pandas/core/strings.py:333: UserWarning: This usage of match will be removed in
            an upcoming version of pandas. Consider using extract instead.
  UserWarning)
Out[1]: 
0    (a, a)
1        []
dtype: object

New, more useful behavior is available through as_indexer.

In [2]: Series(['aa', 'bb']).str.match('(a)(a)', as_indexer=True)
Out[2]: 
0     True
1    False
dtype: bool

P.S. There's a stray commit in here. Not sure why 88716e4 got lumped in....

jtratner · 2013-10-14T21:31:07Z

Today I realized there's a contains method too :-/ not sure what we want to do.

danielballan · 2013-10-14T21:34:29Z

Ha! Well I'm glad this didn't take too long. Hmm.

jtratner · 2013-10-14T21:36:50Z

I just found it today - I had no idea. I guess we should just have match be a synonym for extract, with a warning, and then deprecate it? Or just let match keep doing its thing and emit warnings until we remove it?

jtratner · 2013-10-14T21:38:14Z

You might see if your implementation is better than contains...

That said, if you want to fix things like this in the future, easiest thing to do is diff your final commit against master, save the diff just on the files you care about, checkout master, then use git apply with the diff and then commit it again. That or copy/paste :)

jtratner · 2013-10-15T02:57:17Z

(I mean fix git issues.)

jreback · 2013-10-16T15:00:21Z

@danielballan this needs to be rebased to eliminate the first commit

danielballan · 2013-10-16T15:14:18Z

@jreback It's unclear what we want to do with this now. I'm in favor of @jtratner's suggestion above: make match a synonym for extract, and scrap it in 0.14.

jreback · 2013-10-16T15:15:44Z

@danielballan yes...I thought that was fine for 0.13....I haven't looked at this closely...is that the intention here?

danielballan · 2013-10-16T15:18:00Z

No, this made match return a boolean indexer, which we now realize is already covered by str.contains. Cannot fix today, but probably tomorrow.

jtratner · 2013-10-17T21:40:46Z

I'm still mixed on the naming here. What if we make match use re.match whereas contains can use re.search?
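For readers less familiar with the distinction, here is a minimal sketch using only the standard library `re` module (the sample string and pattern are just illustrations). `re.match` only succeeds when the pattern matches at the beginning of the string, while `re.search` scans for a match anywhere:

```python
import re

text = "abc"
pattern = "bc"

# re.match only succeeds if the pattern matches at the start of the string.
anchored = re.match(pattern, text) is not None

# re.search succeeds if the pattern matches anywhere in the string.
anywhere = re.search(pattern, text) is not None

print(anchored, anywhere)  # False True
```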

I'm still not sold on adding extract as a name and deprecating match. Though it might be surprising to have match return different ndim objects depending on number of groups, there's something nice about having one function work intelligently based on number of match groups.

jreback · 2013-10-21T12:48:24Z

how's this coming along?

danielballan · 2013-10-21T14:05:13Z

We need a final decision on the naming.

I have come around to preferring @jtratner's suggestion, where str.match and str.contains return boolean indexers based on re.match and re.search respectively. That leaves str.extract intact, effectively doing what str.match used to do, but in a more useful structure and (arguably) with a more descriptive name.

Examples from current PR branch "redefine-match":

In [12]: Series(['abc', '123']).str.contains('bc') # uses re.search
Out[12]: 
0     True
1    False
dtype: bool

In [13]: Series(['abc', '123']).str.match('bc', as_indexer=True) # uses re.match
Out[13]: 
0    False
1    False
dtype: bool

In [14]: Series(['abc', '123']).str.match('bc') # again, re.match
UserWarning: This usage of match will be removed in
        an upcoming version of pandas. Consider using extract instead.
Out[14]: 
0    []
1    []
dtype: object

In [15]: Series(['abc', '123']).str.extract('bc') # uses re.match
ValueError: This pattern contains no groups to capture.

In [16]: Series(['abc', '123']).str.extract('(bc)')
Out[16]: 
0   NaN
1   NaN
dtype: float64

In [17]: Series(['abc', '123']).str.extract('.*(bc)')
Out[17]: 
0     bc
1    NaN
dtype: object

If all this looks good, all we need are docs.

jtratner · 2013-10-21T15:57:46Z

Your suggested api looks good, but I'd prefer extract use re.search (if
someone really cared to get match behavior, simple to do in the regex
itself). match should not warn if you don't give it match groups, because
the previous behavior (returning empty tuples and lists) was basically
useless, so that usage is unambiguous.

I'd prefer the match warning to say 'In future versions of pandas, match
will change to always return a bool indexer'

Similarly, if as_indexer is True and the regex has match groups (or using
str.contains and it has match groups), would be nice to warn that you can
use str.extract to actually get the match groups back.

danielballan · 2013-10-21T19:49:23Z

Changes made. If we all like them, I'll write docs tomorrow. Let me know.

jreback · 2013-10-21T19:51:49Z

fyi....seems you have an odd commit in there....can you rebase off of current master?

jreback · 2013-10-22T21:59:57Z

can you rebase against master?

jreback · 2013-10-23T12:42:56Z

@jtratner ?

danielballan · 2013-10-23T13:05:00Z

Sorry, nonstop work at my day job, pushing a paper out. Will do today when I get to my work machine.

danielballan · 2013-10-23T20:18:04Z

Fixed. Thanks for the git guidance (awhile ago) @jtratner. Ready for docs? Are we happy with this?

jreback · 2013-10-23T20:29:12Z

I like this.

So to summarize we are effectively deprecating str.match in favor of extract, not touching contains, which does what?

danielballan · 2013-10-23T20:35:44Z

0.12:
match -- extracts matches in annoying, un-useful way
contains -- returns boolean indexer based on re.search

0.13:
match -- warn on annoying/useless behavior; new as_indexer option returns boolean indexer based on re.match
extract -- effectively replaces match, returns useful structure (uses re.search)
contains -- returns boolean indexer based on re.search

0.14:
match -- returns boolean indexer based on re.match
extract -- effectively replaces match, returns useful structure (uses re.search)
contains -- returns boolean indexer based on re.search
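
As a rough illustration of the 0.14 end state summarized above, the three behaviors can be modeled in plain Python on a list of strings. This is only a sketch under stated assumptions, not the pandas implementation: the function names mirror the `str` methods, but the helpers themselves are hypothetical, operate on lists rather than Series, return only the first capture group, and use `None` where pandas would use NaN.

```python
import re

def contains(values, pat):
    """Boolean mask: True where the pattern is found anywhere (re.search)."""
    return [re.search(pat, v) is not None for v in values]

def match(values, pat):
    """Boolean mask: True where the pattern matches at the start (re.match)."""
    return [re.match(pat, v) is not None for v in values]

def extract(values, pat):
    """First capture group per element via re.search, or None if no match."""
    if re.compile(pat).groups == 0:
        raise ValueError("This pattern contains no groups to capture.")
    found = (re.search(pat, v) for v in values)
    return [m.group(1) if m else None for m in found]

values = ["abc", "123"]
print(contains(values, "bc"))   # [True, False]
print(match(values, "bc"))      # [False, False]
print(extract(values, "(bc)"))  # ['bc', None]
```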

jreback · 2013-10-23T21:06:50Z

I would say go ahead with the docs (put that in a separate commit).... for v0.13.0 and in the string section

jreback · 2013-10-23T21:14:38Z

@danielballan I would possibly make some sub-sections under the string methods, e.g. maybe for the methods that you are actually explaining (e.g. match/extract/split.....).

jtratner · 2013-10-23T21:33:36Z

I'm very happy with this (and I appreciate your flexibility!)- can you confirm one thing for me:

match -- warn on annoying/useless behavior; new as_indexer option returns boolean indexer based on re.match

So this means when match gets something with zero match groups, it returns a boolean indexer? That way, for the majority of use cases match will just work (i.e., no need to even put examples in the docs with match groups), and then we'd warn when there are match groups but preserve the existing behavior.

Thanks for putting this together!

danielballan · 2013-10-25T16:38:37Z

I hadn't thought of that. Currently, match only returns a boolean indexer if you set as_indexer=True. What you are suggesting is better. I will make the change.

Sorry for the delay on my end. Not losing patience, just short on free time this week.

jreback · 2013-10-27T22:01:10Z

@danielballan how's this coming?

danielballan · 2013-10-28T20:58:33Z

Starting the docs, I came across this:

Methods like contains, startswith, and endswith take an extra
na argument so missing values can be considered True or False.

The new version of match should do that as well. The only difference between contains and the new match should be re.search vs. re.match.
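
To make the intended symmetry concrete, here is a small plain-Python sketch. `regex_mask` is a hypothetical helper, not a pandas function, and it treats `None` as the missing value: the same code path serves both methods, and the only thing that changes is whether `re.search` or `re.match` is passed in.

```python
import re

def regex_mask(values, pat, finder, na=None):
    """Hypothetical helper: apply `finder` (re.match or re.search) to each
    string; missing entries (None here) are replaced by `na`."""
    return [na if v is None else finder(pat, v) is not None for v in values]

values = ["foobar", None, "barfoo"]

print(regex_mask(values, "bar", re.search, na=False))  # [True, False, True]
print(regex_mask(values, "bar", re.match, na=False))   # [False, False, True]
print(regex_mask(values, "bar", re.match))             # [False, None, True]
```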

jtratner · 2013-10-28T21:01:56Z

that's fine, fine with me if you incorporate that.

jtratner · 2013-10-28T21:20:18Z

Just ping me when you think this is all ready to go. Also, if you want to
rebase on master while you're working on this, would be helpful.

jtratner · 2013-10-30T21:07:32Z

I'd really like this to be in 0.13 rc, because it will wrap up str methods
nicely - would you be able to get this out today? No pressure tho

danielballan · 2013-10-31T00:36:24Z

Ready.

Now the APIs and underlying code for match and contains are happily symmetric, aside from the warning apparatus and the as_indexer stuff that we will remove in 0.14. Docs should be good to go. I haven't compiled them (I'm not set up for that yet) but I didn't touch any of the ipython code in these edits.

jtratner · 2013-10-31T00:39:00Z

As long as it passes the tests, I'm fine with this.

I probably would want to edit the docs somewhat to just show what you can do in 0.13 and not bother mentioning all the weird deprecated things you can do (esp since I think you can always go back and look at docs for older versions), but I'm fine with doing it later and getting this functionality merged for now :)

danielballan · 2013-10-31T00:42:00Z

I waffled on how much to explain, and probably erred on the side of saying too much. I'm heading back over to see if I can fix filter in time for the RC. We can come back to this.

jtratner · 2013-10-31T00:43:42Z

Yeah, exactly, docs are easy to pare down later - just commenting that we should do it at some point. Frankly, the entire docs section on string matching could be tightened up a bit.

jtratner · 2013-10-31T00:51:42Z

pandas/tests/test_strings.py

@@ -411,10 +419,52 @@ def test_match(self):
        # unicode
        values = Series([u('fooBAD__barBAD'), NA, u('foo')])

-        result = values.str.match('.*(BAD[_]+).*(BAD)')
+        with warnings.catch_warnings(record=True) as w:


For future reference: you can use assert_produces_warning() here as well, so you could get this down to:

with tm.assert_produces_warning():
    result = values.str.match('.*(BAD[_]+).*(BAD)')
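
For anyone curious what such a helper does under the hood, here is a rough standard-library sketch. `expect_warning` and `legacy_call` are hypothetical names invented for this illustration; this is not how the pandas test helper is actually implemented, only the general idea of wrapping `warnings.catch_warnings(record=True)` in a context manager.

```python
import warnings
from contextlib import contextmanager

@contextmanager
def expect_warning(category=Warning):
    """Hypothetical stand-in: fail unless the block emits a `category` warning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is suppressed
        yield caught
    assert any(issubclass(w.category, category) for w in caught), \
        "expected a %s" % category.__name__

def legacy_call():
    # Illustrative function that warns the way a deprecated code path would.
    warnings.warn("This usage will be removed.", UserWarning)
    return "result"

with expect_warning(UserWarning):
    result = legacy_call()
```

The test body stays a single `with` block, and the check on the recorded warnings happens automatically when the block exits.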

jtratner · 2013-10-31T02:15:38Z

Thanks!

hayd · 2014-01-21T23:09:00Z

One workaround/trick is to do (where regex is foo|bar):

s.str.contains('^(foo|bar)$')

… match (GH5224)

This PR changes the default behaviour of `str.match` from extracting groups to just a match (True/False). The previous default behaviour was deprecated since 0.13.0 (#5224)

Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Closes #15257 from jorisvandenbossche/str-match and squashes the following commits:

0ab36b6 [Joris Van den Bossche] Raise FutureWarning instead of UserWarning for as_indexer
a2bae51 [Joris Van den Bossche] raise error in case of regex with groups and as_indexer=False
87446c3 [Joris Van den Bossche] fix test
0788de2 [Joris Van den Bossche] API: change default behaviour of str.match from deprecated extract to match (GH5224)

danielballan mentioned this pull request Oct 14, 2013

API: Series.str.match == extract? #5075

Closed

danielballan added 2 commits October 30, 2013 20:07

ENH/CLN: Redefine str.match, and issue a warning on deprecated defaul…

75dd0f2

…t behavior.

DOC: Expanded section on string methods in wake of extract/match change.

3b832d0

jtratner reviewed Oct 31, 2013

jtratner added a commit to jtratner/pandas that referenced this pull request Oct 31, 2013

Merge pull request pandas-dev#5224 from daniel-ballan/redefine-match

3ebd769

jtratner merged commit 3b832d0 into pandas-dev:master Oct 31, 2013

danielballan deleted the redefine-match branch October 31, 2013 03:54

jorisvandenbossche mentioned this pull request Jan 29, 2017

API: change default behaviour of str.match from deprecated extract to match (GH5224) #15257

Closed
