Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redefine match #5224

Merged
merged 2 commits into from
Oct 31, 2013
Merged

Redefine match #5224

merged 2 commits into from
Oct 31, 2013

Conversation

danielballan
Copy link
Contributor

closes #5075

How does this look? If the docstring and the tests reflect our consensus, I'll take a stab at the docs. This is the gist of it:

Default behavior is unchanged, but issues a warning.

In [1]: Series(['aa', 'bb']).str.match('(a)(a)')
pandas/core/strings.py:333: UserWarning: This usage of match will be removed in
            an upcoming version of pandas. Consider using extract instead.
  UserWarning)
Out[1]: 
0    (a, a)
1        []
dtype: object

New, more useful behavior is available through as_indexer.

In [2]: Series(['aa', 'bb']).str.match('(a)(a)', as_indexer=True)
Out[2]: 
0     True
1    False
dtype: bool

P.S. There's a stray commit in here. Not sure why 88716e4 got lumped in....

@jtratner
Copy link
Contributor

Today I realized there's a contains method too :-/ not sure what we want to do.

@danielballan
Copy link
Contributor Author

Ha! Well I'm glad this didn't take too long. Hmm.

@jtratner
Copy link
Contributor

I just found it today - I had no idea. I guess we should just have match be a synonym with warning for extract and then deprecate? Or just let match keep doing its thing and emit warnings til we remove it?

@jtratner
Copy link
Contributor

You might see if your implementation is better than contains...

That said, if you want to fix things like this in the future, easiest thing to do is diff your final commit against master, save the diff just on the files you care about, checkout master, then use git apply with the diff and then commit it again. That or copy/paste :)

@jtratner
Copy link
Contributor

(I mean fix git issues.)

@jreback
Copy link
Contributor

jreback commented Oct 16, 2013

@danielballan this needs to be rebased to eliminate the first commit

@danielballan
Copy link
Contributor Author

@jreback It's unclear what we want to do with this now. I'm in favor of @jtratner's suggestion above: make match a synonym for extract, and scrap it in 0.14.

@jreback
Copy link
Contributor

jreback commented Oct 16, 2013

@danielballan yes...I thought that was fine for 0.13....I haven't looked at this close...is that the intention here?

@danielballan
Copy link
Contributor Author

No, this made match return a boolean indexer, which we now realize is already covered by str.contains. Cannot fix today, but probably tomorrow.

@jtratner
Copy link
Contributor

I'm still mixed on the naming here. What if we make match use re.match whereas contains can use re.search?

I'm still not sold on adding extract as a name and deprecating match. Though it might be surprising to have match return different ndim objects depending on number of groups, there's something nice about having one function work intelligently based on number of match groups.

@jreback
Copy link
Contributor

jreback commented Oct 21, 2013

how's this coming along?

@danielballan
Copy link
Contributor Author

We need a final decision on the naming.

I have come around to preferring @jtratner's suggestion, where str.match and str.contains return boolean indexers based on re.match and re.search respectively. That leaves str.extract intact, effectively doing what str.match used to do, but in a more useful structure and (arguably) with more descriptive name.

Examples from current PR branch "redefine-match":

In [12]: Series(['abc', '123']).str.contains('bc') # uses re.search
Out[12]: 
0     True
1    False
dtype: bool

In [13]: Series(['abc', '123']).str.match('bc', as_indexer=True) # uses re.match
Out[13]: 
0    False
1    False
dtype: bool

In [14]: Series(['abc', '123']).str.match('bc') # again, re.match
UserWarning: This usage of match will be removed in
        an upcoming version of pandas. Consider using extract instead.
Out[14]: 
0    []
1    []
dtype: object

In [15]: Series(['abc', '123']).str.extract('bc') # uses re.match
ValueError: This pattern contains no groups to capture.

In [16]: Series(['abc', '123']).str.extract('(bc)')
Out[16]: 
0   NaN
1   NaN
dtype: float64

In [17]: Series(['abc', '123']).str.extract('.*(bc)')
Out[17]: 
0     bc
1    NaN
dtype: object

If all this looks good, all we need are docs.

@jtratner
Copy link
Contributor

Your suggested api looks good, but I'd prefer extract use re.search (if
someone really cared to get match behavior, simple to do in the regex
itself). match should not warn if you don't give it match groups, because
the previous behavior (returning empty tuples and lists) was basically
useless, so that usage is unambiguous.

I'd prefer the match warning to say 'In future versions of pandas, match
will change to always return a bool indexer'

Similarly, if as_indexer is True and the regex has match groups (or using
str.contains and it has match groups), would be nice to warn that you can
use str.extract to actually get the match groups back.

@danielballan
Copy link
Contributor Author

Changes made. If we all like them, I'll write docs tomorrow. Let me know.

@jreback
Copy link
Contributor

jreback commented Oct 21, 2013

fyi....seems you have an odd commit in there....can you rebase off of current master?

@jreback
Copy link
Contributor

jreback commented Oct 22, 2013

can you rebase against master?

@jreback
Copy link
Contributor

jreback commented Oct 23, 2013

@jtratner ?

@danielballan
Copy link
Contributor Author

Sorry, nonstop work at my day job, pushing a paper out. Will do today when I get to my work machine.

@danielballan
Copy link
Contributor Author

Fixed. Thanks for the git guidance (awhile ago) @jtratner. Ready for docs? Are we happy with this?

@jreback
Copy link
Contributor

jreback commented Oct 23, 2013

I like this.

So to summarize we are effectively deprecating str.match in favor of extract, not touching contains, which does what?

@danielballan
Copy link
Contributor Author

0.12:
match -- extracts matches in annoying, un-useful way
contains -- returns boolean indexer based on re.search

0.13:
match -- warn on annoying/useless behavior; new as_indexer option returns boolean indexer based on re.match
extract -- effectively replaces match, returns useful structure (uses on re.search)
contains -- returns boolean indexer based on re.search

0.14:
match -- returns boolean indexer based on re.match
extract -- effectively replaces match, returns useful structure (uses on re.search)
contains -- returns boolean indexer based on re.search

@jreback
Copy link
Contributor

jreback commented Oct 23, 2013

I would say go ahead with the docs (put that in a separate commit).... for v0.13.0 and in the string section

@jreback
Copy link
Contributor

jreback commented Oct 23, 2013

@danielballan I would possibly make some sub-sections under the string methods, e.g. maybe for the methods that you are actually explaining (e.g. match/extract/split.....).

@jtratner
Copy link
Contributor

I'm very happy with this (and I appreciate your flexibility!)- can you confirm one thing for me:

match -- warn on annoying/useless behavior; new as_indexer option returns boolean indexer based on re.match

So this means when match gets something with zero match groups, it returns a boolean indexer? That way for the majority of use cases match will just work. (i.e., no need to even put examples in the docs with match groups) and then we'd warn when there were match groups but preserve existing.

Thanks for putting this together!

@danielballan
Copy link
Contributor Author

I hadn't thought of that. Currently, match only returns a boolean indexer is you set as_indexer=True. What you are suggesting is better. I will make the change.

Sorry for the delay on my end. Not losing patience, just short on free time this week.

@jreback
Copy link
Contributor

jreback commented Oct 27, 2013

@danielballan how's this coming?

@danielballan
Copy link
Contributor Author

Starting the docs, I came across this:

Methods like contains, startswith, and endswith takes an extra
na arguement so missing values can be considered True or False.

The new version of match should do that as well. The only different between contains and the new match should be re.search vs. re.match.

@jtratner
Copy link
Contributor

that's fine, fine with me if you incorporate that.

@jtratner
Copy link
Contributor

Just ping me when you think this is all ready to go. Also, if you want to
rebase on master while you're working on this, would be helpful.

@jtratner
Copy link
Contributor

I'd really like this to be in 0.13 rc, because it will wrap up str methods
nicely - would you be able to get this out today? No pressure tho

@danielballan
Copy link
Contributor Author

Ready.

Now the APIs and underlying codes for match and contains are happily symmetric, aside from the warning apparatus and the as_indexer stuff that we will remove in 0.14. Docs should be good to go. I haven't complied them (I'm not set up for that yet) but I didn't touch any of the ipython code in these edits.

@jtratner
Copy link
Contributor

As long as it passes the tests, I'm fine with this.

I probably would want to edit the docs somewhat to just show what you can do in 0.13 and not bother mentioning all the weird deprecated things you can do (esp since I think you can always go back and look at docs for older versions), but I'm fine with doing it later and getting this functionality merged for now :)

@danielballan
Copy link
Contributor Author

I waffled on how much to explain, and probably erred saying too much. I'll heading back over to see if I can fix filter in time for the RC. We can come back to this.

@jtratner
Copy link
Contributor

Yeah, exactly, docs are easy to pare down later - just commenting that we should do it at some point. Frankly, the entire docs section on string matching could be tightened up a bit.

@@ -411,10 +419,52 @@ def test_match(self):
# unicode
values = Series([u('fooBAD__barBAD'), NA, u('foo')])

result = values.str.match('.*(BAD[_]+).*(BAD)')
with warnings.catch_warnings(record=True) as w:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

future reference - you can use assert_produces_warning() here as well, so you could get down to:

with tm.assert_produces_warning():
    result = values.str.match('.*(BAD[_]+).*(BAD)')

jtratner added a commit to jtratner/pandas that referenced this pull request Oct 31, 2013
@jtratner jtratner merged commit 3b832d0 into pandas-dev:master Oct 31, 2013
@jtratner
Copy link
Contributor

Thanks!

@danielballan danielballan deleted the redefine-match branch October 31, 2013 03:54
@hayd
Copy link
Contributor

hayd commented Jan 21, 2014

One workaround/trick is to do (where regex is foo|bar):

s.str.contains('^(foo|bar)$')

jreback pushed a commit that referenced this pull request Mar 22, 2017
… match (GH5224)

This PR changes the default behaviour of
`str.match` from extracting groups to just a match (True/False). The
previous default behaviour was deprecated since 0.13.0
(#5224)

Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Closes #15257 from jorisvandenbossche/str-match and squashes the following commits:

0ab36b6 [Joris Van den Bossche] Raise FutureWarning instead of UserWarning for as_indexer
a2bae51 [Joris Van den Bossche] raise error in case of regex with groups and as_indexer=False
87446c3 [Joris Van den Bossche] fix test
0788de2 [Joris Van den Bossche] API: change default behaviour of str.match from deprecated extract to match (GH5224)
mattip pushed a commit to mattip/pandas that referenced this pull request Apr 3, 2017
… match (GH5224)

This PR changes the default behaviour of
`str.match` from extracting groups to just a match (True/False). The
previous default behaviour was deprecated since 0.13.0
(pandas-dev#5224)

Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Closes pandas-dev#15257 from jorisvandenbossche/str-match and squashes the following commits:

0ab36b6 [Joris Van den Bossche] Raise FutureWarning instead of UserWarning for as_indexer
a2bae51 [Joris Van den Bossche] raise error in case of regex with groups and as_indexer=False
87446c3 [Joris Van den Bossche] fix test
0788de2 [Joris Van den Bossche] API: change default behaviour of str.match from deprecated extract to match (GH5224)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: Series.str.match == extract?
4 participants