Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Series.str.match == extract? #5075

Closed
jreback opened this issue Oct 1, 2013 · 18 comments · Fixed by #5224
Closed

API: Series.str.match == extract? #5075

jreback opened this issue Oct 1, 2013 · 18 comments · Fixed by #5224
Labels
API Design Strings String extension data type and string data
Milestone

Comments

@jreback
Copy link
Contributor

jreback commented Oct 1, 2013

This is the whatsnew example. I think for compat, the 3rd entry should be (np.nan, 3) ?

In [1]: Series(['a1', 'b2', '3']).str.match('(?P<letter>[ab])?(?P<digit>\d)')
Out[1]: 
0       (a, 1)
1       (b, 2)
2    (None, 3)
dtype: object
@danielballan
Copy link
Contributor

Actually, that should say str.extract. (My bad!) See #5099.

Extract does NaN, not None, so that entry will become NaN with this edit. IMO we should leave match alone and let it follow the conventions of re. The point of extract is to give a more pandorable* alternative to match.

*Am I doing it right, @hayd?

@jreback
Copy link
Contributor Author

jreback commented Oct 3, 2013

is there a reason to not make match == extract? (maybe deprecate in 0.13 and change in 0.14)?

@jreback
Copy link
Contributor Author

jreback commented Oct 3, 2013

I merged #5099

@danielballan
Copy link
Contributor

That makes sense to me.

@jreback
Copy link
Contributor Author

jreback commented Oct 3, 2013

ok...why don't you do a PR to put that deprecation in

(also I think you changed the docs in basics.rst, not just in v0.13.0 (#5099)

@jreback
Copy link
Contributor Author

jreback commented Oct 3, 2013

@jtratner brings up a good point.....

is there a different in match and extract?

will it break compat if we just call extract match?

@jreback
Copy link
Contributor Author

jreback commented Oct 3, 2013

@jtratner questions

but if you can use the output of extract with match, then I don't see the point of breaking backwards compatibility by scuttling the old name
only question I have is this:

could you previously do something like:

ser[ser.str.match(some_regex)]

and will that stop working now?
(and if it used to output tuples, then it would never have been useful)
maybe match should accept strings without match groups and use a different subroutine?
ie.., without match groups returns bools

@hayd @danielballan can you comment on this?

@hayd
Copy link
Contributor

hayd commented Oct 3, 2013

Although match kinda == extract, I much prefer the name extract (as that describes more precisely what it's going on... pandorable), and I think it's ok to be consistent with re but match is vague... I agree current output IS kinda not that useful.

The bool thing for groupless is an interesting idea (and would distinguish extract and match), and seems "expected" (that's what "match" means in English!).

Also if the output NaN instead of [] for missing that would be better.

In [1]: s = pd.Series(list('abcde'))

In [2]: s.str.match('(a|b)')
Out[2]: 
0    (a,)
1    (b,)
2      []
3      []
4      []
dtype: object

atm using ser[ser.str.match(some_regex)] throws if you don't match for that reason. If output NaN rather than [] for no match, then you could use s.str.match('a|b').notnull() to filter... though actually you ought to be using filter to filter!!

In [3]: s.filter(regex='a|b')  # erm this gives me TypeError: buffer size mismatch??

@jtratner
Copy link
Contributor

jtratner commented Oct 3, 2013

okay. I was just hoping we could not add a method then deprecate an old one if we could just put it together under one method.

@hayd
Copy link
Contributor

hayd commented Oct 3, 2013

At the moment what is the best way to match an entire regex? I guess you can use...

In [22]: s.str.match('(a|b)$')

I can't think of another reason to use match, so if there's another way to do the above (and I expect there is) then I think am happy to depreciate match... (but -1 on renaming extract to match!).

@jreback
Copy link
Contributor Author

jreback commented Oct 3, 2013

so why don't you

extract(regex,flags,as_list=False)

and as_list=True will do what match does not

then match is just a wrapper around extract (or can just remove it and point to extract)?

@jtratner
Copy link
Contributor

jtratner commented Oct 3, 2013

@hayd I think it's worth it to have a way to convert a Series of strings into a boolean indexer (which you might use for filter, but you could also use for, e.g., making an indexer to use with something else).

@jreback I'd like to add extract, and turn match into something that converts str --> bool (and I guess leaves nan?), because I think that's much clearer. No reason to overload extract. (so basically sugar for ser.apply(lambda x: re.match(x)) with better nan handling and maybe avoiding the apply machinery if there's a faster path.

@jreback
Copy link
Contributor Author

jreback commented Oct 11, 2013

@danielballan @hayd @jtratner

what was decided here?

@jtratner
Copy link
Contributor

Let's add extract and then I'll put in a PR to change match to turn into a
method that returns a bool Series suitable for indexing. We can put that
new behavior under a keyword argument and deprecate the original behavior
for 0.14.

@jreback
Copy link
Contributor Author

jreback commented Oct 11, 2013

ok...you could deprecate match in 0.13 (and just direct to using the kw with extract).....

@jtratner
Copy link
Contributor

Yeah, that's what I'm thinking, maybe something like bool_only = False and
then go from there.
On Oct 11, 2013 8:49 AM, "jreback" notifications@github.com wrote:

ok...you could deprecate match in 0.13 (and just direct to using the kw
with extract).....


Reply to this email directly or view it on GitHubhttps://github.com//issues/5075#issuecomment-26134330
.

@jreback
Copy link
Contributor Author

jreback commented Oct 11, 2013

as_indexer=False?

@danielballan
Copy link
Contributor

I took a stab at it. See #5224.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Strings String extension data type and string data
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants