API: Series.str.match == extract? #5075

jreback · 2013-10-01T23:22:03Z

This is the whatsnew example. I think for compat, the 3rd entry should be (np.nan, 3) ?

In [1]: Series(['a1', 'b2', '3']).str.match('(?P<letter>[ab])?(?P<digit>\d)')
Out[1]: 
0       (a, 1)
1       (b, 2)
2    (None, 3)
dtype: object

The text was updated successfully, but these errors were encountered:

danielballan · 2013-10-03T17:20:26Z

Actually, that should say str.extract. (My bad!) See #5099.

Extract does NaN, not None, so that entry will become NaN with this edit. IMO we should leave match alone and let it follow the conventions of re. The point of extract is to give a more pandorable* alternative to match.

*Am I doing it right, @hayd?

jreback · 2013-10-03T17:24:41Z

is there a reason to not make match == extract? (maybe deprecate in 0.13 and change in 0.14)?

jreback · 2013-10-03T17:25:34Z

I merged #5099

danielballan · 2013-10-03T18:55:57Z

That makes sense to me.

jreback · 2013-10-03T19:08:24Z

ok...why don't you do a PR to put that deprecation in

(also I think you changed the docs in basics.rst, not just in v0.13.0 (#5099)

jreback · 2013-10-03T19:22:50Z

@jtratner brings up a good point.....

is there a different in match and extract?

will it break compat if we just call extract match?

jreback · 2013-10-03T19:24:16Z

@jtratner questions

but if you can use the output of extract with match, then I don't see the point of breaking backwards compatibility by scuttling the old name
only question I have is this:

could you previously do something like:

ser[ser.str.match(some_regex)]

and will that stop working now?
(and if it used to output tuples, then it would never have been useful)
maybe match should accept strings without match groups and use a different subroutine?
ie.., without match groups returns bools

@hayd @danielballan can you comment on this?

hayd · 2013-10-03T22:08:14Z

Although match kinda == extract, I much prefer the name extract (as that describes more precisely what it's going on... pandorable), and I think it's ok to be consistent with re but match is vague... I agree current output IS kinda not that useful.

The bool thing for groupless is an interesting idea (and would distinguish extract and match), and seems "expected" (that's what "match" means in English!).

Also if the output NaN instead of [] for missing that would be better.

In [1]: s = pd.Series(list('abcde'))

In [2]: s.str.match('(a|b)')
Out[2]: 
0    (a,)
1    (b,)
2      []
3      []
4      []
dtype: object

atm using ser[ser.str.match(some_regex)] throws if you don't match for that reason. If output NaN rather than [] for no match, then you could use s.str.match('a|b').notnull() to filter... though actually you ought to be using filter to filter!!

In [3]: s.filter(regex='a|b')  # erm this gives me TypeError: buffer size mismatch??

jtratner · 2013-10-03T22:12:44Z

okay. I was just hoping we could not add a method then deprecate an old one if we could just put it together under one method.

hayd · 2013-10-03T22:24:06Z

At the moment what is the best way to match an entire regex? I guess you can use...

In [22]: s.str.match('(a|b)$')

I can't think of another reason to use match, so if there's another way to do the above (and I expect there is) then I think am happy to depreciate match... (but -1 on renaming extract to match!).

jreback · 2013-10-03T22:26:29Z

so why don't you

extract(regex,flags,as_list=False)

and as_list=True will do what match does not

then match is just a wrapper around extract (or can just remove it and point to extract)?

jtratner · 2013-10-03T22:29:42Z

@hayd I think it's worth it to have a way to convert a Series of strings into a boolean indexer (which you might use for filter, but you could also use for, e.g., making an indexer to use with something else).

@jreback I'd like to add extract, and turn match into something that converts str --> bool (and I guess leaves nan?), because I think that's much clearer. No reason to overload extract. (so basically sugar for ser.apply(lambda x: re.match(x)) with better nan handling and maybe avoiding the apply machinery if there's a faster path.

jreback · 2013-10-11T11:52:18Z

@danielballan @hayd @jtratner

what was decided here?

jtratner · 2013-10-11T12:45:23Z

Let's add extract and then I'll put in a PR to change match to turn into a
method that returns a bool Series suitable for indexing. We can put that
new behavior under a keyword argument and deprecate the original behavior
for 0.14.

jreback · 2013-10-11T12:49:06Z

ok...you could deprecate match in 0.13 (and just direct to using the kw with extract).....

jtratner · 2013-10-11T12:54:51Z

Yeah, that's what I'm thinking, maybe something like bool_only = False and
then go from there.
On Oct 11, 2013 8:49 AM, "jreback" notifications@github.com wrote:

ok...you could deprecate match in 0.13 (and just direct to using the kw
with extract).....

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/5075#issuecomment-26134330
.

jreback · 2013-10-11T13:01:19Z

as_indexer=False?

danielballan · 2013-10-14T21:27:05Z

I took a stab at it. See #5224.

danielballan mentioned this issue Oct 15, 2013

Redefine match #5224

Merged

jtratner closed this as completed in #5224 Oct 31, 2013

teese mentioned this issue Nov 8, 2014

get start and end of regex match in dataframe #8747

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: Series.str.match == extract? #5075

API: Series.str.match == extract? #5075

jreback commented Oct 1, 2013

danielballan commented Oct 3, 2013

jreback commented Oct 3, 2013

jreback commented Oct 3, 2013

danielballan commented Oct 3, 2013

jreback commented Oct 3, 2013

jreback commented Oct 3, 2013

jreback commented Oct 3, 2013

hayd commented Oct 3, 2013

jtratner commented Oct 3, 2013

hayd commented Oct 3, 2013

jreback commented Oct 3, 2013

jtratner commented Oct 3, 2013

jreback commented Oct 11, 2013

jtratner commented Oct 11, 2013

jreback commented Oct 11, 2013

jtratner commented Oct 11, 2013

jreback commented Oct 11, 2013

danielballan commented Oct 14, 2013

API: Series.str.match == extract? #5075

API: Series.str.match == extract? #5075

Comments

jreback commented Oct 1, 2013

danielballan commented Oct 3, 2013

jreback commented Oct 3, 2013

jreback commented Oct 3, 2013

danielballan commented Oct 3, 2013

jreback commented Oct 3, 2013

jreback commented Oct 3, 2013

jreback commented Oct 3, 2013

hayd commented Oct 3, 2013

jtratner commented Oct 3, 2013

hayd commented Oct 3, 2013

jreback commented Oct 3, 2013

jtratner commented Oct 3, 2013

jreback commented Oct 11, 2013

jtratner commented Oct 11, 2013

jreback commented Oct 11, 2013

jtratner commented Oct 11, 2013

jreback commented Oct 11, 2013

danielballan commented Oct 14, 2013