different handling of \w in unicode patterns in regex and re #3

mrabarnett · 2011-02-06T17:13:48Z

Hi,

I think this may be intended behaviour, but I didn't find it mentioned anywhere in the docs. Sorry if it has already been discussed somewhere I haven't looked ...

It seems that for unicode patterns like ur"..." regex implicitly sets the unicode flag (?u), while re doesn't seem to do that.

>>> re.findall(ur"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(ur"\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> re.findall(r"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(r"\w", u"aáb")
[u'a', u'b']
>>> re.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> regex.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']
>>>

Python 2.7.1, Win XP SP3, 32-bit, Czech; regex r902c02d44f

regards,
Vlastimil Brom


mrabarnett · 2011-02-06T18:55:59Z

Original comment by Anonymous.

Ah, yes, if the pattern is a Unicode string then the matching defaults to Unicode, and if the pattern is a bytestring then the matching defaults to ASCII.

You can be explicit with regex.UNICODE or "(?u)" and regex.ASCII or "(?a)".
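As a rough illustration of the same rule, here is a small sketch using Python 3's standard `re` module, where str patterns default to Unicode matching and bytes patterns to ASCII matching. It does not import `regex` itself; that `regex` gives the same results for these calls is an assumption based on the description above. The sample text and variable names are just for illustration.

```python
import re

# Sample text: "a", U+00E1 (a with acute), "b"
text = "a\u00e1b"

# str (Unicode) pattern: \w matches Unicode word characters by default
unicode_default = re.findall(r"\w", text)

# Explicit ASCII matching, either inline or via the flag argument
ascii_inline = re.findall(r"(?a)\w", text)
ascii_flag = re.findall(r"\w", text, re.ASCII)

# bytes pattern against bytes input: \w is limited to ASCII word characters
bytes_default = re.findall(rb"\w", text.encode("utf-8"))

print(unicode_default)  # ['a', 'á', 'b']
print(ascii_inline)     # ['a', 'b']
print(ascii_flag)       # ['a', 'b']
print(bytes_default)    # [b'a', b'b']
```

Being explicit with the flag makes the intent visible at the call site regardless of which default applies.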

The justification is that if you're using Unicode strings then you probably want Unicode matching too. I'll make a note to update the docs at some point (I don't have any other changes planned).

I would be willing to make it the same as the 're' module if the general consensus is that it should be.

mrabarnett · 2011-02-07T05:45:06Z

Original comment by Anonymous.

Thanks for the confirmation; I was just a bit surprised to see different results in a script (using re) and in my general app (normally using regex), where I didn't expect a difference between these re engines.

I am happy with either behaviour. The (?u) can simply be added if needed and is more explicit. On the other hand, the unicode flag is global and cannot be switched off: if one needed a unicode string pattern with special sequences interpreted as ASCII, [a-zA-Z0-9_] would be necessary instead of \w (if I understand correctly).
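The explicit character class mentioned above can be sketched as follows, again with Python 3's standard `re` module rather than `regex` (assumed to behave the same way for a plain character class). The sample string is made up for illustration.

```python
import re

# Sample text: a Czech word with non-ASCII letters, then an ASCII identifier
text = "k\u016f\u0148 horse_1"

# Unicode matching: \w+ covers the accented letters too
unicode_words = re.findall(r"\w+", text)

# Explicit ASCII class: only [a-zA-Z0-9_] counts, even in a Unicode pattern
ascii_class_words = re.findall(r"[a-zA-Z0-9_]+", text)

# For comparison, \w under the ASCII flag matches the same set as the class
ascii_flag_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)      # ['kůň', 'horse_1']
print(ascii_class_words)  # ['k', 'horse_1']
print(ascii_flag_words)   # ['k', 'horse_1']
```

The character class is handy when only part of a pattern should be ASCII-only while the rest keeps Unicode semantics.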

That being said, I have no strong personal preference, now that it is documented. It would depend on the policy for inclusion into the standard library (e.g. whether to include this behaviour under the NEW flag).

vbr

mrabarnett · 2016-06-15T00:25:00Z

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).

Documented.

mrabarnett closed this as completed Jun 15, 2016
