different handling of \w in unicode patterns in regex and re #3

Closed
mrabarnett opened this issue Feb 6, 2011 · 3 comments
Labels: bug, minor

Comments

@mrabarnett
Owner

Original report by Anonymous.


Hi,

I think it may be intended behaviour, but I didn't find it mentioned anywhere in the docs. Sorry if it is already discussed somewhere I haven't looked ...

It seems that for unicode patterns like ur"..." regex implicitly sets the unicode flag (?u), while re doesn't seem to do that.

>>> re.findall(ur"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(ur"\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> re.findall(r"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(r"\w", u"aáb")
[u'a', u'b']
>>> re.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> regex.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> 

Python 2.7.1, win XPp SP3, 32 bit Czech; regex r902c02d44f

regards,
Vlastimil Brom

@mrabarnett
Owner Author

Original comment by Anonymous.


Ah, yes, if the pattern is a Unicode string then the matching defaults to Unicode, and if the pattern is a bytestring then the matching defaults to ASCII.

You can be explicit with regex.UNICODE or "(?u)" and regex.ASCII or "(?a)".
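As a rough illustration of the defaults and the explicit flags, here is a minimal sketch. It assumes Python 3 and the standard-library `re` module rather than `regex`; in Python 3, `re` also picks its default from the pattern type (str patterns match as Unicode, bytes patterns as ASCII) and accepts the same `ASCII`/`(?a)` and `UNICODE`/`(?u)` spellings. The session in the report is Python 2, where `re` behaves differently, so treat this as an analogy, not a reproduction. The variable names are made up for the example.

```python
import re

text = "aáb"

# str pattern: Python 3's re defaults to Unicode matching, so \w matches "á".
str_default = re.findall(r"\w", text)

# bytes pattern: matching is ASCII-only, so the UTF-8 bytes of "á" are skipped.
bytes_default = re.findall(rb"\w", text.encode("utf-8"))

# Explicit ASCII matching on a str pattern, via the flag or the inline form.
ascii_flag = re.findall(r"\w", text, flags=re.ASCII)
ascii_inline = re.findall(r"(?a)\w", text)

# Explicit Unicode matching via the inline form (redundant for str in Python 3).
unicode_inline = re.findall(r"(?u)\w", text)

print(str_default)     # ['a', 'á', 'b']
print(bytes_default)   # [b'a', b'b']
print(ascii_flag)      # ['a', 'b']
print(ascii_inline)    # ['a', 'b']
print(unicode_inline)  # ['a', 'á', 'b']
```

Spelling out the flag keeps the result independent of whichever default the module applies.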

The justification is that if you're using Unicode strings then you probably want Unicode matching too. I'll make a note to update the docs at some point (I don't have any other changes planned).

I would be willing to make it the same as the 're' module if the general consensus is that it should be.

@mrabarnett
Owner Author

Original comment by Anonymous.


Thanks for the confirmation; I was just a bit surprised to see different results in a script (using re) and in my general app (normally using regex), where I didn't expect a difference between these re engines.

I am happy with either behaviour. The (?u) can simply be added if needed and is more explicit; on the other hand, the unicode flag is global and cannot be switched off, so if one needed a unicode string pattern with special sequences interpreted as ASCII, [a-zA-Z0-9_] would be necessary instead of \w (if I understand correctly).

That being said, I have no strong personal preference, now that it is documented. It would depend on the policy for inclusion into the standard library (e.g. whether to include this behaviour in the NEW flag).
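The character-class workaround mentioned above can be sketched as follows. This assumes Python 3 and the standard-library `re` module, where str patterns match as Unicode by default; it does not use `regex` itself, and the sample string and variable names are made up for the example. The explicit class is compared with `\w` under the ASCII flag to show that both select the same characters here.

```python
import re

text = "aáb_1"

# Unicode-aware \w (the Python 3 default for str patterns) also matches "á".
unicode_word_chars = re.findall(r"\w", text)

# An explicit ASCII class ignores "á" regardless of which flags are active.
explicit_class = re.findall(r"[a-zA-Z0-9_]", text)

# \w restricted by the ASCII flag selects the same characters as the class.
ascii_word_chars = re.findall(r"\w", text, flags=re.ASCII)

print(unicode_word_chars)  # ['a', 'á', 'b', '_', '1']
print(explicit_class)      # ['a', 'b', '_', '1']
print(ascii_word_chars)    # ['a', 'b', '_', '1']
```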

vbr

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


Documented.
