re.IGNORECASE does not match literal "_" (underscore) #56156

Closed
RobM mannequin opened this issue Apr 28, 2011 · 6 comments
Labels
topic-regex type-bug An unexpected behavior, bug, or error

Comments

@RobM
Mannequin

RobM mannequin commented Apr 28, 2011

BPO 11947
Nosy @pitrou, @ezio-melotti
Superseder
  • bpo-11957: re.sub confusion between count and flags args
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2011-04-28.20:49:19.294>
    created_at = <Date 2011-04-28.17:02:55.958>
    labels = ['expert-regex', 'type-bug']
    title = 're.IGNORECASE does not match literal "_" (underscore)'
    updated_at = <Date 2014-10-29.16:15:11.230>
    user = 'https://bugs.python.org/RobM'

    bugs.python.org fields:

    activity = <Date 2014-10-29.16:15:11.230>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2011-04-28.20:49:19.294>
    closer = 'ezio.melotti'
    components = ['Regular Expressions']
    creation = <Date 2011-04-28.17:02:55.958>
    creator = 'RobM'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 11947
    keywords = []
    message_count = 6.0
    messages = ['134700', '134716', '134717', '134723', '134752', '134831']
    nosy_count = 5.0
    nosy_names = ['effbot', 'pitrou', 'ezio.melotti', 'mrabarnett', 'RobM']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '11957'
    type = 'behavior'
    url = 'https://bugs.python.org/issue11947'
    versions = ['Python 2.6']

    @RobM
    Mannequin Author

    RobM mannequin commented Apr 28, 2011

    Regular expressions which are written to match literal underscores ("_", ASCII
    ordinal 95) and which specify re.IGNORECASE during compilation do not consistently
    match underscores: some occurrences are matched, but others are not.

    The following session log shows the problem:

        Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) 
        [GCC 4.4.3] on linux2
        Type "help", "copyright", "credits" or "license" for more information.
        >>> import re
        >>> subject = "[Conclave-Mendoi]_ef_-_a_tale_of_memories_00-12_H264"
        >>> print subject.encode("base64")  # In case my environment encoding is to blame
        W0NvbmNsYXZlLU1lbmRvaV1fZWZfLV9hX3RhbGVfb2ZfbWVtb3JpZXNfMDAtMTJfSDI2NA==
    
        >>> re.sub("_", "X", subject)  # No flags, does what I expect
        '[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
        >>> 
        >>> re.sub("_", "X", subject, re.IGNORECASE)  # Misses some matches
        '[Conclave-Mendoi]XefX-_a_tale_of_memories_00-12_H264'
        >>> 
        >>> re.sub("_", "X", subject, re.IGNORECASE | re.LOCALE)  # Misses fewer matches
        '[Conclave-Mendoi]XefX-XaXtaleXofXmemories_00-12_H264'
        >>> 
        >>> re.sub("_", "X", subject, re.IGNORECASE | re.LOCALE | re.UNICODE)  # Works OK
        '[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
        >>> 
        >>> re.sub("_", "X", subject, re.IGNORECASE | re.UNICODE) # Works OK
        '[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
        >>> 
        >>> type(subject)  # Don't think this is a unicode string
        <type 'str'>
        >>> 

    Since my subject variable is of type str and only contains ASCII characters
    I do not believe that the re.UNICODE flag should be required.

    @RobM RobM mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Apr 28, 2011
    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Apr 28, 2011

    help(re.sub) says:

        sub(pattern, repl, string, count=0)

    and re.IGNORECASE has a value of 2.

    Therefore this:

        re.sub("_", "X", subject, re.IGNORECASE)

    is telling it to replace at most 2 occurrences of "_".
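The explanation above can be reproduced with a short sketch. It assumes a modern Python 3 interpreter, where `re.sub` also accepts a `flags` keyword (the Python 2.6 signature quoted above has none). The variable names are illustrative, and `count=int(re.IGNORECASE)` is used to spell out what the positional call in the report amounts to.

```python
import re

subject = "[Conclave-Mendoi]_ef_-_a_tale_of_memories_00-12_H264"

# re.IGNORECASE has the integer value 2, so passing it as the fourth
# positional argument sets count=2 rather than enabling case-insensitivity.
limited = re.sub("_", "X", subject, count=int(re.IGNORECASE))

# Passing the flag by keyword leaves count at its default of 0 (replace all).
by_keyword = re.sub("_", "X", subject, flags=re.IGNORECASE)

# Compiling the pattern with the flag also works, including on versions
# whose re.sub has no flags parameter.
compiled = re.compile("_", re.IGNORECASE).sub("X", subject)

print(limited)
print(by_keyword)
print(compiled)
```

The first result stops after two replacements, matching the "Misses some matches" line in the session log; the other two replace every underscore.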

    @ezio-melotti
    Member

    Closing as invalid.
    I wonder if it would be better to have count as a keyword-only argument though, since this problem seems to come up pretty often and it's not easy to debug.
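As a rough illustration of the keyword-only idea, here is a minimal sketch of a wrapper, assuming Python 3 syntax. The name `sub_kwonly` is hypothetical and is not part of the `re` module; it only shows how a bare `*` in the signature would turn the mistaken call into an immediate `TypeError`.

```python
import re

def sub_kwonly(pattern, repl, string, *, count=0, flags=0):
    """Hypothetical wrapper: count and flags must be passed by keyword."""
    return re.sub(pattern, repl, string, count=count, flags=flags)

subject = "a_b_c_D"

# Keyword use behaves exactly like re.sub.
all_replaced = sub_kwonly("_", "X", subject, flags=re.IGNORECASE)
first_only = sub_kwonly("_", "X", subject, count=1)

# A flag passed positionally is rejected instead of being read as a count.
try:
    sub_kwonly("_", "X", subject, re.IGNORECASE)
except TypeError:
    rejected = True
else:
    rejected = False
```

The trade-off discussed in the thread still applies: existing callers that pass `count` positionally would break under such a signature.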

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Apr 28, 2011

    I don't know how much code that might break. It might not be that much; I can't remember when I last used re.sub without the default count.

    @RobM
    Mannequin Author

    RobM mannequin commented Apr 29, 2011

    Oh, that's embarrassing. :-)

    Could a type-check be used to alert the user to their mistake? I suppose that would require re.IGNORECASE (et al) to be of some new type (presumably subclassed from int).

    (Thanks for the quick response, and sorry to waste your time)
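The type-check idea raised in this comment can be sketched as follows. This assumes Python 3.6 or later, where the flag constants are instances of `re.RegexFlag` (an integer-based enum), which did not exist in the Python 2.6 used in the report. The wrapper name `checked_sub` is hypothetical and not part of the standard library.

```python
import re

def checked_sub(pattern, repl, string, count=0, flags=0):
    """Hypothetical wrapper that rejects a regex flag passed as count."""
    # Flag constants are RegexFlag instances; plain integers are not.
    if isinstance(count, re.RegexFlag):
        raise TypeError("count looks like a regex flag; pass it as flags=...")
    return re.sub(pattern, repl, string, count=count, flags=flags)

# A plain integer count is accepted as usual.
one_replaced = checked_sub("_", "X", "a_b_c", 1)

# A flag in the count position is caught by the type check.
try:
    checked_sub("_", "X", "a_b_c", re.IGNORECASE)
except TypeError:
    caught = True
else:
    caught = False
```

A check like this only catches flag constants; a literal `2` passed as the count is indistinguishable from an intentional limit.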

    @ezio-melotti
    Member

    See also bpo-11957.
