Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

= for fuzzy matches #41

Closed
mrabarnett opened this issue Jan 17, 2012 · 11 comments
Closed

= for fuzzy matches #41

mrabarnett opened this issue Jan 17, 2012 · 11 comments
Labels
enhancement New feature or request trivial

Comments

@mrabarnett
Copy link
Owner

Original report by Anonymous.


= operator could be pretty handy for fuzzy matches, finding only erroneous text. For example, in a list of hotmail email accounts, you could search for misspells like '@(hotmail.com){e=1}'. This will save the user an extra "grep -v" for filtering out correct emails in the list of matches.

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


or even something like this would be more powerful:

'@(hotmail\.com){0<e<3}'

matching only texts with 1 or 2 errors

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


A regex such as:

@(hotmail\.com){e=1}

can be we written as:

@(?>(hotmail\.com){e<=1})(?<!@hotmail\.com)

although that's not quite as convenient, I admit! :-)

Note that the fuzzy part needs to be in an atomic group in order to stop it backtracking to find a worse match. For example, given the string "@hotmail.comb", the fuzzy part will match "@hotmail.com" with 0 errors, then the negative look-behind will reject it, so the fuzzy part will match "@hotmail.comb" with 1 error.

I'm not sure how easy it'll be to add a lower limit; such a problem could still occur.

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


I think I've figured out how to do it, but how much demand is for it? You gave an example, but is that a real use case?

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


I am fixing tags for 25k+ text documents for a web site, so I do have a real (different) use case. That was just an example. But I think it would be a really nice feature for regex module...

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


Could you provide a few test cases?

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


here is a real example translated into english

3 servic detection
1 service detect
5 service detecti
46 service detection
1 in service detection

The site has manually entered tags, and their frequencies from 25k+ (non-english) text documents. Most of the time the correct one has a high frequency, and anything that is close enough to a correct one (except itself) should probably get fixed..

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


What fuzzy regex would you use to match the incorrect strings in your example? Would it be this:

(?:^\d+ service detection$){1<=e<=3}

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


no no, the first part is the frequency of a tag, not part of it. I would search a match with:

r = compile(r'(?:service detection){0<e<5}')
m = r.match(str)
if m:

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


or '(?:service detection){0<e<5}$' is also a possibility..

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


Added in regex 0.1.20120119.

Note that it supports only constraints of the form e<=3 or 1<=e<=3 ("<" is also allowed), but not "=".

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


thanks ^_^

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request trivial
Projects
None yet
Development

No branches or pull requests

1 participant