= for fuzzy matches #41
Comments
Original comment by Anonymous. Or even something like this would be more powerful:
matching only texts with 1 or 2 errors.
Original comment by Anonymous. A regex such as:
can be written as:
although that's not quite as convenient, I admit! :-) Note that the fuzzy part needs to be in an atomic group in order to stop it backtracking to find a worse match. For example, given the string "@hotmail.comb", the fuzzy part will match "@hotmail.com" with 0 errors, then the negative look-behind will reject it, so the fuzzy part will match "@hotmail.comb" with 1 error. I'm not sure how easy it'll be to add a lower limit; such a problem could still occur. |
Original comment by Anonymous. I think I've figured out how to do it, but how much demand is there for it? You gave an example, but is that a real use case?
Original comment by Anonymous. I am fixing tags for 25k+ text documents for a web site, so I do have a real (different) use case. That was just an example. But I think it would be a really nice feature for the regex module.
Original comment by Anonymous. Could you provide a few test cases? |
Original comment by Anonymous. Here is a real example, translated into English:
The site has manually entered tags and their frequencies, taken from 25k+ (non-English) text documents. Most of the time the correct tag has a high frequency, and anything that is close enough to a correct tag (except the correct tag itself) should probably be fixed.
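The tag-fixing idea described above can be sketched without any fuzzy regex at all. The following is a minimal illustration in plain Python, not the regex module's implementation: it compares whole tags by Levenshtein distance and keeps only candidates with at least one error, which is what a lower bound on `e` would express. The function names, the thresholds, and the sample tags are all hypothetical and made up for this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # delete ca
                current[j - 1] + 1,             # insert cb
                previous[j - 1] + (ca != cb),   # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def suggest_fixes(tag_counts, min_errors=1, max_errors=2, min_frequency=10):
    """Map each rare tag to the closest frequent tag within the error bounds.

    min_frequency, min_errors and max_errors are illustrative defaults only.
    """
    trusted = [tag for tag, count in tag_counts.items() if count >= min_frequency]
    fixes = {}
    for tag, count in tag_counts.items():
        if count >= min_frequency:
            continue  # frequent tags are assumed to be correct already
        candidates = [(edit_distance(tag, good), good) for good in trusted]
        # The lower bound drops exact matches; the upper bound drops unrelated tags.
        candidates = [(d, good) for d, good in candidates
                      if min_errors <= d <= max_errors]
        if candidates:
            fixes[tag] = min(candidates)[1]
    return fixes


# Hypothetical data: tag -> frequency (not taken from the real site).
tag_counts = {
    "service detection": 120,
    "servce detection": 2,
    "service detektion": 1,
    "network scan": 45,
}
print(suggest_fixes(tag_counts))
```

Unlike a fuzzy regex search, this compares complete strings only, so it would not find a misspelled tag embedded in a longer line.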
Original comment by Anonymous. What fuzzy regex would you use to match the incorrect strings in your example? Would it be this:
Original comment by Anonymous. No, the first part is the frequency of a tag, not part of the tag itself. I would search for a match with:
Original comment by Anonymous. Or '(?:service detection){0<e<5}$' is also a possibility.
Original comment by Anonymous. Added in regex 0.1.20120119. Note that it supports only constraints of the form e<=3 or 1<=e<=3 ("<" is also allowed), but not "=". |
Original comment by Anonymous. thanks ^_^ |
Original report by Anonymous.
An = operator could be pretty handy for fuzzy matches, to find only erroneous text. For example, in a list of Hotmail email accounts, you could search for misspellings with '@(hotmail.com){e=1}'. This would save the user an extra "grep -v" to filter the correct emails out of the list of matches.