= for fuzzy matches #41

mrabarnett · 2012-01-17T04:49:29Z

= operator could be pretty handy for fuzzy matches, finding only erroneous text. For example, in a list of hotmail email accounts, you could search for misspells like '@(hotmail.com){e=1}'. This will save the user an extra "grep -v" for filtering out correct emails in the list of matches.

mrabarnett · 2012-01-17T04:53:06Z

Original comment by Anonymous.

or even something like this would be more powerful:

'@(hotmail\.com){0<e<3}'

matching only texts with 1 or 2 errors

mrabarnett · 2012-01-17T19:37:27Z

Original comment by Anonymous.

A regex such as:

@(hotmail\.com){e=1}

can be we written as:

@(?>(hotmail\.com){e<=1})(?<!@hotmail\.com)

although that's not quite as convenient, I admit! :-)

Note that the fuzzy part needs to be in an atomic group in order to stop it backtracking to find a worse match. For example, given the string "@hotmail.comb", the fuzzy part will match "@hotmail.com" with 0 errors, then the negative look-behind will reject it, so the fuzzy part will match "@hotmail.comb" with 1 error.

I'm not sure how easy it'll be to add a lower limit; such a problem could still occur.

mrabarnett · 2012-01-18T10:25:33Z

Original comment by Anonymous.

I think I've figured out how to do it, but how much demand is for it? You gave an example, but is that a real use case?

mrabarnett · 2012-01-18T11:55:55Z

Original comment by Anonymous.

I am fixing tags for 25k+ text documents for a web site, so I do have a real (different) use case. That was just an example. But I think it would be a really nice feature for regex module...

mrabarnett · 2012-01-18T12:06:26Z

Original comment by Anonymous.

Could you provide a few test cases?

mrabarnett · 2012-01-18T12:21:36Z

Original comment by Anonymous.

here is a real example translated into english

3 servic detection
1 service detect
5 service detecti
46 service detection
1 in service detection

The site has manually entered tags, and their frequencies from 25k+ (non-english) text documents. Most of the time the correct one has a high frequency, and anything that is close enough to a correct one (except itself) should probably get fixed..

mrabarnett · 2012-01-18T13:26:40Z

Original comment by Anonymous.

What fuzzy regex would you use to match the incorrect strings in your example? Would it be this:

(?:^\d+ service detection$){1<=e<=3}

mrabarnett · 2012-01-18T13:33:56Z

Original comment by Anonymous.

no no, the first part is the frequency of a tag, not part of it. I would search a match with:

r = compile(r'(?:service detection){0<e<5}')
m = r.match(str)
if m:

mrabarnett · 2012-01-18T13:47:43Z

Original comment by Anonymous.

or '(?:service detection){0<e<5}$' is also a possibility..

mrabarnett · 2012-01-18T19:54:53Z

Original comment by Anonymous.

Added in regex 0.1.20120119.

Note that it supports only constraints of the form e<=3 or 1<=e<=3 ("<" is also allowed), but not "=".

mrabarnett · 2012-01-19T00:31:45Z

Original comment by Anonymous.

thanks ^_^

mrabarnett closed this as completed Jan 19, 2012

aaronlippold mentioned this issue Dec 16, 2023

Use FuzzyMatching and Regex to collect the Vendor/Company of a Benchmark form the Title - and perhaps we can also get a fuzzy match on Product Name and Version mitre/inspec-profile-update-action#12

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

= for fuzzy matches #41

= for fuzzy matches #41

mrabarnett commented Jan 17, 2012

mrabarnett commented Jan 17, 2012

mrabarnett commented Jan 17, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 19, 2012

= for fuzzy matches #41

= for fuzzy matches #41

Comments

mrabarnett commented Jan 17, 2012

mrabarnett commented Jan 17, 2012

mrabarnett commented Jan 17, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 18, 2012

mrabarnett commented Jan 19, 2012