Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Edit distance of fuzzy match #109

Closed
mrabarnett opened this issue Mar 28, 2014 · 8 comments
Closed

Edit distance of fuzzy match #109

mrabarnett opened this issue Mar 28, 2014 · 8 comments
Labels
enhancement New feature or request trivial

Comments

@mrabarnett
Copy link
Owner

Original report by Anonymous.


I'm using Regex for fuzzy match, and the application needs to compute the exact edit distance between the regex and the matched string. I can't find this feature in the document. It would be very useful to include the exact number of insertions, replacements and deletions in the returned match object.

Thank you for providing such powerful regex library!

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


You'll need to bear in mind that it would report the total number of substitutions, insertions and deletions that it did in order for it to match, which would not necessarily be the minimum.

For example, when trying this:

>>> match(r'(?:cats|cat){e<=1}', 'cat')
<regex.Match object; span=(0, 3), match='cat', fuzzy_counts=(0, 0, 1)>

it would report that it matches with one deletion.

This is because, when faced with alternatives, it tries them in sequence, so it will try to match "cats" first, which it can do with one deletion (the "s").

It doesn't try every possibility by default (that could take a long time!), so it won't necessarily find the minimum edit distance.

However, you could use the ENHANCEMATCH or BESTMATCH flag to look for a better fit:

>>> match(r'(?e)(?:cats|cat){e<=1}', 'cat')
<regex.Match object; span=(0, 3), match='cat'>

I'm not sure what the best arrangement would be on the match object. A single attribute called, say, 'fuzzy_counts' as above? Three separate attributes called 'substitutions', 'insertions' and 'deletions'?

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


I think the fuzzy_count attribute in the example will suffice, thanks!

Could you explain a little about the difference of the oridinary, enhancematch and best match? Do they use different search algorithms? It's better to give the running complexity in the documentation.

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


The feature isn't available at the moment, but should be in the next release. I'm right in the middle of another issue at the moment.

There's a brief explanation of the ENHANCEMATCH and BESTMATCH flags in the "Flags" section here:

https://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


Thanks for your efforts!

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


Added in regex 2014.04.10.

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


Hi! Could I check if this is expected behavior or if I'm misunderstanding something?

  1. fuzzy_counts don't include substitutions when ENHANCEMATCH or BESTMATCH is on

    re.match(r'(?:cats|cat){e<=1}', 'caz')
    <regex.Match object; span=(0, 3), match='caz', fuzzy_counts=(1, 0, 0)>

    re.match(r'(?e)(?:cats|cat){e<=1}', 'caz')
    <regex.Match object; span=(0, 3), match='caz'>

    re.match(r'(?b)(?:cat){e<=1}', 'caz')
    <regex.Match object; span=(0, 3), match='caz'>

  2. Insertions are not always included in fuzzy_counts

    re.match(r'(?:cats){e<=2}', 'c ats')
    <regex.Match object; span=(0, 5), match='c ats', fuzzy_counts=(1, 1, 0)>
    #I understand the edit distance may not be minimum, assume it is matching 'c ats' to '-cats' or 'c-ats' or some such thing?

    re.match(r'(?b)(?:cats){e<=2}', 'c ats') #minimum edit distance
    <regex.Match object; span=(0, 5), match='c ats', fuzzy_counts=(0, 1, 0)>

    re.match(r'(?b)(?:cats){e<=2}', 'c a ts') #why edit distance 0?
    <regex.Match object; span=(0, 6), match='c a ts'>

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


Really you should've created a new issue for a bug because this issue is for an enhancement that has been completed.

It is a bug, though.

@mrabarnett
Copy link
Owner Author

Original comment by Anonymous.


Fixed in regex-2014.05.23.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request trivial
Projects
None yet
Development

No branches or pull requests

1 participant