approximate matching -- feature request #12
Comments
Original comment by Anonymous. |
Original comment by Anonymous. I'm not entirely sure how easy this would be, although I have come up with a few ideas. Could you provide me with some test data, including regex(es)? |
Original comment by Anonymous. The code for TRE is under a "two-clause BSD licence", whatever that means. But the approximate matching code is in a single C file there. Could it be lifted? |
Original comment by Anonymous. There are a lot of tests as part of TRE; would they be enough? Assuming you use the TRE syntax. Bill |
Original comment by Anonymous. I took the tre test code and turned it into Python. You can adapt it a bit to regex. Bill |
Original comment by Anonymous. I've now re-cast the tests as a function you could drop into test_regex.py. If the "tre" module is installed, it will use that; if not, it will try to use regex instead to run the tests. |
Original comment by Anonymous. Whoops -- found a bug (by inspection) in the regex-test code branch. Updated. |
Original comment by Anonymous. What do you think about the notation in the {...}? Is it clear? One idiosyncrasy I don't like is that in this:
the maximum cost is 1, but it uses "<", not "<=". On another issue, it looks like a pattern which is going to be used for approximate matching would need to be specially marked (a flag and/or presence of {...} notation) because certain tricks to improve the speed of exact matching aren't possible for approximate matching, and I don't think the user would want to lose those speedups when most matching is exact. Another possibility is for an unmarked pattern to be recompiled automatically for approximate matching on demand. |
Original comment by Anonymous.
Good point. Since it was developed as a C library, I suspect there wasn't a lot of Pythonic thinking involved.
I guess the question is, can we improve on it? I've been using TRE for a bit, so it's clear to me, but perhaps there is a better way of saying these things.
Yep. That seems appropriate to me. Presence of {...} notation would seem to be enough. Bill |
Original comment by Anonymous. I discovered that it is using "<" correctly. I've decided to allow both "<" and "<=", with their usual meanings. This is TRE: "(foobar){+1 -2 #3, 2d + 1s < 4}" My latest build passes the tests, although it isn't as fast at approximate matching, and it's too prone to catastrophic backtracking. I'm going to see whether I can adapt the existing safeguards used with repeats. |
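A minimal sketch of the notation under discussion, assuming the fuzzy syntax the regex module settled on; the pattern and sample strings are illustrative only, not taken from the TRE test data.

```python
# Requires the third-party regex module (pip install regex).
import regex

# At most one error of any kind (insertion, deletion or substitution).
print(regex.search(r"(?:foobar){e<=1}", "fooxbar"))

# A cost equation in the spirit of the TRE example above: each deletion
# costs 2, each substitution costs 1, and the total cost must stay below 4.
# Both "<" and "<=" are accepted, as described in the comment above.
print(regex.search(r"(?:foobar){2d+1s<4}", "fuubar"))
```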
Original comment by Anonymous. The regex version certainly seems easier to read... a good Pythonic quality :-). |
Original comment by Anonymous. After making some modifications and adding a simple heuristic, I've got fuzzy matching to work acceptably, but I'd be interested in any further test cases I can try which are closer to real-world use-cases before the next release. Restriction: fuzzy matching doesn't work on group references or named lists (but that's not to say that they never will). |
Original comment by Anonymous. I downloaded the latest version from PyPI (regex-0.1.20110623a), but I get an error when I try to use it:
|
Original comment by Anonymous. Fuzzy matching wasn't in that release. You're the only person to have reported a problem with it! (Definitely a bug, though...) New release (regex-0.1.20110627), with fuzzy matching, now on PyPI. If all goes well, I'll be adding fuzzy named lists and group references in the next release. |
Original comment by Anonymous. Excellent. OK, let me try it out, and I'll see if I can come up with more test cases. This kind of thing is heavily used in text mining applications over OCR'd document sets. Mainly legal and medical, as far as I know. The legal use case is scanning a huge discovery collection for specific terms of art, and for specific named parties (which is where fuzzy named lists would come in). Unfortunately, it's hard to get real-world data to try things on, because legal and medical documents are typically privileged info. |
Original comment by Anonymous. I have a number of lists of real-world addresses obtained via OCR. I'm trying to figure out how to turn them into a test for the fuzzy matching. Any ideas? |
Original comment by Anonymous. No, no ideas. BTW, I'm currently testing fuzzy named lists and group references for the next release. |
Original comment by Anonymous. The new release (regex-0.1.20110702) supports fuzzy named lists and group references. |
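A minimal sketch of a fuzzy named list along the lines of the OCR use case mentioned above; the city names and error limit are invented for illustration.

```python
import regex

# Hypothetical list of expected city names; the named list is supplied as a
# keyword argument and referenced with \L<name>, with one error allowed per name.
cities = ["Springfield", "Shelbyville", "Ogdenville"]
pattern = regex.compile(r"\b(?:\L<cities>){e<=1}\b", cities=cities)

# OCR-style text with one error in two of the names.
print(pattern.findall("Springfeld and Ogdanville, near Shelbyville"))
```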
Original comment by Anonymous. I'm back. One of the uses for fuzzy named lists is finding specific cities, streets, etc. in OCR'd lists of addresses. I'll whomp up a test case. |
Original comment by Anonymous. I just tried this interesting feature and would like to ask about the handling of the fuzzy patterns.
Probably the presence of a substring matching the pattern exactly limits the matching of the others (those preceding it?); cf. the following with the above examples:
It appears that towards the end of the string there is sometimes more chance to match (even within the same substring - consistent with the above).
Would it be possible to specify the matching resolution in more detail in order to make the fuzziness more verifiable/predictable? Many thanks for the continuing enhancements; now with the Unicode properties and fuzzy matching, I believe, recursion and code embedding or maybe case conversion on sub() are probably the only remaining unsupported features present in some other implementations :-) - not that I would want to use them too often (maybe except for the last one...). Regards, |
Original comment by Anonymous. Fuzzy matching applies to the preceding item in the same way as the quantifiers, so "F{e}" matches "F" with any number of errors permitted, which means it could match anything. However, it will always try to find the best match, the match with the lowest cost. regex.search("F{e}", "abFcd") will match "F" because that's the best match. regex.findall performs multiple searches, each starting where the previous match finished, so with regex.findall("F{e}", "abFcd"):
Surprising? Possibly. But the TRE regex engine finds the best match, and this implementation copies that behaviour. An alternative behaviour would be to match anything whose cost is within the limits, but that can have surprising results too. Given the regex "(cat){e<=1}" and the string "dog cat", it would find " cat" because " cat" comes before "cat" and has only 1 error. If regex.findall had the alternative behaviour, then regex.search would have it too, but usually you want to find the best match, so returning " cat" when "cat" is better would also be surprising. |
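A minimal sketch of the two behaviours being compared, assuming the BESTMATCH flag that a later comment in this thread introduces; which result you get therefore depends on the release.

```python
import regex

# Best-match semantics: "F{e}" prefers the zero-cost match on the literal "F".
print(regex.search(r"F{e}", "abFcd", flags=regex.BESTMATCH))

# First-match semantics (the later default): the search stops at the first
# position where the constraint can be satisfied, which for this text is the
# leading-space " cat" rather than the exact "cat", as described above.
print(regex.search(r"(?:cat){e<=1}", "dog cat"))
```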
Original comment by Anonymous. Thanks for the clarification; it was indeed kind of surprising to me that equally inexact elements were matched (or not) depending on their position with respect to the best match, and I think the alternative behaviour would be less surprising (to me), but if it is the behaviour of other engines supporting this feature, it might be better to stick with it. vbr |
Original comment by Anonymous. Just another thought: given the notes about OCR checks and similar, and given the context-dependent behaviour - is it somehow unexpected or discouraged to use fuzzy matching with findall() or finditer(), since search() is supposed to give the first "optimal" result? (In the mentioned OCR use case, one would probably test separate words against dictionary items rather than within a larger text.) Would it eventually be possible to match anything within the given constraints, also checking the overlaps, and maybe also expose the results in ascending order of "cost"? (Maybe a bit similar to the "overlapped" parameter or match.captures(...), which also expose rather "internal" steps of the matching.) But anyway, if the alternatives also have drawbacks (like performance etc.) and would differ from other implementations, it's probably better to stick with the usual way. Regards, |
Original comment by Anonymous. (No comment was entered for this change.) |
Original comment by Anonymous. Sorry for another question regarding fuzzy matching; it seems I'm still missing something about this feature. However, I was surprised by another issue - some matches I'd have expected are also missing, even after the best match, and supposedly complying with the error cost:
(with my custom findall function I can also get the "words" preceding "ad":
but there are still others missing; the following pattern only deals with alternation, as insertions and deletions aren't relevant in this sample, given the word boundaries:
Is there maybe some left-to-right bias in the algorithm, or am I missing something else? Possibly related to this, how are the empty strings matched? E.g.:
why are there no empty matches between the characters (all of them cost one deletion), but only the last one? Thanks in advance, |
Original comment by Anonymous. The regex r"\bad{e<=1}\b" is:
In the first example, the best match in the text is "ad", and the best match in the remainder is "ae". There are no further matches which meet the constraint. It always looks for the best match in (the remainder of) the text. I think I'll have to change the current behaviour (it seems to be too confusing) and add a BEST flag for when you want the best match, unless Bill (the original requester) or anyone else has an objection. To summarise the differences: Example 1:
Current: ['ad', 'ae'] Example 2:
Current: ['b', 'c', 'd', ''] |
Original comment by Anonymous. Thanks for the detailed answer; now I notice that a part of my surprise was obviously due to the missing parentheses, i.e. I meant:
to be roughly equivalent to:
(given the text and the pattern; with the proposed change to also match before the "optimal" match). My naive approach is something like the following (sorry for the style and coding - there must be some more elegant ways to deal with the duplication of the function arguments etc...)
Using this function I get e.g.
but
I actually had expected empty matches between all the single characters
or is it again some misunderstanding on my part? Of course, native behaviour in the regex engine would be much preferred to such a fragile wrapper function (also because of the mentioned anchors, lookarounds etc.). Many thanks for your efforts, |
Original comment by Anonymous. After the proposed change:
In the second example:
Substitution is tried first, then insertion, then deletion. |
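One way to see which kind of error a fuzzy match used is to allow only a single error type at a time; a minimal sketch with made-up patterns and strings.

```python
import regex

print(regex.search(r"(?:cat){s<=1}", "cot"))   # one substitution allowed
print(regex.search(r"(?:cat){i<=1}", "cart"))  # one insertion (extra character in the text)
print(regex.search(r"(?:cat){d<=1}", "ct"))    # one deletion (character missing from the text)
```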
Original comment by Anonymous. Thanks, the proposed behaviour looks really promising and (at least for me) much more consistent. Regards, |
Original comment by Anonymous. The regex module now looks for the first match. Use the BESTMATCH flag to look for the best match. |
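A minimal sketch contrasting the two modes; the pattern and text are illustrative only.

```python
import regex

text = "dog cat"
# Default: take the first match that satisfies the error constraint.
print(regex.findall(r"(?:cat){e<=1}", text))
# BESTMATCH: look for the match with the lowest error cost instead.
print(regex.findall(r"(?:cat){e<=1}", text, flags=regex.BESTMATCH))
```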
Original comment by Anonymous. Lately I have used fuzzy matching for some error checking and it proved to be very helpful; Comment 28 above clarifies the order in which the error types are checked: "Substitution is tried first, then insertion, then deletion." Now I would be especially interested in how the segmentation of the matches is designed when a large or unlimited number of errors is allowed. It is likely that the matches without any error are taken as a base, which are then tried for the variance defined in the cost pattern. Is it the case that the match possibilities are determined first and the allowed errors don't change/shift these "boundaries" for possible matches? It seems that e.g. a quantifier allowing possibly longer matches somehow "eats up" the possibilities for checking within the error cost:
Does the algorithm work this way, or am I misinterpreting these simplified results? Just another thing I observed - without parentheses (which seem to be preferable in most cases anyway) - the cost pattern between {} seems to apply to the preceding "literal" only, e.g. also excluding quantifiers (?)
As an interesting and probably useless corner case I found an empty pattern which matches the empty string at character boundaries without the cost pattern, and nothing with any errors allowed
However, a similar zero-width pattern matches identically:
This is meant purely as a question about the matching behaviour; in practice I am very happy with the present functionality; I just feel that I could use it even better with an appropriate understanding of the "segmentation" of the possible matches tried. I would simply like to understand how many "erroneous" cases I can match with this feature (not just in the extreme case of {e}, which would conceptually match anything, but obviously doesn't). Many thanks in advance. |
Original comment by Anonymous. As it tries to match, if a part of the pattern fails to match, for example, "a" doesn't match "b", it assumes that there's an error, whether a substitution, insertion or deletion, and then continues. It tries each type of error in turn, checking whether the error constraints allow it. The error constraint follows the same syntactic rule as the quantifiers, applying to the preceding item. If there's no preceding item, it's treated as a literal, for example, r"{e}" is treated the same as r"\{e}":
I do note that the quantifiers raise an exception if there's no preceding item, so it could be argued that the error constraint should do the same. |
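A minimal sketch of the scoping rule described above; the patterns are illustrative only.

```python
import regex

# Here {e<=1} applies only to the final "d", not to the whole word.
print(regex.search(r"\bad{e<=1}\b", "az"))

# Grouping makes the constraint cover the whole word.
print(regex.search(r"\b(?:ad){e<=1}\b", "az"))

# With no preceding item, "{e}" is treated as the literal text "{e}".
print(regex.search(r"{e}", "a{e}b"))
```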
Original comment by Anonymous. Thanks for the prompt answer; however, how do quantifiers and error costs work together? It now seems to me that the two should be combined rather carefully for the same subpattern (or is it discouraged altogether?). In some elementary patterns (below) it seems that the quantifier somehow hinders matching a (suboptimal) pattern (likely satisfying the general constraint {e}), which is found if equivalent simple literals are used, cf. e.g. (Q{1,2}){e} vs (Q){e}, (QQ){e} ...
I see that such patterns are not very realistic nor useful, but I feel I should understand the matching possibilities better. Regards, |
Original comment by Anonymous. That's a bug. It optimises repeats of single-character matches (a character, a set, or a property). Unfortunately, that doesn't work as you'd expect when using fuzzy matching. :-( Fixed in regex 0.1.20111004. |
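A minimal sketch along the lines of the patterns quoted above, for reproducing the fix; the text is made up, and results on releases before 0.1.20111004 may differ because of the bug.

```python
import regex

text = "Qx"
# A repeated single character inside a fuzzy group: before the fix, the
# single-character repeat optimisation made this behave differently from the
# literal spellings below.
print(regex.search(r"(Q{1,2}){e<=1}", text))
print(regex.search(r"(Q){e<=1}", text))
print(regex.search(r"(QQ){e<=1}", text))
```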
Original comment by Anonymous. Thank you very much; in the updated version (regex-0.1.20111004) the fuzzy matching is a bit more predictable for me,
I'd think that all of the "words" would match with at most one error; however, the more difficult aspect is the different handling of "abc" vs. "dba", which both need one substitution to match. Is it a peculiarity of the fuzzy matching or of the word-beginning/end anchors? vbr |
Original comment by Anonymous. When it finds a fuzzy match it tries to see whether it can get a better fit (a match with a lower cost) within that match. The initial match for:
is actually "abc cde dba" (the cost is 2), but a better fit is "abc cde" (the cost is 1), and an even better fit is "cde" (the cost is 0). This results in the final output of ['cde', 'dba']. If it didn't look for a better fit the final output would be ["abc cde dba"]. Let me give you another example:
If I change your example slightly:
And that's my justification. :-) |
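A minimal sketch of the "better fit" refinement described above, using the same sample text; the pattern is invented, and in later releases this refinement is tied to flags such as ENHANCEMATCH or BESTMATCH rather than always being applied, so the output depends on the version.

```python
import regex

text = "abc cde dba"
# Without refinement, a fuzzy match may cover more text than needed; the
# refinement shrinks it towards the lowest-cost fit (ideally the exact "cde").
print(regex.findall(r"(?:cde){e<=1}", text))
print(regex.findall(r"(?:cde){e<=1}", text, flags=regex.ENHANCEMATCH))
```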
Original comment by Anonymous. Thanks for the explanation; the recursive readjusting of the match towards the lowest error cost indeed makes most of the cases that weren't very clear to me obvious. Now I feel I may even grasp the fuzzy matching rules - sooner or later... Thanks and regards, |
Original report by Anonymous.
I'm currently using the TRE regex engine to match output from OCR, because it supports approximate matching. Very useful. Would be nice to have that capability in Python regex, as well.