approximate matching -- feature request #12
Comments
Original comment by Anonymous. |
Original comment by Anonymous. I'm not entirely sure how easy this would be, although I have come up with a few ideas. Could you provide me with some test data, including regex(es)? |
Original comment by Anonymous. The code for TRE is under a "two-clause BSD licence", whatever that means. But the approximate matching code is in a single C file there. Could it be lifted? |
Original comment by Anonymous. There are a lot of tests as part of TRE; would they be enough? Assuming you use the TRE syntax. Bill |
Original comment by Anonymous. I took the tre test code and turned it into Python. You can adapt it a bit to regex. Bill |
Original comment by Anonymous. I've now re-cast the tests as a function you could drop into test_regex.py. If the "tre" module is installed, it will use that; if not, it will try to use regex instead to run the tests. |
Original comment by Anonymous. Whoops -- found a bug (by inspection) in the regex-test code branch. Updated. |
Original comment by Anonymous. What do you think about the notation in the {...}? Is it clear? One idiosyncrasy I don't like is that in this:
the maximum cost is 1, but it uses "<", not "<=". On another issue, it looks like a pattern which is going to be used for approximate matching would need to be specially marked (a flag and/or presence of {...} notation) because certain tricks to improve the speed of exact matching aren't possible for approximate matching, and I don't think the user would want to lose those speedups when most matching is exact. Another possibility is for an unmarked pattern to be recompiled automatically for approximate matching on demand. |
Original comment by Anonymous.
Good point. Since it was developed as a C library, I suspect there wasn't a lot of Pythonic thinking involved.
I guess the question is, can we improve on it? I've been using TRE for a bit, so it's clear to me, but perhaps there is a better way of saying these things.
Yep. That seems appropriate to me. Presence of {...} notation would seem to be enough. Bill |
Original comment by Anonymous. I discovered that it is using "<" correctly. I've decided to allow both "<" and "<=", with their usual meanings. This is TRE: "(foobar){+1 -2 #3, 2d + 1s < 4}" My latest build passes the tests, although it isn't as fast at approximate matching, and it's too prone to catastrophic backtracking. I'm going to see whether I can adapt the existing safeguards used with repeats. |
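A minimal sketch of the notation under discussion, assuming the fuzzy syntax the regex module settled on; the pattern and sample strings are illustrative only, not taken from the TRE test data.

```python
# Requires the third-party regex module (pip install regex).
import regex

# At most one error of any kind (insertion, deletion or substitution).
print(regex.search(r"(?:foobar){e<=1}", "fooxbar"))

# A cost equation in the spirit of the TRE example above: each deletion
# costs 2, each substitution costs 1, and the total cost must stay below 4.
# Both "<" and "<=" are accepted, as described in the comment above.
print(regex.search(r"(?:foobar){2d+1s<4}", "fuubar"))
```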
Original comment by Anonymous. The regex version certainly seems easier to read... a good Pythonic quality :-). |
Original comment by Anonymous. After making some modifications and adding a simple heuristic, I've got fuzzy matching to work acceptably, but I'd be interested in any further test cases I can try which are closer to real-world use-cases before the next release. Restriction: fuzzy matching doesn't work on group references or named lists (but that's not to say that they never will). |
Original comment by Anonymous. I downloaded the latest version from PyPI (regex-0.1.20110623a), but I get an error when I try to use it:
|
Original comment by Anonymous. Fuzzy matching wasn't in that release. You're the only person to have reported a problem with it! (Definitely a bug, though...) New release (regex-0.1.20110627), with fuzzy matching, now on PyPI. If all goes well, I'll be adding fuzzy named lists and group references in the next release. |
Original comment by Anonymous. Excellent. OK, let me try it out, and I'll see if I can come up with more test cases. This kind of thing is heavily used in text mining applications over OCR'd document sets. Mainly legal and medical, as far as I know. The legal use case is scanning a huge discovery collection for specific terms of art, and for specific named parties (which is where fuzzy named lists would come in). Unfortunately, it's hard to get real-world data to try things on, because legal and medical documents are typically privileged info. |
Original comment by Anonymous. I have a number of lists of real-world addresses obtained via OCR. I'm trying to figure out how to turn them into a test for the fuzzy matching. Any ideas? |
Original comment by Anonymous. No, no ideas. BTW, I'm currently testing fuzzy named lists and group references for the next release. |
Original comment by Anonymous. The new release (regex-0.1.20110702) supports fuzzy named lists and group references. |
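A minimal sketch of a fuzzy named list along the lines of the OCR use case mentioned above; the city names and error limit are invented for illustration.

```python
import regex

# Hypothetical list of expected city names; the named list is supplied as a
# keyword argument and referenced with \L<name>, with one error allowed per name.
cities = ["Springfield", "Shelbyville", "Ogdenville"]
pattern = regex.compile(r"\b(?:\L<cities>){e<=1}\b", cities=cities)

# OCR-style text with one error in two of the names.
print(pattern.findall("Springfeld and Ogdanville, near Shelbyville"))
```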
Original comment by Anonymous. I'm back. One of the uses for fuzzy named lists is finding specific cities, streets, etc. in OCR'd lists of addresses. I'll whomp up a test case. |
Original comment by Anonymous. I just tried this interesting feature and would like to ask about the handling of the fuzzy patterns.
Probably the presence of a substring matching the pattern exactly limits the matching of the others (those preceding it?); cf. the following with the above examples:
It appears that towards the end of the string there is sometimes more chance to match (even within the same substring - consistent with the above).
Would it be possible to specify the matching resolution in more detail in order to make the fuzziness more verifiable/predictable? Many thanks for the continuing enhancements; now with the Unicode properties and fuzzy matching, I believe, recursion and code embedding or maybe case conversion on sub() are probably the only remaining unsupported features present in some other implementations :-) - not that I would want to use them too often (maybe except for the last one...). Regards, |
Original comment by Anonymous. Fuzzy matching applies to the preceding item in the same way as the quantifiers, so "F{e}" matches "F" with any number of errors permitted, which means it could match anything. However, it will always try to find the best match, the match with the lowest cost. regex.search("F{e}", "abFcd") will match "F" because that's the best match. regex.findall performs multiple searches, each starting where the previous match finished, so with regex.findall("F{e}", "abFcd"):
Surprising? Possibly. But the TRE regex engine finds the best match, and this implementation copies that behaviour. An alternative behaviour would be to match anything whose cost is within the limits, but that can have surprising results too. Given the regex "(cat){e<=1}" and the string "dog cat", it would find " cat" because " cat" comes before "cat" and has only 1 error. If regex.findall had the alternative behaviour, then regex.search would have it too, but usually you want to find the best match, so returning " cat" when "cat" is better would also be surprising. |
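A minimal sketch of the two behaviours being compared, assuming the BESTMATCH flag that a later comment in this thread introduces; which result you get therefore depends on the release.

```python
import regex

# Best-match semantics: "F{e}" prefers the zero-cost match on the literal "F".
print(regex.search(r"F{e}", "abFcd", flags=regex.BESTMATCH))

# First-match semantics (the later default): the search stops at the first
# position where the constraint can be satisfied, which for this text is the
# leading-space " cat" rather than the exact "cat", as described above.
print(regex.search(r"(?:cat){e<=1}", "dog cat"))
```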
Original comment by Anonymous. Thanks for the clarification; it was indeed kind of surprising to me that equally inexact elements were matched (or not) depending on their position with respect to the best match, and I think the alternative behaviour would be less surprising (to me), but if it is the behaviour of other engines supporting this feature, it might be better to stick with it. vbr |
Original comment by Anonymous. Just another thought: given the notes about OCR checks and similar, and given the context-dependent behaviour - is it somehow unexpected or discouraged to use fuzzy matching with findall() or finditer(), since search() is supposed to give the first "optimal" result? (In the mentioned OCR use case, one would probably test separate words against dictionary items rather than within a larger text.) Would it eventually be possible to match anything within the given constraints, also checking the overlaps, and maybe also expose the results in ascending order of "cost"? (Maybe a bit similar to the "overlapped" parameter or match.captures(...), which also expose rather "internal" steps of the matching.) But anyway, if the alternatives also have drawbacks (like performance etc.) and would differ from other implementations, it's probably better to stick with the usual way. Regards, |
Original comment by Anonymous. (No comment was entered for this change.) |
Original comment by Anonymous. Sorry for another question regarding fuzzy matching; it seems I'm still missing something about this feature. However, I was surprised by another issue - some matches I'd have expected are also missing, even after the best match, and supposedly complying with the error cost:
(with my custom findall function I can also get the "words" preceding "ad":
but there are still others missing; the following pattern only deals with alternation, as insertions and deletions aren't relevant in this sample, given the word boundaries:
Is there maybe some left-to-right bias in the algorithm, or am I missing something else? Possibly related to this, how are the empty strings matched? E.g.:
why are there no empty matches between the characters (all of them cost one deletion), but only the last one? Thanks in advance, |
Original comment by Anonymous. The regex r"\bad{e<=1}\b" is:
In the first example, the best match in the text is "ad", and the best match in the remainder is "ae". There are no further matches which meet the constraint. It always looks for the best match in (the remainder of) the text. I think I'll have to change the current behaviour (it seems to be too confusing) and add a BEST flag for when you want the best match, unless Bill (the original requester) or anyone else has an objection. To summarise the differences: Example 1:
Current: ['ad', 'ae'] Example 2:
Current: ['b', 'c', 'd', ''] |
Original comment by Anonymous. Thanks for the detailed answer; now I notice that a part of my surprise was obviously due to the missing parentheses, i.e. I meant:
to be roughly equivalent to:
(given the text and the pattern; with the proposed change to also match before the "optimal" match). My naive approach is something like the following (sorry for the style and coding - there must be some more elegant ways to deal with the duplication of the function arguments etc...)
Using this function I get e.g.
but
I actually had expected empty matches between all the single characters
or is it again some misunderstanding on my part? Of course, native behaviour in the regex engine would be much preferred to such a fragile wrapper function (also because of the mentioned anchors, lookarounds etc.). Many thanks for your efforts, |
Original comment by Anonymous. After the proposed change:
In the second example:
Substitution is tried first, then insertion, then deletion. |
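One way to see which kind of error a fuzzy match used is to allow only a single error type at a time; a minimal sketch with made-up patterns and strings.

```python
import regex

print(regex.search(r"(?:cat){s<=1}", "cot"))   # one substitution allowed
print(regex.search(r"(?:cat){i<=1}", "cart"))  # one insertion (extra character in the text)
print(regex.search(r"(?:cat){d<=1}", "ct"))    # one deletion (character missing from the text)
```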
Original comment by Anonymous. Thanks, the proposed behaviour looks really promising and (at least for me) much more consistent. Regards, |
Original comment by Anonymous. The regex module now looks for the first match. Use the BESTMATCH flag to look for the best match. |
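A minimal sketch contrasting the two modes; the pattern and text are illustrative only.

```python
import regex

text = "dog cat"
# Default: take the first match that satisfies the error constraint.
print(regex.findall(r"(?:cat){e<=1}", text))
# BESTMATCH: look for the match with the lowest error cost instead.
print(regex.findall(r"(?:cat){e<=1}", text, flags=regex.BESTMATCH))
```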
Original comment by Anonymous. Lately I have used fuzzy matching for some error checking and it proved to be very helpful; Comment 28 above clarifies the order in which the error types are checked: "Substitution is tried first, then insertion, then deletion." Now I would be especially interested in how the segmentation of the matches is designed when a large or unlimited number of errors is allowed. It is likely that the matches without any error are taken as a base, which are then tried for the variance defined in the cost pattern. Is it the case that the match possibilities are determined first and the allowed errors don't change/shift these "boundaries" for possible matches? It seems that e.g. a quantifier allowing possibly longer matches somehow "eats up" the possibilities for checking within the error cost:
Does the algorithm work this way, or am I misinterpreting these simplified results? Just another thing I observed - without parentheses (which seem to be preferable in most cases anyway) - the cost pattern between {} seems to apply to the preceding "literal" only, e.g. also excluding quantifiers (?)
As an interesting and probably useless corner case I found an empty pattern which matches the empty string at character boundaries without the cost pattern, and nothing with any errors allowed
However, a similar zero-width pattern matches identically:
This is meant purely as a question about the matching behaviour; in practice I am very happy with the present functionality; I just feel that I could use it even better with an appropriate understanding of the "segmentation" of the possible matches tried. I would simply like to understand how many "erroneous" cases I can match with this feature (not just in the extreme case of {e}, which would conceptually match anything, but obviously doesn't). Many thanks in advance. |
Original comment by Anonymous. As it tries to match, if a part of the pattern fails to match, for example, "a" doesn't match "b", it assumes that there's an error, whether a substitution, insertion or deletion, and then continues. It tries each type of error in turn, checking whether the error constraints allow it. The error constraint follows the same syntactic rule as the quantifiers, applying to the preceding item. If there's no preceding item, it's treated as a literal, for example, r"{e}" is treated the same as r"\{e}":
I do note that the quantifiers raise an exception if there's no preceding item, so it could be argued that the error constraint should do the same. |
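A minimal sketch of the scoping rule described above; the patterns are illustrative only.

```python
import regex

# Here {e<=1} applies only to the final "d", not to the whole word.
print(regex.search(r"\bad{e<=1}\b", "az"))

# Grouping makes the constraint cover the whole word.
print(regex.search(r"\b(?:ad){e<=1}\b", "az"))

# With no preceding item, "{e}" is treated as the literal text "{e}".
print(regex.search(r"{e}", "a{e}b"))
```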
Original comment by Anonymous. Thanks for the prompt answer; however, how do quantifiers and error costs work together? It now seems to me that the two should be combined rather carefully for the same subpattern (or is it discouraged altogether?). In some elementary patterns (below) it seems that the quantifier somehow hinders matching a (suboptimal) pattern (likely satisfying the general constraint {e}), which is found if equivalent simple literals are used, cf. e.g. (Q{1,2}){e} vs (Q){e}, (QQ){e} ...
I see that such patterns are not very realistic nor useful, but I feel I should understand the matching possibilities better. Regards, |
Original comment by Anonymous. That's a bug. It optimises repeats of single-character matches (a character, a set, or a property). Unfortunately, that doesn't work as you'd expect when using fuzzy matching. :-( Fixed in regex 0.1.20111004. |
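A minimal sketch along the lines of the patterns quoted above, for reproducing the fix; the text is made up, and results on releases before 0.1.20111004 may differ because of the bug.

```python
import regex

text = "Qx"
# A repeated single character inside a fuzzy group: before the fix, the
# single-character repeat optimisation made this behave differently from the
# literal spellings below.
print(regex.search(r"(Q{1,2}){e<=1}", text))
print(regex.search(r"(Q){e<=1}", text))
print(regex.search(r"(QQ){e<=1}", text))
```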
Original comment by Anonymous. Thank you very much; in the updated version (regex-0.1.20111004) the fuzzy matching is a bit more predictable for me,
I'd think that all of the "words" would match with at most one error; however, the more difficult aspect is the different handling of "abc" vs. "dba", which both need one substitution to match. Is it a peculiarity of the fuzzy matching or of the word-beginning/end anchors? vbr |
Original comment by Anonymous. When it finds a fuzzy match it tries to see whether it can get a better fit (a match with a lower cost) within that match. The initial match for:
is actually "abc cde dba" (the cost is 2), but a better fit is "abc cde" (the cost is 1), and an even better fit is "cde" (the cost is 0). This results in the final output of ['cde', 'dba']. If it didn't look for a better fit the final output would be ["abc cde dba"]. Let me give you another example:
If I change your example slightly:
And that's my justification. :-) |
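A minimal sketch of the "better fit" refinement described above, using the same sample text; the pattern is invented, and in later releases this refinement is tied to flags such as ENHANCEMATCH or BESTMATCH rather than always being applied, so the output depends on the version.

```python
import regex

text = "abc cde dba"
# Without refinement, a fuzzy match may cover more text than needed; the
# refinement shrinks it towards the lowest-cost fit (ideally the exact "cde").
print(regex.findall(r"(?:cde){e<=1}", text))
print(regex.findall(r"(?:cde){e<=1}", text, flags=regex.ENHANCEMATCH))
```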
Original comment by Anonymous. Thanks for the explanation; the recursive readjusting of the match towards the lowest error cost indeed makes most of the cases that weren't very clear to me obvious. Now I feel I may even grasp the fuzzy matching rules - sooner or later... Thanks and regards, |
Original report by Anonymous.
I'm currently using the TRE regex engine to match output from OCR, because it supports approximate matching. Very useful. Would be nice to have that capability in Python regex, as well.