@mbatchkarov and I found a bug in the matcher when using a ZERO_PLUS operator. There may also be an inconsistency in which matches are returned, though we are not certain of that. Take a look at the following code example:
from spacy.en import English
from spacy.matcher import Matcher
from spacy.attrs import ORTH

nlp = English()
matcher = Matcher(nlp.vocab)
matcher.add_pattern('KleenePhilippe', [{ORTH: 'Philippe', 'OP': '+'}])
doc = nlp('Philippe Philippe of Philippe.')
m = matcher(doc)

def print_matcher_output(m):
    for ent_id, label, start, end in m:
        print(str(doc[start:end]))

print_matcher_output(m)
Output:
>>> Philippe Philippe of
>>> Philippe of
>>> Philippe.
The obvious bug is in the indices returned with each match: every match extends one token past the last matching token. We are not sure whether the matcher passes a faulty index or produces a faulty match, but since each match includes whatever token follows it, a bad end index seems the likelier cause.
Apart from the index, it is not quite clear what the behaviour of the ZERO_PLUS operator should be. In the case above we see two interpretations:
['Philippe Philippe', 'Philippe'], which corresponds to greedy matching (like re.findall('(P+)', 'PP of P')),
['Philippe', 'Philippe Philippe', 'Philippe', 'Philippe'], which lists every possible match, consistent with how matches from different rules behave.
It is not clear what the logic of the current output is, so maybe it's just the manifestation of another bug.
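To make the two interpretations concrete, here is a minimal sketch in plain Python that does not use spaCy at all. The helper names `greedy_runs` and `all_runs` are hypothetical, and the token list is written by hand on the assumption that the tokenizer splits the final 'Philippe.' into 'Philippe' and '.'. It only illustrates the two candidate outputs and says nothing about how the matcher is implemented.

```python
import re

def greedy_runs(tokens, target):
    """Return (start, end) for each maximal run of `target` (greedy reading)."""
    runs, start = [], None
    # The trailing None is a sentinel so that a run ending at the last token is closed.
    for i, tok in enumerate(tokens + [None]):
        if tok == target and start is None:
            start = i
        elif tok != target and start is not None:
            runs.append((start, i))
            start = None
    return runs

def all_runs(tokens, target):
    """Return (start, end) for every contiguous span made only of `target`."""
    spans = []
    for start in range(len(tokens)):
        end = start
        while end < len(tokens) and tokens[end] == target:
            end += 1
            spans.append((start, end))
    return spans

# Hand-written tokens (assumed tokenization of 'Philippe Philippe of Philippe.').
tokens = ['Philippe', 'Philippe', 'of', 'Philippe', '.']

greedy = [' '.join(tokens[s:e]) for s, e in greedy_runs(tokens, 'Philippe')]
exhaustive = [' '.join(tokens[s:e]) for s, e in all_runs(tokens, 'Philippe')]

print(greedy)      # ['Philippe Philippe', 'Philippe']
print(exhaustive)  # ['Philippe', 'Philippe Philippe', 'Philippe', 'Philippe']

# The regex analogy from the text gives the same shape as the greedy reading.
print(re.findall('(P+)', 'PP of P'))  # ['PP', 'P']
```

In both readings the end index is exclusive and stops at the last matching token, so neither would produce the trailing 'of' or '.' seen in the output above.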
Here is another test case that doesn't work at all:
Sorry for the delay getting to this. Two issues here:
There was a bug in the matcher that meant that patterns ending with "optional" items that could be filled at the end of the string failed to match. I've fixed this (although the fix is a little under-tested, which makes me nervous)
The '+' is implemented as a sequence of operators: ONE, ZERO_PLUS. The ZERO_PLUS operator isn't greedy, so you'd get a length-2 match. I agree this isn't great. I've exposed the ONE operator with the op string '1', to give better control of these things. It'd be nice to have a more satisfying system here.