Connecting Matcher Patterns to Matches #10934
polm started this conversation in Help: Best practices
Sometimes you have multiple Matcher patterns that describe different variations on the same thing and need different post-processing. Matcher results include the match ID used when the pattern was added, but they don't identify which specific pattern matched. This post explains how to work around this limitation using the same technique the EntityRuler uses internally.
When is this useful?
Suppose you want to match upper-case words followed by a colon (like `COLOR:`), unless a colon comes before them too. Since you also accept words at the start of a sentence, you need two patterns, because the `NOT` won't match words that aren't there.
These are basically the same thing, so it makes sense to add them with the same label. But when you post-process them, you want to remove the `NOT` token if present, so you need slightly different code.
Note that in this case, you can actually just check the first token of a match and see if it is `:` and remove it if so. This kind of simple check is possible in many cases of multiple patterns with one label, and there's no downside to using it in any particular case. The technique outlined in this post is only useful for dealing with the general case where you can't make assumptions about the patterns you have.
How the EntityRuler Works
The EntityRuler has a feature that allows you to assign IDs to entities it matches. The way this works is that internally each label is combined with its ID and fed to the Matcher or PhraseMatcher as a separate label. For example:
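A minimal sketch of that idea, assuming spaCy v3. The patterns, the sample sentence, and the `make_key`/`split_key` helpers are illustrative assumptions that mimic the key format; they are not the library's internal code:

```python
import spacy

SEP = "||"  # separator seen in keys like GPE||san-francisco


def make_key(label, ent_id=None):
    """Hypothetical helper: combine a label and an optional ID into one key."""
    return f"{label}{SEP}{ent_id}" if ent_id else label


def split_key(key):
    """Hypothetical helper: recover (label, ent_id) from a combined key."""
    label, _, ent_id = key.partition(SEP)
    return label, ent_id or None


nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Two different patterns share one label and one ID.
    {"label": "GPE", "pattern": "San Francisco", "id": "san-francisco"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"},
    # A pattern without an ID.
    {"label": "ORG", "pattern": "Apple"},
])

doc = nlp("Apple opened an office in San Fran.")
ents = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
print(ents)
print(make_key("GPE", "san-francisco"), split_key("GPE||san-francisco"))
```

The point of the sketch is that the ID travels inside the key handed to the matcher, so nothing besides the key is needed to recover it after a match.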
When items are matched, the keys like `GPE||san-francisco` are split to provide the final NER entity type and the entity ID (if any). If you're curious, see the EntityRuler source for the detailed implementation.
Implementing the Solution
Let's write a smaller version of the EntityRuler solution. Here we'll implement a `Numerifier` component that assigns the integer value associated with a word to a Token extension attribute.
We can use this code like this:
To produce this output:
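A minimal sketch of such a component together with its usage, assuming spaCy v3. The word list, the `NUMBER` label, the `||` separator, the `numeric_value` extension name, and the sample sentence are all assumptions chosen for illustration:

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Token

SEP = "||"  # assumed separator, mirroring keys like GPE||san-francisco
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3}  # illustrative word list

# Assumed extension name; force=True keeps the snippet safe to re-run.
Token.set_extension("numeric_value", default=None, force=True)


@Language.factory("numerifier")
class Numerifier:
    def __init__(self, nlp, name):
        self.name = name
        self.matcher = Matcher(nlp.vocab)
        for word, value in NUMBER_WORDS.items():
            # Each pattern gets its own key: shared label plus pattern-specific data.
            self.matcher.add(f"NUMBER{SEP}{value}", [[{"LOWER": word}]])

    def __call__(self, doc):
        for match_id, start, end in self.matcher(doc):
            # The match ID is a hash; look up the original key string and split it.
            label, value = doc.vocab.strings[match_id].split(SEP)
            for token in doc[start:end]:
                token._.numeric_value = int(value)
        return doc


nlp = spacy.blank("en")
nlp.add_pipe("numerifier")
doc = nlp("I have one apple and three pears.")
values = [(t.text, t._.numeric_value) for t in doc if t._.numeric_value is not None]
print(values)
```

Every pattern is registered under the same `NUMBER` label, yet the part after the separator tells the component which pattern fired and what to do with it.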
By modifying the above code you can make your own component to deal with match patterns that should be grouped together but require different behavior in post-processing.
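As one possible adaptation, the sketch below applies the same key-splitting idea to the colon example from the start of this post. It assumes spaCy v3; the `FIELD` label, the variant names `bare` and `guarded`, the `find_fields` helper, and the exact patterns are hypothetical stand-ins, not the original patterns:

```python
import spacy
from spacy.matcher import Matcher

SEP = "||"  # assumed separator between the shared label and the variant name

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # provides the sentence starts the first pattern relies on

matcher = Matcher(nlp.vocab)
# Variant "bare": an upper-case word at the start of a sentence, then a colon.
matcher.add(
    f"FIELD{SEP}bare",
    [[{"IS_SENT_START": True, "IS_UPPER": True}, {"ORTH": ":"}]],
)
# Variant "guarded": any token that is not a colon, an upper-case word, then a colon.
matcher.add(
    f"FIELD{SEP}guarded",
    [[{"ORTH": {"NOT_IN": [":"]}}, {"IS_UPPER": True}, {"ORTH": ":"}]],
)


def find_fields(doc):
    """Return (label, text) pairs, post-processing each match by its variant."""
    results = []
    for match_id, start, end in sorted(matcher(doc), key=lambda m: m[1]):
        label, variant = doc.vocab.strings[match_id].split(SEP)
        if variant == "guarded":
            start += 1  # drop the leading non-colon token
        results.append((label, doc[start:end].text))
    return results


doc = nlp("COLOR: red. The SIZE: large.")
fields = find_fields(doc)
print(fields)
```

Both variants report the same label, and only the post-processing step differs. If two variants can match the same word, the results would need deduplicating, which this sketch leaves out.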