Connecting Matcher Patterns to Matches #10934

polm · 2022-06-09T06:07:22Z


Sometimes you have multiple Matcher patterns that describe variations on the same thing but need different post-processing. Matcher results include the match ID the patterns were added under, but they don't identify which specific pattern matched. This post explains how to work around that limitation using the same technique the EntityRuler uses internally.

When is this useful?

Suppose you want to match upper-case words followed by a colon (like COLOR:), unless a colon comes before them too. You also want to accept such words at the start of a sentence, and that takes a second pattern: the NOT token has to match an actual token, so it can't match when nothing precedes the word.

# colon can't come first: the first token is anything except ":"
pat1 = [{"TEXT": {"NOT_IN": [":"]}}, {"TEXT": {"REGEX": "^[A-Z]+$"}}, {"TEXT": ":"}]
# but start of sentence is ok
pat2 = [{"TEXT": {"REGEX": "^[A-Z]+$"}, "IS_SENT_START": True}, {"TEXT": ":"}]

These are basically the same thing, so it makes sense to add them with the same label. But when you post-process them, you want to remove the NOT token if present, so you need slightly different code.

Note that in this case, you can actually just check the length of a match: the first pattern always matches three tokens and the second always matches two, so a three-token match has a leading token to remove. This kind of simple check is possible in many cases of multiple patterns with one label, and there's no downside to using it when it applies. The technique outlined in this post is only needed for the general case, where you can't make assumptions about the patterns you have.
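As a rough sketch of that kind of simple check: for the two patterns above, a match from pat1 is three tokens long and a match from pat2 is two, so the length of a match is enough to tell whether there is a leading context token to drop. The helper name trim_context and the plain token list are assumptions for illustration; with the Matcher, start and end would come from each match tuple.

```python
def trim_context(start, end):
    """Drop the leading context token from a three-token match.

    Hypothetical helper: assumes pat1 matches are three tokens
    (context, WORD, ":") and pat2 matches are two (WORD, ":").
    """
    if end - start == 3:
        start += 1
    return start, end


tokens = ["the", "COLOR", ":", "red"]
start, end = trim_context(0, 3)   # offsets of a pat1 match
print(tokens[start:end])          # ['COLOR', ':']
```

A check like this is tied to the exact shape of the two patterns, which is why it stops working once patterns of the same length need different handling.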

How the EntityRuler Works

The EntityRuler has a feature that allows you to assign IDs to entities it matches. The way this works is that internally each label is combined with its ID and fed to the Matcher or PhraseMatcher as a separate label. For example:

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "GPE", "pattern": "San Francisco", "id": "san-francisco"},
            {"label": "GPE", "pattern": "London", "id": "london"}]
ruler.add_patterns(patterns)
# Internally the EntityRuler does something like this:
# self.phrasematcher.add("GPE||san-francisco", ...)
# self.phrasematcher.add("GPE||london", ...)

When items are matched, keys like GPE||san-francisco are split to provide the final NER entity type and the entity ID (if any). If you're curious, the EntityRuler source in the spaCy repository has the detailed implementation.
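The encode-and-split step can be sketched in plain Python. The function names encode_key and split_key are made up for this illustration and are not spaCy's API; only the "||" separator and the example keys come from the description above.

```python
SEP = "||"

def encode_key(label, ent_id=None, sep=SEP):
    """Combine a label and an optional ID into one matcher key."""
    return label if ent_id is None else f"{label}{sep}{ent_id}"

def split_key(key, sep=SEP):
    """Recover (label, ent_id) from a key; ent_id is None if absent."""
    label, found, ent_id = key.partition(sep)
    return label, (ent_id if found else None)

key = encode_key("GPE", "san-francisco")
print(key)                 # GPE||san-francisco
print(split_key(key))      # ('GPE', 'san-francisco')
print(split_key("GPE"))    # ('GPE', None)
```

This only round-trips cleanly if the separator never appears inside a label, so it is worth picking a separator that your labels can't contain.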

Implementing the Solution

Let's write a smaller version of the EntityRuler solution. Here we'll implement a Numerifier component that assigns the integer value associated with a word to a Token extension attribute.

import spacy
from spacy.matcher import Matcher
from spacy.language import Language
from spacy.tokens import Token

Token.set_extension("number", default=None)

NUMBER_WORDS = {
    "five": 5,
    "fifth": 5,
    # in reality you would have more
}

@Language.factory("numerifier", default_config={"mapping": NUMBER_WORDS})
def build_numerifier(nlp, name, mapping):
    return Numerifier(nlp=nlp, mapping=mapping)

class Numerifier:
    def __init__(self, nlp, mapping):
        self.mapping = mapping
        self.nlp = nlp
        self.matcher = Matcher(nlp.vocab)
        # it may make sense to allow this to be customized
        self.sep = "||"

        for key, val in self.mapping.items():
            # encode the value
            match_key = f"NUMBER{self.sep}{val}"
            pattern = [{"LOWER": key.lower()}]
            self.matcher.add(match_key, [pattern])

    def __call__(self, doc):
        for match_id, start, end in self.matcher(doc):
            string_id = self.nlp.vocab.strings[match_id]
            label, _, num = string_id.partition(self.sep)

            span = doc[start:end]
            for tok in span:
                tok._.number = int(num)

            # set the label on the span; the span isn't stored anywhere here,
            # but you could add it to doc.ents or doc.spans
            span.label_ = label
        return doc

We can use this code like this:

nlp = spacy.blank("en")
numerifier = nlp.add_pipe("numerifier")
doc = nlp("I have five apples")

for tok in doc:
    print(tok, tok._.number, sep="\t")

To produce this output:

I       None
have    None
five    5
apples  None

By modifying the above code you can make your own component to deal with match patterns that should be grouped together but require different behavior in post-processing.
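For instance, going back to the colon example, one possible adaptation is to encode each pattern's position in its key and dispatch on it afterwards. This is a sketch under assumptions: the label FIELD, the helpers keyed_patterns and postprocess, and the hand-written match tuples are all hypothetical, and the patterns are stood in for by placeholder strings. With a real Matcher you would add each key with its one pattern and recover the key string with nlp.vocab.strings[match_id], as in the Numerifier.

```python
SEP = "||"

def keyed_patterns(label, patterns, sep=SEP):
    """Give every pattern its own key, e.g. FIELD||0, FIELD||1."""
    return {f"{label}{sep}{i}": pat for i, pat in enumerate(patterns)}

def postprocess(key, start, end, sep=SEP):
    """Return (label, start, end), trimmed according to the pattern index."""
    label, _, index = key.partition(sep)
    if int(index) == 0:
        # pattern 0 includes a leading context token; drop it
        start += 1
    return label, start, end

# placeholders standing in for the two token patterns
keys = keyed_patterns("FIELD", ["pat1", "pat2"])
print(list(keys))                      # ['FIELD||0', 'FIELD||1']

# hand-written (key, start, end) tuples standing in for Matcher output
tokens = ["SIZE", ":", "big", "the", "COLOR", ":", "red"]
matches = [("FIELD||1", 0, 2), ("FIELD||0", 3, 6)]
for key, start, end in matches:
    label, start, end = postprocess(key, start, end)
    print(label, tokens[start:end])
# FIELD ['SIZE', ':']
# FIELD ['COLOR', ':']
```

Both matches end up with the same label, but the code that handled them knew exactly which pattern produced each one.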
