Add experimental Span Suggesters #11

thomashacker · 2022-04-23T20:02:34Z

This PR adds three new experimental suggester functions for the spancat component and a spaCy project showcasing how to use them in a config.cfg file.

Subtree Suggester:

Uses annotations from the Tagger and Parser to suggest the subtrees of individual tokens

Chunk Suggester:

Uses annotations from the Tagger and Parser to suggest noun_chunks

Sentence Suggester:

Uses sentence boundaries to suggest sentences

These suggesters also come with ngram functionality, which lets users set a list of sizes for suggesting individual ngrams.
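To make the idea concrete, here is a minimal pure-Python sketch of the kind of spans a subtree suggester and the ngram option would propose. It is an illustration only, not the code in this PR: the helper names `ngram_spans` and `subtree_spans` are hypothetical, and the parse is passed in as a plain list of head indices (assumed to form a valid tree) rather than as a spaCy `Doc`.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def ngram_spans(n_tokens: int, sizes: List[int]) -> List[Span]:
    """All contiguous spans whose length is one of `sizes`."""
    return [
        (start, start + size)
        for size in sizes
        for start in range(n_tokens - size + 1)
    ]


def subtree_spans(heads: List[int]) -> List[Span]:
    """One span per token, from its leftmost to its rightmost descendant.

    `heads[i]` is the index of token i's head; the root points to itself.
    """
    left = list(range(len(heads)))
    right = list(range(len(heads)))
    for i in range(len(heads)):
        # Walk up from token i, widening every ancestor's span to include i.
        node = i
        while heads[node] != node:
            node = heads[node]
            left[node] = min(left[node], i)
            right[node] = max(right[node], i)
    # Deduplicate and sort so the output is deterministic.
    return sorted({(lo, hi + 1) for lo, hi in zip(left, right)})


# "The quick fox jumps": The -> fox, quick -> fox, fox -> jumps, jumps = root
heads = [2, 2, 3, 3]
subtrees = subtree_spans(heads)
ngrams = ngram_spans(len(heads), sizes=[1, 2])
```

With a real pipeline, the same boundaries would come from the parser's annotations (for example each token's left and right edge) instead of a hand-written head list.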

The spaCy project covers:

How to source components from existing models
How to use frozen_components & annotating_components
How to use custom suggester functions registered in the registry
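As a rough sketch of the first two points, a training config along these lines sources components from a trained pipeline, freezes them, and still lets them annotate the docs that the spancat suggester sees. The source model `en_core_web_sm` and the exact component list are assumptions for illustration; which components need to be listed depends on the sourced pipeline, and the configs in the project itself are authoritative.

```ini
[nlp]
lang = "en"
pipeline = ["tok2vec", "tagger", "parser", "spancat"]

# Assumption: tagger and parser are taken from an existing trained pipeline,
# together with the tok2vec component they listen to.
[components.tok2vec]
source = "en_core_web_sm"

[components.tagger]
source = "en_core_web_sm"

[components.parser]
source = "en_core_web_sm"

[components.spancat]
factory = "spancat"

[training]
# Sourced components are not updated during training ...
frozen_components = ["tok2vec", "tagger", "parser"]
# ... but they still set their annotations on the docs during training,
# so that the suggester can read tags, dependencies and sentence boundaries.
annotating_components = ["tok2vec", "tagger", "parser"]
```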

Lolologist · 2022-05-02T15:10:08Z

Hi there! I stumbled in here while poking around at spancategorizer capabilities and wanted to ask if this is something that you hope/intend to make it into future full releases of spaCy. I'm very excited about spans from subtrees, especially as that was something I was going to try on my own! Thanks so much for all your work.

adrianeboyd · 2022-05-02T16:02:52Z

The general plan is to initially release things like this under spacy-experimental for easier testing, and then move features to the core library or other places once they seem stable and generally useful. Not everything will make the cut and the APIs might change a bit, but I think these particular suggesters have a very high chance of being moved to the core library soon.

adrianeboyd · 2022-05-03T16:19:20Z

I think that these span suggesters would be simpler and faster without the ngrams thrown in, and it would make sense to have a more general way of combining arbitrary suggesters/suggestions than what's proposed here. (In particular sentences + ngrams doesn't make a lot of sense to me?)

thomashacker · 2022-05-03T16:48:41Z

I think having the ngram functionality as a baseline will be useful for most use cases. For example, sentences + ngrams could be used for the healthsea dataset. I also like that it's optional: when the sizes list is left empty, the suggester skips the whole code block that handles the ngram suggestions. But I definitely agree that it would be better to have a more general approach for combining multiple suggesters. I'd propose that we keep the ngram functionality for the experimental versions and, when we want to integrate them into the spaCy codebase, think about how we can easily combine the suggesters.

adrianeboyd · 2022-05-04T09:44:23Z

It is a problem that it's hard to specify a variable number of things (each with their own configs) to combine in a config block, which is something that's also come up with augmenters.

If we're going to do this, I think it would be better to have a utility function that merges suggestions in a more general, efficient way. The ops.to_numpy() for each row in the array is going to be particularly slow in the current version and appending everything row-by-row vs. working with the existing numpy array looks like it may also slow things down. (In a case where you already know you want it as numpy, I think you could just pass NumpyOps as the ops instead to the suggester and then you could skip all the conversions?)

I decided to sketch out what a suggestions merger could look like (could still use some typing and testing, particularly on GPU; note that there's no cupy.unique so you have to convert to numpy):

from typing import List, Optional, Iterable, cast
import numpy
from thinc.api import get_current_ops, Ops
from thinc.types import Ragged, Ints1d
from spacy.tokens import Doc
from spacy.util import registry
from spacy.pipeline.spancat import Suggester


@registry.misc("experimental.ngram_sentence_suggester.v1")
def build_ngram_sentence_suggester(sizes: List[int]) -> Suggester:
    """Suggest ngrams and sentences. Requires sentence boundaries"""

    ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes)

    def ngram_sentence_suggester(
        docs: Iterable[Doc], *, ops: Optional[Ops] = None
    ) -> Ragged:
        ngram_suggestions = ngram_suggester(docs, ops=ops)
        sentence_suggestions = sentence_suggester(docs, ops=ops)
        return merge_suggestions([ngram_suggestions, sentence_suggestions], ops=ops)

    # Return the combined suggester, not the plain sentence suggester.
    return ngram_sentence_suggester


def sentence_suggester(docs: Iterable[Doc], *, ops: Optional[Ops] = None) -> Ragged:
    if ops is None:
        ops = get_current_ops()
    spans = []
    lengths = []

    for doc in docs:
        sents = list(doc.sents)
        spans.extend((sent.start, sent.end) for sent in sents)
        lengths.append(len(sents))

    lengths_array = cast(Ints1d, ops.asarray(lengths, dtype="i"))
    if len(spans) > 0:
        output = Ragged(ops.asarray(spans, dtype="i"), lengths_array)
    else:
        output = Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)

    return output


def merge_suggestions(suggestions: List[Ragged], ops: Optional[Ops] = None) -> Ragged:
    if ops is None:
        ops = get_current_ops()

    spans = []
    lengths = []

    if len(suggestions) == 0:
        lengths_array = cast(Ints1d, ops.asarray(lengths, dtype="i"))
        return Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)

    len_docs = len(suggestions[0])
    assert all(len_docs == len(x) for x in suggestions)

    for i in range(len_docs):
        combined = ops.xp.vstack([s[i].data for s in suggestions])
        uniqued = numpy.unique(ops.to_numpy(combined), axis=0)
        spans.append(ops.asarray(uniqued))
        lengths.append(uniqued.shape[0])

    lengths_array = cast(Ints1d, ops.asarray(lengths, dtype="i"))
    if len(spans) > 0:
        output = Ragged(ops.xp.vstack(spans), lengths_array)
    else:
        output = Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)

    return output

And I wouldn't be surprised if there's a faster way to manage the per-doc unique step, but this has reached the end of my numpy knowledge.
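For readers who want to check the intended behaviour of the merge step without thinc or numpy, the per-doc logic of `merge_suggestions` can be mirrored on plain Python lists. This is only a sketch of the semantics, not a replacement for the array-based version: `merge_span_lists` is a hypothetical helper, each suggester's output is assumed to be a list with one list of `(start, end)` tuples per doc, and sorting the deduplicated tuples stands in for the sorted unique rows that `numpy.unique(..., axis=0)` returns.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets


def merge_span_lists(suggestions: List[List[List[Span]]]) -> List[List[Span]]:
    """Merge the outputs of several suggesters, doc by doc.

    `suggestions[k][i]` holds the spans that suggester k proposed for doc i.
    Duplicate spans within a doc are dropped and the rest are sorted.
    """
    if not suggestions:
        return []
    n_docs = len(suggestions[0])
    if any(len(s) != n_docs for s in suggestions):
        raise ValueError("every suggester must cover the same number of docs")
    merged = []
    for i in range(n_docs):
        unique = {span for s in suggestions for span in s[i]}
        merged.append(sorted(unique))
    return merged


# Two docs: ngram suggestions for the first one only, sentences for both.
ngrams = [[(0, 1), (1, 2), (0, 2)], []]
sents = [[(0, 2)], [(0, 3)]]
merged = merge_span_lists([ngrams, sents])
lengths = [len(spans) for spans in merged]  # per-doc lengths, as in the Ragged
```

The overlap `(0, 2)` in the first doc is suggested by both inputs but appears only once in the merged result, which is the point of the per-doc unique step.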

thomashacker · 2022-05-05T09:41:11Z

Wow, that's awesome! Thanks adriane! 🥳
I've implemented the merge_suggesters function and also added suggester registries that don't use ngrams at all, for users who don't want them:

  spacy-experimental.subtree_suggester.v1
  spacy-experimental.ngram_subtree_suggester.v1 
  spacy-experimental.chunk_suggester.v1 
  spacy-experimental.ngram_chunk_suggester.v1 
  spacy-experimental.sentence_suggester.v1 
  spacy-experimental.ngram_sentence_suggester.v1
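A minimal sketch of how one of these names might be referenced from the spancat block of a config.cfg, assuming the functions are registered under `@misc` with the names listed above and that the ngram variants take a `sizes` argument (the values here are just an example):

```ini
[components.spancat.suggester]
@misc = "spacy-experimental.ngram_subtree_suggester.v1"
sizes = [1, 2, 3]
```

For the variants without ngrams, the `sizes` line would presumably be dropped and only the `@misc` name changed.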

adrianeboyd · 2022-05-05T11:01:13Z

Can you remove the formatting changes to azure-pipelines.yml?

adrianeboyd · 2022-05-09T11:58:15Z

Can you add documentation for these to the top-level README?

adrianeboyd · 2022-05-12T09:05:03Z

Can you add sample configs in the README for one of the types of suggesters that also explain how to use sizes for the ngram versions?

Also show explicitly how to use @misc so it's clear where the functions are registered.

Lolologist · 2022-05-16T14:21:55Z

Not to keep nosing into y'all's business, but once this merges would consumer (my) feedback on trying a model using this be warranted/appreciated?

adrianeboyd · 2022-05-16T14:56:02Z

@Lolologist: Feedback is welcome, that's pretty much what spacy-experimental is there for! We want it to be easy for users to try things out so we can improve them before they're added to the core library.

thomashacker · 2022-05-16T19:58:08Z

Added example snippets to the readme with 1f52f1e 🎉

thomashacker added 3 commits April 20, 2022 12:50

Init

0d2a2d4

Add suggesters

05aa5f7

Add suggester project

c0036e4

thomashacker added the enhancement (New feature or request) label Apr 23, 2022

black formatting

37b9182

adrianeboyd reviewed Apr 27, 2022

projects/span_suggesters/configs/chunk_suggester.cfg

Add tests

007039e

Add merge_suggester function

40e8d54

Adjust registry names

0f63c46