Add experimental Span Suggesters #11

Merged (13 commits) on May 17, 2022


thomashacker
Contributor

This PR adds three new experimental suggester functions for the spancat component and a spaCy project showcasing how to use them in a config.cfg file.

Subtree Suggester:

  • Uses annotations from the Tagger and Parser to suggest subtrees of individual tokens

Chunk Suggester:

  • Uses annotations from the Tagger and Parser to suggest noun_chunks

Sentence Suggester:

  • Uses sentence boundaries to suggest sentences
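
To make the subtree idea concrete, here is a minimal pure-Python sketch of one plausible reading: each token contributes the span running from the leftmost to the rightmost token of its subtree. It does not use spaCy. The `subtree_spans` helper and the `heads` input format (one head index per token, with the root pointing to itself) are assumptions made for illustration, not the code in this PR.

```python
from typing import List, Tuple


def subtree_spans(heads: List[int]) -> List[Tuple[int, int]]:
    """Return sorted, de-duplicated (start, end) spans, one per token subtree.

    heads[i] is the index of token i's head; the root points to itself.
    Hypothetical input format; assumes a valid, projective dependency tree.
    """
    n = len(heads)
    left = list(range(n))   # leftmost token index in each token's subtree
    right = list(range(n))  # rightmost token index in each token's subtree
    for i in range(n):
        # Walk from token i up to the root, widening every ancestor's edges.
        node = i
        while heads[node] != node:
            node = heads[node]
            left[node] = min(left[node], i)
            right[node] = max(right[node], i)
    # End offsets are exclusive, matching spaCy's span convention.
    return sorted({(left[i], right[i] + 1) for i in range(n)})


# "She ate green apples": "ate" is the root, "green" attaches to "apples".
print(subtree_spans([1, 1, 3, 1]))
```

With spaCy annotations available, the same edges could presumably be read from `token.left_edge` and `token.right_edge` instead of being recomputed.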

These suggesters also come with ngram functionality, which lets users set a list of sizes for suggesting individual ngrams.
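
As a rough illustration of what a `sizes` list means, the sketch below enumerates every ngram span of the requested lengths over a document of `n_tokens` tokens. The `ngram_spans` helper is hypothetical and pure Python; the suggesters in this PR return thinc `Ragged` arrays rather than lists of tuples.

```python
from typing import Iterable, List, Tuple


def ngram_spans(n_tokens: int, sizes: Iterable[int]) -> List[Tuple[int, int]]:
    """List (start, end) offsets of all ngrams with the given sizes.

    Hypothetical helper; assumes every size is a positive integer.
    """
    spans = []
    for size in sizes:
        # A size larger than the document yields an empty range, so no spans.
        for start in range(n_tokens - size + 1):
            spans.append((start, start + size))
    return spans


print(ngram_spans(3, [1, 2]))  # unigrams first, then bigrams
print(ngram_spans(3, []))      # an empty sizes list suggests no ngrams
```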

The spaCy project covers:

  • How to source components from existing models
  • How to use frozen_components & annotating_components
  • How to use custom suggester functions registered in the registry

@thomashacker added the enhancement label on Apr 23, 2022
@Lolologist

Hi there! I stumbled in here while poking around at spancategorizer capabilities and wanted to ask if this is something that you hope/intend to make it into future full releases of spaCy. I'm very excited about spans from subtrees, especially as that was something I was going to try on my own! Thanks so much for all your work.

@adrianeboyd
Contributor

The general plan is to initially release things like this under spacy-experimental for easier testing, and then move features to the core library or other places once they seem stable and generally useful. Not everything will make the cut and the APIs might change a bit, but I think these particular suggesters have a very high chance of being moved to the core library soon.

@adrianeboyd
Contributor

I think that these span suggesters would be simpler and faster without the ngrams thrown in, and it would make sense to have a more general way of combining arbitrary suggesters/suggestions than what's proposed here. (In particular sentences + ngrams doesn't make a lot of sense to me?)

@thomashacker
Contributor Author

thomashacker commented May 3, 2022

I think having the ngram functionality as a baseline will be useful for most use cases. For example, sentences + ngrams could be used for the healthsea dataset. I also like that it's optional: when the sizes list is left empty, the suggester skips the whole code block that handles ngram suggestions. But I definitely agree that it would be better to have a more general approach to combining multiple suggesters. I'd propose that we keep the ngram functionality for the experimental versions and, when we want to integrate them into the spaCy codebase, think about how we can easily combine the suggesters.

@adrianeboyd
Contributor

It is a problem that it's hard to specify a variable number of things (each with their own configs) to combine in a config block, which is something that's also come up with augmenters.

If we're going to do this, I think it would be better to have a utility function that merges suggestions in a more general, efficient way. The ops.to_numpy() for each row in the array is going to be particularly slow in the current version and appending everything row-by-row vs. working with the existing numpy array looks like it may also slow things down. (In a case where you already know you want it as numpy, I think you could just pass NumpyOps as the ops instead to the suggester and then you could skip all the conversions?)

I decided to sketch out what a suggestions merger could look like (could still use some typing and testing, particularly on GPU; note that there's no cupy.unique so you have to convert to numpy):

from typing import List, Optional, Iterable, cast
import numpy
from thinc.api import get_current_ops, Ops
from thinc.types import Ragged, Ints1d
from spacy.tokens import Doc
from spacy.util import registry
from spacy.pipeline.spancat import Suggester


@registry.misc("experimental.ngram_sentence_suggester.v1")
def build_ngram_sentence_suggester(sizes: List[int]) -> Suggester:
    """Suggest ngrams and sentences. Requires sentence boundaries"""

    ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes)
    def ngram_sentence_suggester(
        docs: Iterable[Doc], *, ops: Optional[Ops] = None
    ) -> Ragged:
        ngram_suggestions = ngram_suggester(docs, ops=ops)
        sentence_suggestions = sentence_suggester(docs, ops=ops)
        return merge_suggestions([ngram_suggestions, sentence_suggestions], ops=ops)

    return ngram_sentence_suggester


def sentence_suggester(docs: Iterable[Doc], *, ops: Optional[Ops] = None) -> Ragged:
    if ops is None:
        ops = get_current_ops()
    spans = []
    lengths = []

    for doc in docs:
        sents = list(doc.sents)
        spans.extend((sent.start, sent.end) for sent in sents)
        lengths.append(len(sents))

    lengths_array = cast(Ints1d, ops.asarray(lengths, dtype="i"))
    if len(spans) > 0:
        output = Ragged(ops.asarray(spans, dtype="i"), lengths_array)
    else:
        output = Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)

    return output


def merge_suggestions(suggestions: List[Ragged], ops: Optional[Ops] = None) -> Ragged:
    if ops is None:
        ops = get_current_ops()

    spans = []
    lengths = []

    if len(suggestions) == 0:
        lengths_array = cast(Ints1d, ops.asarray(lengths, dtype="i"))
        return Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)

    len_docs = len(suggestions[0])
    assert all(len_docs == len(x) for x in suggestions)

    for i in range(len_docs):
        combined = ops.xp.vstack([s[i].data for s in suggestions])
        uniqued = numpy.unique(ops.to_numpy(combined), axis=0)
        spans.append(ops.asarray(uniqued))
        lengths.append(uniqued.shape[0])

    lengths_array = cast(Ints1d, ops.asarray(lengths, dtype="i"))
    if len(spans) > 0:
        output = Ragged(ops.xp.vstack(spans), lengths_array)
    else:
        output = Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)

    return output

And I wouldn't be surprised if there's a faster way to manage the per-doc unique step, but this has reached the end of my numpy knowledge.
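
One possible way to avoid the per-doc loop, sketched here on plain numpy arrays rather than `Ragged` objects: prefix every span row with the index of its doc, run a single `numpy.unique` over the stacked `(doc_id, start, end)` rows, then recover the per-doc lengths with `numpy.bincount`. The `merge_span_arrays` name and its signature are assumptions for illustration, and the approach has not been benchmarked or tested on GPU.

```python
from typing import List, Tuple

import numpy


def merge_span_arrays(
    span_arrays: List[numpy.ndarray], length_arrays: List[numpy.ndarray]
) -> Tuple[numpy.ndarray, numpy.ndarray]:
    """Merge several (spans, lengths) pairs, removing duplicates per doc.

    Hypothetical helper: each spans array has shape (n_spans, 2) and each
    lengths array holds the number of spans per doc.
    """
    n_docs = len(length_arrays[0])
    assert all(len(lengths) == n_docs for lengths in length_arrays)

    tagged = []
    for spans, lengths in zip(span_arrays, length_arrays):
        # Label every span row with the index of the doc it belongs to.
        doc_ids = numpy.repeat(numpy.arange(n_docs), lengths)
        tagged.append(numpy.column_stack([doc_ids, spans]))

    # A single unique call over (doc_id, start, end) rows. The rows come
    # back sorted, so spans stay grouped by doc.
    uniqued = numpy.unique(numpy.vstack(tagged), axis=0)
    merged_lengths = numpy.bincount(uniqued[:, 0], minlength=n_docs)
    return uniqued[:, 1:].astype("i"), merged_lengths.astype("i")


a_spans = numpy.array([[0, 1], [1, 2], [0, 2]], dtype="i")  # doc 0: 2 spans, doc 1: 1 span
a_lengths = numpy.array([2, 1], dtype="i")
b_spans = numpy.array([[0, 2], [0, 2]], dtype="i")          # one span per doc
b_lengths = numpy.array([1, 1], dtype="i")
spans, lengths = merge_span_arrays([a_spans, b_spans], [a_lengths, b_lengths])
print(spans.tolist(), lengths.tolist())
```

In the sketch above this would presumably slot in by passing the numpy versions of each suggestion's `data` and `lengths` arrays and wrapping the result back into a `Ragged`.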

@thomashacker
Contributor Author

Wow, that's awesome! Thanks adriane! 🥳
I've implemented the merge_suggesters function and also registered suggesters that skip ngrams entirely, for users who don't want them:

  spacy-experimental.subtree_suggester.v1
  spacy-experimental.ngram_subtree_suggester.v1 
  spacy-experimental.chunk_suggester.v1 
  spacy-experimental.ngram_chunk_suggester.v1 
  spacy-experimental.sentence_suggester.v1 
  spacy-experimental.ngram_sentence_suggester.v1 

@adrianeboyd
Contributor

Can you remove the formatting changes to azure-pipelines.yml?

@adrianeboyd
Contributor

Can you add documentation for these to the top-level README?

@adrianeboyd
Contributor

adrianeboyd commented May 12, 2022

Can you add sample configs in the README for one of the types of suggesters that also explains how to use sizes for the ngram versions?

Also show explicitly how to use @misc so it's clear where the functions are registered.
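
For orientation, a config block of the kind requested might look roughly like the fragment below. This is a sketch, not the README text: it assumes the spancat component is named `spancat` in the pipeline and that the registered name matches the `spacy-experimental.ngram_subtree_suggester.v1` entry listed earlier in this thread, and the `sizes` values are purely illustrative.

```ini
[components.spancat.suggester]
# Assumed name, taken from the list earlier in this thread
@misc = "spacy-experimental.ngram_subtree_suggester.v1"
# Illustrative ngram lengths to suggest in addition to subtrees
sizes = [1, 2, 3]
```

The non-ngram variants would presumably take no `sizes` argument at all.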

@Lolologist

Not to keep nosing into y'all's business, but once this merges would consumer (my) feedback on trying a model using this be warranted/appreciated?

@adrianeboyd
Contributor

@Lolologist: Feedback is welcome, that's pretty much what spacy-experimental is there for! We want it to be easy for users to try things out so we can improve them before they're added to the core library.

@thomashacker
Contributor Author

Added example snippets to the readme with 1f52f1e 🎉

@adrianeboyd merged commit 0d53416 into explosion:master on May 17, 2022