Add experimental Span Suggesters #11
Hi there! I stumbled in here while poking around at spancategorizer capabilities and wanted to ask if this is something that you hope/intend to make it into future full releases of spaCy. I'm very excited about spans from subtrees, especially as that was something I was going to try on my own! Thanks so much for all your work.
The general plan is to initially release things like this under
I think that these span suggesters would be simpler and faster without the ngrams thrown in, and it would make sense to have a more general way of combining arbitrary suggesters/suggestions than what's proposed here. (In particular, sentences + ngrams doesn't make a lot of sense to me?)
I think having the ngram functionality as a baseline will be useful for most use cases. For example, the
It is a problem that it's hard to specify a variable number of things (each with their own configs) to combine in a config block, which is something that's also come up with augmenters. If we're going to do this, I think it would be better to have a utility function that merges suggestions in a more general, efficient way. The

I decided to sketch out what a suggestions merger could look like (could still use some typing and testing, particularly on GPU; note that there's no

    from typing import List, Optional, Iterable, cast

    import numpy
    from thinc.api import get_current_ops, Ops
    from thinc.types import Ragged, Ints1d
    from spacy.tokens import Doc
    from spacy.util import registry
    from spacy.pipeline.spancat import Suggester


    @registry.misc("experimental.ngram_sentence_suggester.v1")
    def build_ngram_sentence_suggester(sizes: List[int]) -> Suggester:
        """Suggest ngrams and sentences. Requires sentence boundaries"""
        ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes)

        def ngram_sentence_suggester(
            docs: Iterable[Doc], *, ops: Optional[Ops] = None
        ) -> Ragged:
            ngram_suggestions = ngram_suggester(docs, ops=ops)
            sentence_suggestions = sentence_suggester(docs, ops=ops)
            return merge_suggestions([ngram_suggestions, sentence_suggestions], ops=ops)

        return ngram_sentence_suggester


    def sentence_suggester(docs: Iterable[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans = []
        lengths = []
        for doc in docs:
            # One (start, end) token span per sentence.
            sents = list(doc.sents)
            spans.extend((sent.start, sent.end) for sent in sents)
            lengths.append(len(sents))
        lengths_array = cast(Ints1d, ops.asarray(lengths, dtype="i"))
        if len(spans) > 0:
            output = Ragged(ops.asarray(spans, dtype="i"), lengths_array)
        else:
            output = Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)
        return output


    def merge_suggestions(suggestions: List[Ragged], ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans = []
        lengths = []
        if len(suggestions) == 0:
            lengths_array = cast(Ints1d, ops.asarray(lengths, dtype="i"))
            return Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)
        len_docs = len(suggestions[0])
        assert all(len_docs == len(x) for x in suggestions)
        for i in range(len_docs):
            # Stack every suggester's spans for doc i, then drop duplicate rows.
            combined = ops.xp.vstack([s[i].data for s in suggestions])
            # numpy.unique needs a numpy array, so convert before deduplicating.
            uniqued = numpy.unique(ops.to_numpy(combined), axis=0)
            spans.append(ops.asarray(uniqued))
            lengths.append(uniqued.shape[0])
        lengths_array = cast(Ints1d, ops.asarray(lengths, dtype="i"))
        if len(spans) > 0:
            output = Ragged(ops.xp.vstack(spans), lengths_array)
        else:
            output = Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)
        return output

And I wouldn't be surprised if there's a faster way to manage the per-doc unique step, but this has reached the end of my numpy knowledge.
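
For readers less familiar with the `Ragged` layout, the merge step can be pictured with plain Python lists. The sketch below is library-free and only illustrative: `merge_span_lists` is a hypothetical name, spans are `(start, end)` tuples, and each suggestion set is a flat span list plus per-doc lengths, roughly mirroring `Ragged.data` and `Ragged.lengths`.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]
# (flat list of spans for all docs, number of spans per doc)
Suggestions = Tuple[List[Span], List[int]]


def merge_span_lists(suggestions: Sequence[Suggestions]) -> Suggestions:
    """Merge several suggestion sets doc by doc, dropping duplicate spans."""
    if not suggestions:
        return [], []
    n_docs = len(suggestions[0][1])
    if any(len(lengths) != n_docs for _, lengths in suggestions):
        raise ValueError("all suggestion sets must cover the same docs")
    merged: List[Span] = []
    merged_lengths: List[int] = []
    # One running offset per suggestion set into its flat span list.
    offsets = [0] * len(suggestions)
    for doc_idx in range(n_docs):
        doc_spans = set()
        for k, (spans, lengths) in enumerate(suggestions):
            n = lengths[doc_idx]
            doc_spans.update(spans[offsets[k] : offsets[k] + n])
            offsets[k] += n
        unique = sorted(doc_spans)  # lexicographic order within each doc
        merged.extend(unique)
        merged_lengths.append(len(unique))
    return merged, merged_lengths


# Two docs: the first has 3 tokens in one sentence, the second has 2.
ngrams = ([(0, 1), (1, 2), (2, 3), (0, 1), (1, 2)], [3, 2])
sentences = ([(0, 3), (0, 2)], [1, 1])
merged, lengths = merge_span_lists([ngrams, sentences])
print(merged)   # [(0, 1), (0, 3), (1, 2), (2, 3), (0, 1), (0, 2), (1, 2)]
print(lengths)  # [4, 3]
```

The lengths list is what lets a consumer split the flat span list back into per-doc groups, which is why the merge has to deduplicate within each doc rather than across the whole batch.
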
Wow, that's awesome! Thanks adriane! 🥳
Can you remove the formatting changes to
Can you add documentation for these to the top-level README?
Can you add sample configs in the README for one of the types of suggesters that also explains how to use Also explicitly how to use
Not to keep nosing into y'all's business, but once this merges would consumer (my) feedback on trying a model using this be warranted/appreciated?
@Lolologist: Feedback is welcome, that's pretty much what
Added example snippets to the readme with 1f52f1e 🎉
This PR adds three new experimental suggester functions for the `spancat` component and a spaCy project showcasing how to use them in a `config.cfg` file.

- Subtree Suggester: `subtrees` of individual tokens
- Chunk Suggester: `noun_chunks`
- Sentence Suggester: `sentences`
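
As a rough picture of what "subtrees of individual tokens" can mean, the sketch below derives one span per token from a list of head indices. It is a library-free illustration, not the code in this PR: `subtree_spans` is a hypothetical name, and it assumes each span runs from a token's leftmost to its rightmost descendant (end-exclusive), with the root pointing at itself.

```python
from typing import List, Tuple


def subtree_spans(heads: List[int]) -> List[Tuple[int, int]]:
    """Return unique (start, end) spans covering each token's subtree.

    heads[i] is the index of token i's head; the root is its own head.
    """
    n = len(heads)
    left = list(range(n))   # leftmost descendant found so far
    right = list(range(n))  # rightmost descendant found so far
    for i in range(n):
        # Walk from token i up to the root, widening every ancestor's span.
        node = i
        steps = 0
        while heads[node] != node:
            node = heads[node]
            left[node] = min(left[node], i)
            right[node] = max(right[node], i)
            steps += 1
            if steps > n:
                raise ValueError("heads contain a cycle")
    return sorted({(left[i], right[i] + 1) for i in range(n)})


# "She ate the green apple": "ate" (1) is the root, "apple" (4) heads "the" and "green".
heads = [1, 1, 4, 4, 1]
print(subtree_spans(heads))  # [(0, 1), (0, 5), (2, 3), (2, 5), (3, 4)]
```

Leaf tokens contribute single-token spans, so a subtree suggester of this shape overlaps with unigram suggestions, which is one reason deduplication matters when suggesters are combined.
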
These suggesters also come with the `ngram` functionality, which allows users to set a list of `sizes` for suggesting individual ngrams.

The spaCy project covers:

- `frozen_components` & `annotating_components`
- `registry`
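
The `sizes` option mentioned above can be illustrated without spaCy. The sketch below enumerates every n-gram span of the requested sizes for a document of a given length. `ngram_spans` is a hypothetical helper, and the ordering (all spans of one size before the next) is an assumption rather than a guarantee about the built-in suggester.

```python
from typing import List, Sequence, Tuple


def ngram_spans(n_tokens: int, sizes: Sequence[int]) -> List[Tuple[int, int]]:
    """List (start, end) token spans, end-exclusive, for each n-gram size."""
    spans = []
    for size in sizes:
        if size < 1:
            raise ValueError("n-gram sizes must be positive")
        # A doc shorter than `size` yields no spans for that size.
        for start in range(n_tokens - size + 1):
            spans.append((start, start + size))
    return spans


print(ngram_spans(4, [1, 2, 3]))
# [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2), (1, 3), (2, 4), (0, 3), (1, 4)]
```

The number of candidates grows with both document length and the number of sizes, which is worth keeping in mind when adding ngrams on top of another suggester.
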