Add SpanCategorizer component #6747
Conversation
The design makes it a bit awkward to have a suggester that's built from the data. For instance, what if you want any span that was seen as an entity in the training data to be suggested? There's currently no good way to do that. It's also really hard to provide data files to the suggester. The best workaround at the moment would be to move the suggestion into a different component that sets the candidates into the span group. Maybe we want to recommend that pattern instead of trying to build a more complex API for non-trivial suggester functions?
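As a minimal sketch of the "suggest anything seen as an entity in the training data" idea: the helper names below are illustrative, not spaCy API, and the sketch works on plain token lists rather than `Doc` objects (a real suggester would return the offsets as a `Ragged` array).

```python
def collect_seen_phrases(train_examples):
    """Gather the token sequences of every span labelled in training.

    Each example is a (tokens, spans) pair, where spans are
    (start, end) token offsets. Names here are hypothetical.
    """
    seen = set()
    for tokens, spans in train_examples:
        for start, end in spans:
            seen.add(tuple(tokens[start:end]))
    return seen


def suggest_seen_spans(tokens, seen, max_len=4):
    """Propose (start, end) offsets whose token sequence was seen
    as an entity in the training data."""
    candidates = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            if tuple(tokens[start:end]) in seen:
                candidates.append((start, end))
    return candidates
```

For example, if `("New", "York")` was labelled in training, `suggest_seen_spans(["I", "love", "New", "York"], seen)` proposes `(2, 4)`. The awkwardness the comment describes is real: `seen` has to be built from the data before the suggester exists, which is why moving suggestion into a separate component may be the cleaner pattern.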
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so that it's possible to evaluate multiple `spancat` components in the same pipeline.

* Use the (intentionally very short) default spans key `sc` in the `SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the returned score
* Revert the addition of the `attr_name` argument to `score_spans` and adjust the key in the `getter` instead

Note that for `spancat` components with a custom `spans_key`, the score weights currently need to be modified manually in `[training.score_weights]` for them to be available during training. To suppress the default score weights `spans_sc_p/r/f` during training, set them to `null` in `[training.score_weights]`.
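As a sketch of the note above, the manual weights for the default key `sc` might look like this in the training config (the exact values are illustrative; only the `spans_{spans_key}_*` naming follows from the change described):

```ini
[training.score_weights]
# Optimize the spancat F-score; report precision/recall without weighting them.
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0
```

A `spancat` with a custom key, say `spans_key = "events"`, would need `spans_events_f` (etc.) added here by hand, and `spans_sc_p/r/f` set to `null` to suppress the defaults.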
I think this looks good now, with the experimental label. Anything else outstanding?

Wait, I think one of the scoring changes doesn't generalize. Let me have a look...
* Add spans to Evaluate CLI
* Change to `spans_key`
* Add spans per_type output

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix GPU issues
* Require thinc >=8.0.6
* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests
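The edge cases above can be sketched with a simplified ngram suggester that works over token counts only (the real suggester operates on batches of `Doc` objects and returns a `Ragged` array; this standalone version just shows the offset arithmetic):

```python
def ngram_suggester(doc_lengths, sizes=(1, 2, 3)):
    """Suggest all ngram (start, end) offsets for each doc length.

    range(length - size + 1) naturally covers the edge cases:
    it includes the final ngram (length == size gives exactly one
    span) and yields nothing when the doc is shorter than the size.
    """
    all_spans = []
    for length in doc_lengths:
        spans = []
        for size in sizes:
            for start in range(length - size + 1):
                spans.append((start, start + size))
        all_spans.append(spans)
    return all_spans
```

A batch of empty docs, `ngram_suggester([0, 0])`, produces empty span lists rather than an error, which is the "batches of docs that result in no ngrams" case.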
I wanted to write another new component to give the v3 config/model/component systems another test. It also fits in well with the new `SpanGroup` class.

The `SpanCategorizer` component provides a version of the idea discussed in #3961. The component takes a function that proposes `(start, end)` offsets in the documents as possible spans to be recognized. A model then assigns label probabilities to each span. Spans with probability above a given threshold are added to a span group under a specified key. The span group is also used to get training data.

I haven't tested this on an actual task yet, and I think the default model I've proposed probably won't work very well.

Edit: Now testing on an Indonesian NER dataset. It does work, although the model I have is indeed not so good: the `SpanCategorizer` doesn't beat the `EntityRecognizer` yet, but the accuracies are already looking okay. I'm sure an attention layer over the spans will do better than the current janky thing I'm doing. The `SpanCategorizer` is classifying all ngrams of length 1, 2, 3 or 4. The model also doesn't get to exploit the knowledge that the task has no overlapping or nested entities. I'm sure adding some Viterbi decoding would push its score up a bit higher.

Another really nice thing about this component is that it predicts directly over the spans, so we'll finally have a component that can give meaningful entity confidences 🎉. You'll be able to ask the model the probability that any arbitrary span of text is an entity -- you just have to propose the span in the suggester.
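The predict-over-spans step described above can be sketched as a simple thresholding pass (names and data layout are illustrative; the real component works on model output arrays and writes `Span` objects into `doc.spans[key]`):

```python
def select_spans(candidates, probs, labels, threshold=0.5):
    """Keep (start, end, label, score) for every candidate span whose
    probability for a label clears the threshold.

    candidates: list of (start, end) offsets from the suggester
    probs: one row of per-label probabilities per candidate
    Overlapping and nested spans are allowed, and the raw score is
    kept, which is what makes meaningful entity confidences possible.
    """
    selected = []
    for (start, end), row in zip(candidates, probs):
        for label, p in zip(labels, row):
            if p >= threshold:
                selected.append((start, end, label, p))
    return selected
```

For example, with candidates `[(0, 2), (1, 3)]`, rows `[[0.9, 0.1], [0.2, 0.6]]`, and labels `["PER", "ORG"]`, the first span is kept as `PER` with score 0.9 and the second as `ORG` with score 0.6.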
Types of change
Enhancement.
Checklist