
Add SpanCategorizer component #6747

Merged · 70 commits merged into master from feature/span-categorizer-v3 on Jun 24, 2021

Conversation

@honnibal (Member) commented on Jan 17, 2021

I wanted to write another new component to give the v3 config/model/component systems another test. It also fits in well with the new SpanGroup class.

The SpanCategorizer component provides a version of the idea discussed in #3961. The component takes a function that proposes (start, end) offsets in the documents as possible spans to be recognized. A model then assigns label probabilities to each span. Spans with a probability above a given threshold are added to a span group under a specified key, and the same span group is used to read the training data.
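
A minimal sketch of the intended usage, for illustration only: `spans_key` and `threshold` are the component's actual config settings, while the labels, text, and values here are made up, and the untrained model's predictions are essentially random.

```python
import spacy

nlp = spacy.blank("en")
# "spans_key" names the span group the component writes to;
# spans scoring above "threshold" are kept.
spancat = nlp.add_pipe("spancat", config={"spans_key": "sc", "threshold": 0.5})
spancat.add_label("PERSON")
spancat.add_label("ORG")
nlp.initialize()  # untrained here, so the output below is not meaningful

doc = nlp("Budi Santoso works at Acme in Jakarta.")
# Predicted spans land in the span group under the configured key:
for span in doc.spans["sc"]:
    print(span.text, span.label_)
```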

I haven't tested this on an actual task yet, and I think the default model I've proposed probably won't work very well.

Edit: Now testing on an Indonesian NER dataset. It does work: although the model I have is indeed not so good, the accuracies are already looking okay.

| Component | NER P | NER R | NER F |
| --- | --- | --- | --- |
| SpanCategorizer | 76.0 | 62.7 | 68.7 |
| EntityRecognizer | 71.1 | 66.8 | 68.9 |

I'm sure an attention layer over the spans will do better than the current janky thing I'm doing. The SpanCategorizer is classifying all ngrams of length 1, 2, 3 or 4. The model also doesn't get to exploit the knowledge that the task has no overlapping or nested entities. I'm sure adding some Viterbi decoding would push its score up a bit higher.
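
For reference, a short sketch of what the ngram suggester proposes. `spacy.ngram_suggester.v1` is the registered default; the text and sizes are illustrative, and the output is a thinc `Ragged` of candidate (start, end) token offsets:

```python
import spacy
from spacy.util import registry

nlp = spacy.blank("en")
doc = nlp("give it to the span categorizer")

# The default suggester proposes every ngram of the given sizes.
make_suggester = registry.misc.get("spacy.ngram_suggester.v1")
suggester = make_suggester(sizes=[1, 2, 3, 4])

candidates = suggester([doc])
print(candidates.dataXd)   # one (start, end) row per candidate span
print(candidates.lengths)  # number of candidates per doc
```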

Another really nice thing about this component is that it predicts directly over the spans, so we'll finally have a component that can give meaningful entity confidences 🎉 . You'll be able to ask the model the probability that any arbitrary span of text is an entity -- you just have to propose the span in the suggester.
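
A hedged sketch of reading those confidences back out. This assumes a trained pipeline with spans_key `"sc"`, and that the spaCy version in use stores per-span scores on the span group's `attrs`; treat that as an assumption, not guaranteed API.

```python
# Assumes `doc` came from a pipeline with a trained spancat component.
group = doc.spans["sc"]
scores = group.attrs.get("scores")  # may not be populated in every version
if scores is not None:
    for span, score in zip(group, scores):
        print(f"{span.text!r} {span.label_} p={float(score):.2f}")
```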

Types of change

Enhancement.

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@honnibal changed the base branch from master to develop on January 17, 2021 10:01
@ines added the labels enhancement (Feature requests and improvements), v3.0 (Related to v3.0), and feat / pipeline (Feature: Processing pipeline and components) on Jan 18, 2021
@honnibal (Member, Author) commented:

The design makes it a bit awkward to have a suggester that's built from the data. For instance, what if you want any span that was seen as an entity in the training data to be suggested? There's currently no good way to do that. It's also really hard to provide data files to the suggester.

The best workaround at the moment would be to move the suggestion into a different component that writes the candidates into doc.spans under a different key. The suggester function would then just read that key from doc.spans.

Maybe we want to recommend that pattern instead of trying to build a more complex API for non-trivial suggester functions?
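
A hedged sketch of that pattern, assuming an upstream component has already written candidate spans to `doc.spans` under some key; the registry name, function names, and default key here are hypothetical:

```python
from typing import List, Optional

from thinc.api import Ops, get_current_ops
from thinc.types import Ragged
from spacy.tokens import Doc
from spacy.util import registry

@registry.misc("doc_spans_suggester.v1")  # hypothetical name
def make_doc_spans_suggester(candidates_key: str = "candidates"):
    """Suggest whatever spans an earlier component stored under candidates_key."""
    def suggest(docs: List[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        offsets = []
        lengths = []
        for doc in docs:
            spans = doc.spans.get(candidates_key, [])
            offsets.extend((span.start, span.end) for span in spans)
            lengths.append(len(spans))
        # Flat (start, end) rows plus a per-doc length array, as spancat expects.
        data = ops.asarray2i(offsets) if offsets else ops.xp.zeros((0, 2), dtype="int32")
        return Ragged(data, ops.asarray1i(lengths))
    return suggest
```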

svlandeg and others added 4 commits June 18, 2021 11:42
Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.

* Use the (intentionally very short) default spans key `sc` in the
  `SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
  returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
  the key in the `getter` instead.

Note that for `spancat` components with a custom `span_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.
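
For instance, a config block along these lines should silence the defaults and weight a custom key instead; the `clauses` key and the specific weights are made up for illustration:

```ini
[training.score_weights]
# Suppress the default spancat score weights:
spans_sc_f = null
spans_sc_p = null
spans_sc_r = null
# Weight a spancat component with the custom spans_key "clauses" instead:
spans_clauses_f = 1.0
spans_clauses_p = 0.0
spans_clauses_r = 0.0
```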
@adrianeboyd (Contributor) commented on Jun 18, 2021

I think this looks good now, with the experimental label. Anything else outstanding?

Wait, I think one of the scoring changes doesn't generalize. Let me have a look...

adrianeboyd and others added 6 commits June 18, 2021 19:39
* Add Spans to Evaluate CLI

* Change to spans_key

* Add spans per_type output

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix GPU issues

* Require thinc >=8.0.6
* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests
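
A hedged sketch of the ngram edge cases named in that commit: docs shorter than, or exactly as long as, the ngram size should still yield consistent (possibly empty) suggestions. The texts here are illustrative:

```python
import spacy
from spacy.util import registry

nlp = spacy.blank("en")
suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[3])

# One doc too short for any trigram, one exactly trigram-sized, one longer.
docs = [nlp("hi"), nlp("one two three"), nlp("a slightly longer doc here")]
candidates = suggester(docs)

# Per-doc candidate counts line up with the flat (start, end) array:
assert candidates.lengths.sum() == candidates.dataXd.shape[0]
print(candidates.lengths)
```
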
@adrianeboyd merged commit f994615 into master on Jun 24, 2021
@svlandeg deleted the feature/span-categorizer-v3 branch on June 25, 2021 10:47