
Add SpanCategorizer component #6747

Merged · 70 commits merged into master from feature/span-categorizer-v3 on Jun 24, 2021

Conversation

@honnibal (Member) commented on Jan 17, 2021

I wanted to write another new component to give the v3 config/model/component systems another test. It also fits in well with the new SpanGroup class.

The SpanCategorizer component provides a version of the idea discussed in #3961. The component takes a function that proposes (start, end) offsets in the documents as possible spans to be recognized. A model then assigns label probabilities to each span. Spans with a probability above a given threshold are added to a span group under a specified key, and the same span group is used to read the training data.
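
A minimal sketch of the intended usage, for illustration only: `spans_key` and `threshold` are the component's actual config settings, while the labels, text, and values here are made up, and the untrained model's predictions are essentially random.

```python
import spacy

nlp = spacy.blank("en")
# "spans_key" names the span group the component writes to;
# spans scoring above "threshold" are kept.
spancat = nlp.add_pipe("spancat", config={"spans_key": "sc", "threshold": 0.5})
spancat.add_label("PERSON")
spancat.add_label("ORG")
nlp.initialize()  # untrained here, so the output below is not meaningful

doc = nlp("Budi Santoso works at Acme in Jakarta.")
# Predicted spans land in the span group under the configured key:
for span in doc.spans["sc"]:
    print(span.text, span.label_)
```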

I haven't tested this on an actual task yet, and I think the default model I've proposed probably won't work very well.

Edit: Now testing on an Indonesian NER dataset. It does work: although the model I have is indeed not so good, the accuracies are already looking okay.

| Component | NER P | NER R | NER F |
| --- | --- | --- | --- |
| SpanCategorizer | 76.0 | 62.7 | 68.7 |
| EntityRecognizer | 71.1 | 66.8 | 68.9 |

I'm sure an attention layer over the spans will do better than the current janky thing I'm doing. The SpanCategorizer is classifying all ngrams of length 1, 2, 3 or 4. The model also doesn't get to exploit the knowledge that the task has no overlapping or nested entities. I'm sure adding some Viterbi decoding would push its score up a bit higher.
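
For reference, a short sketch of what the ngram suggester proposes. `spacy.ngram_suggester.v1` is the registered default; the text and sizes are illustrative, and the output is a thinc `Ragged` of candidate (start, end) token offsets:

```python
import spacy
from spacy.util import registry

nlp = spacy.blank("en")
doc = nlp("give it to the span categorizer")

# The default suggester proposes every ngram of the given sizes.
make_suggester = registry.misc.get("spacy.ngram_suggester.v1")
suggester = make_suggester(sizes=[1, 2, 3, 4])

candidates = suggester([doc])
print(candidates.dataXd)   # one (start, end) row per candidate span
print(candidates.lengths)  # number of candidates per doc
```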

Another really nice thing about this component is that it predicts directly over the spans, so we'll finally have a component that can give meaningful entity confidences 🎉 . You'll be able to ask the model the probability that any arbitrary span of text is an entity -- you just have to propose the span in the suggester.
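
A hedged sketch of reading those confidences back out. This assumes a trained pipeline with spans_key `"sc"`, and that the spaCy version in use stores per-span scores on the span group's `attrs`; treat that as an assumption, not guaranteed API.

```python
# Assumes `doc` came from a pipeline with a trained spancat component.
group = doc.spans["sc"]
scores = group.attrs.get("scores")  # may not be populated in every version
if scores is not None:
    for span, score in zip(group, scores):
        print(f"{span.text!r} {span.label_} p={float(score):.2f}")
```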

Types of change

Enhancement.

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@honnibal changed the base branch from master to develop on January 17, 2021 10:01
@ines added the labels enhancement (Feature requests and improvements), v3.0 (Related to v3.0), and feat / pipeline (Feature: Processing pipeline and components) on Jan 18, 2021
@honnibal (Member, Author) commented:

The design makes it a bit awkward to have a suggester that's built from the data. For instance, what if you want any span that was seen as an entity in the training data to be suggested? There's currently no good way to do that. It's also really hard to provide data files to the suggester.

The best workaround at the moment would be to move the suggestion into a different component that writes the candidates into doc.spans under a different key. The suggester function would then just read that key from doc.spans.

Maybe we want to recommend that pattern instead of trying to build a more complex API for non-trivial suggester functions?
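
A hedged sketch of that pattern, assuming an upstream component has already written candidate spans to `doc.spans` under some key; the registry name, function names, and default key here are hypothetical:

```python
from typing import List, Optional

from thinc.api import Ops, get_current_ops
from thinc.types import Ragged
from spacy.tokens import Doc
from spacy.util import registry

@registry.misc("doc_spans_suggester.v1")  # hypothetical name
def make_doc_spans_suggester(candidates_key: str = "candidates"):
    """Suggest whatever spans an earlier component stored under candidates_key."""
    def suggest(docs: List[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        offsets = []
        lengths = []
        for doc in docs:
            spans = doc.spans.get(candidates_key, [])
            offsets.extend((span.start, span.end) for span in spans)
            lengths.append(len(spans))
        # Flat (start, end) rows plus a per-doc length array, as spancat expects.
        data = ops.asarray2i(offsets) if offsets else ops.xp.zeros((0, 2), dtype="int32")
        return Ragged(data, ops.asarray1i(lengths))
    return suggest
```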

svlandeg and others added 4 commits June 18, 2021 11:42
Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.

* Use the (intentionally very short) default spans key `sc` in the
  `SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
  returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
  the key in the `getter` instead.

Note that for `spancat` components with a custom `span_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.
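
For instance, a config block along these lines should silence the defaults and weight a custom key instead; the `clauses` key and the specific weights are made up for illustration:

```ini
[training.score_weights]
# Suppress the default spancat score weights:
spans_sc_f = null
spans_sc_p = null
spans_sc_r = null
# Weight a spancat component with the custom spans_key "clauses" instead:
spans_clauses_f = 1.0
spans_clauses_p = 0.0
spans_clauses_r = 0.0
```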
@adrianeboyd (Contributor) commented on Jun 18, 2021

I think this looks good now, with the experimental label. Anything else outstanding?

Wait, I think one of the scoring changes doesn't generalize. Let me have a look...

adrianeboyd and others added 6 commits June 18, 2021 19:39
* Add Spans to Evaluate CLI

* Change to spans_key

* Add spans per_type output

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix GPU issues

* Require thinc >=8.0.6
* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests
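
A hedged sketch of the ngram edge cases named in that commit: docs shorter than, or exactly as long as, the ngram size should still yield consistent (possibly empty) suggestions. The texts here are illustrative:

```python
import spacy
from spacy.util import registry

nlp = spacy.blank("en")
suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[3])

# One doc too short for any trigram, one exactly trigram-sized, one longer.
docs = [nlp("hi"), nlp("one two three"), nlp("a slightly longer doc here")]
candidates = suggester(docs)

# Per-doc candidate counts line up with the flat (start, end) array:
assert candidates.lengths.sum() == candidates.dataXd.shape[0]
print(candidates.lengths)
```
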
@adrianeboyd merged commit f994615 into master on Jun 24, 2021
@svlandeg deleted the feature/span-categorizer-v3 branch on June 25, 2021 10:47