Multiple textcat_multilabel models - only run on demand #10177

NixBiks · 2022-01-31T16:14:14Z

I have a pipeline where I need to tag the whole document with a trained textcat_multilabel component. Based on the labels, I want to run another textcat_multilabel model. What would a config look like for that (just in broad strokes)? Also, how can I make sure that my models don't automatically run when I create the document, i.e. on nlp(text)? I only want to run them on demand.

polm · 2022-02-01T04:17:23Z

What I would do is have two pipelines. Call them the generalist (all data) and the specialist (deals with a subset).

You would train them each as usual, using a subset of data for the specialist, based on what you'll actually pass it.

For inference you can do something like this:

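import spacy

# Load the two separately trained pipelines.
generalist = spacy.load("generalist")
specialist = spacy.load("specialist")

doc = generalist("This is some input text.")
# Run the specialist only when the generalist scores the trigger label highly.
if doc.cats["blarg"] > 0.5:
    doc = specialist(doc)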

We recently added the ability to pass docs to pipelines. This is mainly intended for adding extra data, so I'm not sure it helps much here, but it is an option.

You could also train one pipeline with two textcat components. You would disable the specialist on the first call (nlp(text, disable=["specialist"])). Then, if you need the specialist, call nlp again, disabling the generalist on the second call. That would be a little weird, but it would take up less memory and you might benefit from the shared tok2vec.
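
The two-call idea might look roughly like the sketch below. It is only an illustration under some assumptions: classify_two_pass is a hypothetical helper name, "generalist" and "specialist" are placeholder component names, and nlp is assumed to be a loaded spaCy pipeline from a version recent enough to accept a Doc as input. The helper does not import spaCy itself; it only relies on nlp being callable with a disable argument and on the returned doc having a cats dict.

```python
from typing import Any


def classify_two_pass(
    nlp: Any,
    text: str,
    trigger: str,
    threshold: float = 0.5,
    generalist: str = "generalist",  # placeholder component name
    specialist: str = "specialist",  # placeholder component name
) -> Any:
    """Run the generalist first, then the specialist only if needed."""
    # First call: every component runs except the specialist.
    doc = nlp(text, disable=[specialist])
    if doc.cats.get(trigger, 0.0) > threshold:
        # Second call on the same Doc: skip the generalist this time.
        doc = nlp(doc, disable=[generalist])
    return doc
```

With a real pipeline this would be called as classify_two_pass(nlp, "This is some input text.", "blarg"). Note that any component not listed in disable runs on both calls.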

Another separate option would be to combine your classification into a single step. So if you have a label alpha that doesn't specialize and a label beta that could be classified into beta_a, beta_b, beta_c, etc., you could just collapse the labels and do it all in one step. If the class balance gets weird that could cause issues, which is why I assume you aren't trying it already, but it would mean you get the full classification with no extra computational overhead.
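
To make the label-collapsing idea concrete, here is a small plain-Python sketch. Everything in it is an assumption for illustration: the helper names collapse_hierarchy and parent_scores, the dict-of-lists hierarchy format, the "_" separator, and the choice to score a parent as the maximum of its children's scores. None of this is a spaCy API.

```python
from typing import Dict, List


def collapse_hierarchy(hierarchy: Dict[str, List[str]], sep: str = "_") -> List[str]:
    """Turn {"alpha": [], "beta": ["a", "b"]} into flat textcat labels."""
    flat: List[str] = []
    for parent, children in hierarchy.items():
        if children:
            # A parent with sub-labels is replaced by one label per child.
            flat.extend(f"{parent}{sep}{child}" for child in children)
        else:
            # A parent without sub-labels is kept as-is.
            flat.append(parent)
    return flat


def parent_scores(
    cats: Dict[str, float], hierarchy: Dict[str, List[str]], sep: str = "_"
) -> Dict[str, float]:
    """Recover one score per top-level label from the flat scores.

    Assumption: a parent's score is the maximum of its children's scores.
    """
    scores: Dict[str, float] = {}
    for parent, children in hierarchy.items():
        if children:
            scores[parent] = max(
                cats.get(f"{parent}{sep}{child}", 0.0) for child in children
            )
        else:
            scores[parent] = cats.get(parent, 0.0)
    return scores
```

The flat list would be used as the label set of a single textcat_multilabel component, and parent_scores could be applied to doc.cats afterwards if the top-level labels are still needed.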

We've been meaning to make a hierarchical textcat component that could probably help with this, but we haven't started work on it yet.

Can you clarify why you only want to run the second textcat on demand? I don't think the overhead of a textcat should be that high in general.

4 replies

NixBiks Feb 1, 2022
Author

Thanks. This is very helpful. I guess I could extract the single components from nlp.pipeline too and then call them as needed.

One question though: if I do something like this

doc = generalist("This is some input text.")
if doc.cats["blarg"] > 0.5:
    doc = specialist(doc)

then you tokenize twice, right? Or does Language reuse tokenization when called on a Doc instead of str?

My use case is simply that milliseconds matter...

adrianeboyd Feb 1, 2022

If you pass a doc the pipeline skips the tokenizer entirely, and you can always call individual pipes from the same pipeline on a doc in steps rather than the whole pipeline at once.
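
Calling the pipes in steps could be sketched as below. This is a hedged illustration, not a tested recipe: run_in_steps is a hypothetical helper, "specialist" is a placeholder component name, and nlp is assumed to be a loaded spaCy pipeline. The sketch relies only on nlp.make_doc(text) for tokenization and on nlp.pipeline being a list of (name, component) pairs whose components can be called on a doc.

```python
from typing import Any, Sequence


def run_in_steps(
    nlp: Any,
    text: str,
    trigger: str,
    threshold: float = 0.5,
    on_demand: Sequence[str] = ("specialist",),  # placeholder component name
) -> Any:
    """Tokenize once, run the always-on pipes, then the rest only if needed."""
    doc = nlp.make_doc(text)  # tokenization happens exactly once
    deferred = []
    for name, pipe in nlp.pipeline:
        if name in on_demand:
            # Hold back on-demand components until the trigger is checked.
            deferred.append(pipe)
        else:
            doc = pipe(doc)
    if doc.cats.get(trigger, 0.0) > threshold:
        for pipe in deferred:
            doc = pipe(doc)
    return doc
```

Because the deferred components run on the same doc, nothing that ran before them (the tokenizer or a shared tok2vec, for example) has to be repeated.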

If you have two pipelines, you can run into errors if the doc vocab doesn't match the pipeline vocab, so you'd want to load the pipelines with a shared single vocab:

import spacy

generalist = spacy.load("generalist")
specialist = spacy.load("specialist", vocab=generalist.vocab)

In turn, this only works if both pipelines have the same vectors. If they don't have the same vectors, you'd need to reload the doc with the new vocab, which can be done relatively quickly with Doc.to_dict and Doc.from_dict and the right excludes.

meliascosta Jul 6, 2022

I ran into a similar problem while working on a classifier with labels organized in a hierarchy. Has the development of this hierarchical textcat started? Is there a discussion on it that I can follow?

polm Jul 7, 2022

No, no work has been done on that yet. We'll open a PR when it starts.
