Where to do preprocessing step when training and using `textcat` model? #13216

orept · 2024-01-01T15:56:47Z

I want to train a `textcat` model.
The texts need a preprocessing step that replaces named entities with their types, e.g. "Apple released a new gadget" -> "-ORG- released a new gadget".
This step has to be applied to every doc in the training set and also to the texts the model will later classify.
The model will be deployed to a different service from the one where it was trained.
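
To make the intended transformation concrete, here is a minimal sketch in plain Python. It assumes the entity spans are already known as `(start_char, end_char, label)` tuples; in spaCy these values would typically be read from `ent.start_char`, `ent.end_char`, and `ent.label_` for each `ent` in `doc.ents`. The function name `mask_entities` and the hand-written span are illustrative only and are not part of spaCy.

```python
from typing import Iterable, Tuple

# (start_char, end_char, label): an assumed layout that mirrors the
# character offsets and label exposed by spaCy entity spans.
EntitySpan = Tuple[int, int, str]


def mask_entities(text: str, spans: Iterable[EntitySpan]) -> str:
    """Replace each entity span in `text` with a placeholder such as -ORG-."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already replaced.
            continue
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(f"-{label}-")        # placeholder for the entity
        cursor = end
    pieces.append(text[cursor:])           # untouched tail of the text
    return "".join(pieces)


# Hand-written span for illustration; a real pipeline would supply it.
example = "Apple released a new gadget"
masked = mask_entities(example, [(0, 5, "ORG")])
print(masked)  # -ORG- released a new gadget
```

The same function would have to run on the training texts and on the texts sent to the deployed model, which is why its placement matters.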

At first I thought it should just be a step in the packaged pipeline, but strangely I can't see a proper way to include it.
I considered writing a custom tokenizer and doing the replacement there, before the actual tokenization, but there is a note saying that "spaCy’s tokenization is non-destructive ... no information is added or removed during tokenization. This is kind of a core principle of spaCy’s Doc object".

So where is the proper place to include it?

Replies: 0 comments
