Where should the preprocessing step go when training and using a textcat model?
#13216
orept
started this conversation in
Help: Best practices
Replies: 0 comments
I want to train a textcat model.
As a preprocessing step, I need to replace named entities with their types: "Apple released a new gadget" -> "-ORG- released a new gadget".
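To make the transformation concrete, here is a minimal sketch in plain Python. The helper name `mask_entities` and the `(start, end, label)` span format are assumptions made for illustration; with spaCy, such spans could be collected from `doc.ents` via `ent.start_char`, `ent.end_char` and `ent.label_`.

```python
def mask_entities(text, entities):
    """Replace each (start, end, label) character span in `text` with "-LABEL-".

    `entities` is assumed to hold non-overlapping character spans.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(f"-{label}-")        # placeholder for the entity
        cursor = end
    pieces.append(text[cursor:])           # text after the last entity
    return "".join(pieces)


masked = mask_entities("Apple released a new gadget", [(0, 5, "ORG")])
print(masked)  # -ORG- released a new gadget
```

The function works on raw text, so it produces a new string rather than modifying anything in place.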
This step has to be applied to every doc in the training set and to every text the model will later classify.
The model will be deployed to a different service from the one where it was trained.
My first thought was that this should simply be a step in the packaged pipeline, but, oddly, I can't see a proper way to include it.
I considered writing a custom tokenizer and doing the replacement there, before the actual tokenization, but there is a note saying:
"spaCy’s tokenization is non-destructive ... no information is added or removed during tokenization. This is kind of a core principle of spaCy’s Doc object"
So where is the proper place for this step?
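For illustration only, one arrangement that avoids altering the text inside the tokenizer is to keep the replacement outside the Doc and ship it together with the classifier behind a single callable. This is a sketch under assumptions, not a method confirmed by spaCy: `MaskedTextClassifier`, `toy_preprocess` and `toy_classify` are hypothetical names, and the toy functions stand in for a real NER-based replacement and a trained textcat pipeline (whose scores would be read from `doc.cats`).

```python
from typing import Callable, Dict


class MaskedTextClassifier:
    """Hypothetical wrapper that applies one preprocessing function
    both when preparing training texts and when classifying new texts."""

    def __init__(
        self,
        preprocess: Callable[[str], str],
        classify: Callable[[str], Dict[str, float]],
    ) -> None:
        self.preprocess = preprocess
        self.classify = classify

    def prepare(self, text: str) -> str:
        # Training path: returns the text the classifier should be trained on.
        return self.preprocess(text)

    def __call__(self, text: str) -> Dict[str, float]:
        # Inference path: same preprocessing, then classification.
        return self.classify(self.preprocess(text))


# Toy stand-ins (assumptions): a real setup would use an NER model for the
# replacement and a trained textcat pipeline for the scores.
def toy_preprocess(text: str) -> str:
    return text.replace("Apple", "-ORG-")


def toy_classify(text: str) -> Dict[str, float]:
    return {"TECH": 1.0 if "-ORG-" in text else 0.0}


model = MaskedTextClassifier(toy_preprocess, toy_classify)
training_text = model.prepare("Apple released a new gadget")
scores = model("Apple released a new gadget")
```

Because both paths go through the same `preprocess` function, the training data and the deployed service cannot drift apart in how entities are replaced.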