How should I preprocess text for spaCy? #10243
polm started this conversation in Help: Best practices
There's usually no need to preprocess text before passing it to spaCy.
For older NLP methods, such as those predating neural models, it was common to preprocess text by making everything lowercase, removing stopwords, lemmatizing or stemming words, and so on. For modern neural methods in general, and spaCy in particular, this kind of preprocessing will almost always hurt performance and should not be performed.
That said, note that "preprocessing" means many different things. spaCy is designed to process natural language, like a newspaper article, blog post, or this FAQ, without markup. If your text is in HTML, PDF, or another format, you'll have to extract the plain text contents before passing them to spaCy; that kind of preprocessing is still required.
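As a rough illustration of that extraction step for HTML, here is a minimal sketch that uses only Python's standard-library `html.parser`. The names `TextExtractor` and `html_to_text` are made up for this example and are not part of spaCy. Real-world pages often need a dedicated extraction library, and PDF needs different tooling entirely.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()  # character references are decoded by default
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes as-is, then collapse whitespace runs to single spaces.
    return " ".join("".join(parser.parts).split())
```

The returned string is plain text that can then be passed to a loaded pipeline, for example `nlp(html_to_text(page))`. Note that this sketch only separates blocks when the markup itself contains whitespace between them.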
The most important consideration with spaCy's models is that the input should resemble the training data. Our pretrained pipelines are trained on complete, grammatical sentences, like newspaper articles. These pretrained pipelines will typically work well if you feed them similar texts. However, if your text has many errors, or is much more informal, you might need to train your own models to get more accurate results. Note that we typically recommend training your own models on your specific domain and input texts anyway.
One kind of preprocessing that can be helpful is normalizing spaces and punctuation. The training data used in spaCy's pretrained pipelines usually has very clean punctuation and only uses normal spaces, so if your text has a lot of unusual spaces, in-sentence newlines, or Unicode punctuation marks, the models may never have seen them before and may make odd predictions. In this case it can be helpful to replace punctuation with ASCII equivalents and replace whitespace with simple spaces.
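One possible way to replace Unicode punctuation with ASCII equivalents is a small translation table, sketched below. The mapping and the name `asciify_punctuation` are assumptions made for this example, not something spaCy provides, and the table covers only a few common characters, so extend it to match what actually appears in your data.

```python
# Illustrative mapping only; extend it for the characters in your own texts.
PUNCT_MAP = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
})


def asciify_punctuation(text):
    """Replace the mapped Unicode punctuation marks with ASCII equivalents."""
    return text.translate(PUNCT_MAP)
```

Characters that are not in the table, including legitimate non-ASCII letters, are left untouched.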
For example, this is how to replace all runs of whitespace with single spaces:
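A minimal sketch using Python's standard `re` module; the helper name `normalize_whitespace` is chosen for this example and is not a spaCy function:

```python
import re


def normalize_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text)
```

For `str` patterns, `\s` also matches Unicode whitespace such as non-breaking spaces. Leading or trailing whitespace is reduced to a single space but not removed; call `.strip()` on the result if that matters for your input.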
If you still have questions, feel free to open a new Discussion with details about your specific problem and reference this one.