Out of memory when using NER model #6497
-
Yes, the rule of thumb is that spaCy v2 requires ~1GB of memory per 100K characters, so long documents can be a problem. The NER models rely on relatively local features, so paragraphs are typically a good unit to use; we use paragraphs in most of our internal training and testing. The trained model will still perform fine when you apply it to longer documents (as long as you don't run out of memory, of course). In spaCy v2, the nlp.max_length setting acts as a hard limit for very long texts, mainly as a safeguard against accidentally running out of memory.
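For example, a minimal sketch of processing a long document paragraph by paragraph (the model name, file name, and the blank-line splitting heuristic here are just placeholders; pick whatever boundary makes sense for your data):

```python
import spacy

# Load only what we need for NER; tagger and parser are not required.
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])

def paragraphs(text):
    # Naive paragraph split on blank lines; skip empty chunks.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

long_text = open("long_document.txt").read()
for doc in nlp.pipe(paragraphs(long_text)):
    for ent in doc.ents:
        print(ent.text, ent.label_)
```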
-
@adrianeboyd I understand, thank you! I will try both and compare the performance!
-
@adrianeboyd My model has a limit of 1,000,000 characters, not 100,000 (I know this from model.max_length). However, I have no documents with a string length > model.max_length, and I still get memory errors.
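(For reference, this is roughly how I checked it; the texts list is just a placeholder for my corpus:)

```python
import spacy

model = spacy.load("de_core_news_lg")
texts = ["first document...", "second document..."]  # placeholder for my corpus

print(model.max_length)              # 1000000 for my model
print(max(len(t) for t in texts))    # the longest document I have
assert all(len(t) <= model.max_length for t in texts)
```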
-
I think I understand now. So it is a hard limit for very large documents, and on top of that spaCy needs ~1GB of memory for every 100K characters. So if I have a document with 800K characters, I need 8GB of RAM, and of course I cannot have documents with > 1M characters at all (unless I manually increase model.max_length). OK, this is very clear to me now. I am going to filter out documents with > 100K characters in my training pipeline, since I don't care about them (they are just 0.1% of the data), meaning that I will need at least 1GB of RAM. I might also try paragraph segmentation in the future to test the difference in metrics. Can you confirm that my understanding is correct? Thank you.
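Concretely, the filtering step I have in mind looks something like this (just a sketch; the file names and the {"text": ...} JSONL layout are assumptions):

```python
import json

MAX_CHARS = 100_000

# Keep only training samples that are at most 100K characters long.
with open("train.jsonl") as src, open("train_filtered.jsonl", "w") as dst:
    for line in src:
        sample = json.loads(line)
        if len(sample["text"]) <= MAX_CHARS:
            dst.write(line)
```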
-
The memory usage guideline is just a really, really rough estimate. It depends a lot on the pipeline and model parameters, so you'll have to figure out what works for your data. If you have enough memory (or you're doing processing that doesn't require much memory, like only tokenization), you can increase nlp.max_length.

While training, it's usually better to have shorter documents that are easier to batch and shuffle. For NER, I think a good recommendation would be anywhere from a paragraph up to a few pages.
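If you do go the max_length route, the adjustment itself is a one-liner. A minimal sketch (the model name and file are placeholders, and note this only helps if the components you actually run are memory-light):

```python
import spacy

# Tokenization only: the memory-hungry components are disabled.
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

text = open("very_long_document.txt").read()
nlp.max_length = max(nlp.max_length, len(text) + 1)

doc = nlp(text)
print(len(doc), "tokens")
```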
-
Thank you!
-
Bumping this with a question about the same issue: when I use a subset of my JSONL it works just fine :) (full description of my data: … and I use python -m spacy train)
-
I am trying to train a NER model on some documents with custom entity types and have the following problem:
At first I trained it with a small bunch of documents (100 in total) just to test my pipeline, and it worked. Later, I used all samples
(~20K) and it failed with an out-of-memory error. Then I truncated all samples to 3,000 characters and it worked with all examples. After some investigation I found that there are some documents in my dataset with > 800,000 characters, and I am thinking of filtering them out because they are not many (0.1%).
Here are my 3 questions:
Q1) Is this a good practice, or should I train the NER on sentences/paragraphs by segmenting the documents? The real use case is to run on complete documents in production, not on sentences/paragraphs. Can I train it on complete documents? Will it work? Are the pretrained models trained on whole documents? Will it behave the same on whole documents as on segmented documents?
Q2) Is it OK if some of my documents are normal in size while some others are much longer? For example, most are ~3,500 characters (web articles) while some others are > 100,000 characters.
Q3) When fine-tuning pretrained models such as de_core_news_lg on custom entity types, can we disable the tagger and parser, or do we have to keep them enabled? I want only the NER functionality. So, does NER depend on the tagger or parser to work? I do it like this:
spacy.load(model_name, disable=["parser", "tagger"])
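In full, my loading code looks roughly like this (model_name stands for e.g. de_core_news_lg, and the sample sentence is only for illustration):

```python
import spacy

model_name = "de_core_news_lg"
# Disable tagger and parser; I only want the NER component.
nlp = spacy.load(model_name, disable=["parser", "tagger"])

doc = nlp("Angela Merkel besuchte gestern Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])
```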
Thank you!