Out of memory when using NER model #6497
-
Yes, the rule of thumb is that spaCy v2 requires ~1GB of memory per 100K characters, so long documents can be a problem. The NER models rely on relatively local features, so paragraphs are typically a good unit to use; we use paragraphs in most of our internal training and testing. The trained model will still perform fine when you apply it to longer documents (as long as you don't run out of memory, of course). In spaCy v2, the nlp.max_length setting acts as a hard limit for very long texts, mainly as a safeguard against accidentally running out of memory.
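For example, a minimal sketch of processing a long document paragraph by paragraph (the model name, file name, and the blank-line splitting heuristic here are just placeholders; pick whatever boundary makes sense for your data):

```python
import spacy

# Load only what we need for NER; tagger and parser are not required.
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])

def paragraphs(text):
    # Naive paragraph split on blank lines; skip empty chunks.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

long_text = open("long_document.txt").read()
for doc in nlp.pipe(paragraphs(long_text)):
    for ent in doc.ents:
        print(ent.text, ent.label_)
```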
-
@adrianeboyd I understand, thank you! I will try both and compare the performance!
-
@adrianeboyd My model has a limit of 1,000,000 characters, not 100,000 (I know this from model.max_length). However, I have no documents with a string length > model.max_length, and I still get memory errors.
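(For reference, this is roughly how I checked it; the texts list is just a placeholder for my corpus:)

```python
import spacy

model = spacy.load("de_core_news_lg")
texts = ["first document...", "second document..."]  # placeholder for my corpus

print(model.max_length)              # 1000000 for my model
print(max(len(t) for t in texts))    # the longest document I have
assert all(len(t) <= model.max_length for t in texts)
```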
-
I think I understand now. So it is a hard limit for very large documents, and on top of that spaCy needs ~1GB of memory for every 100K characters. So if I have a document with 800K characters, I need 8GB of RAM, and of course I cannot have documents with > 1M characters at all (unless I manually increase model.max_length). OK, this is very clear to me now. I am going to filter out documents with > 100K characters in my training pipeline, since I don't care about them (they are just 0.1% of the data), meaning that I will need at least 1GB of RAM. I might also try paragraph segmentation in the future to test the difference in metrics. Can you confirm that my understanding is correct? Thank you.
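Concretely, the filtering step I have in mind looks something like this (just a sketch; the file names and the {"text": ...} JSONL layout are assumptions):

```python
import json

MAX_CHARS = 100_000

# Keep only training samples that are at most 100K characters long.
with open("train.jsonl") as src, open("train_filtered.jsonl", "w") as dst:
    for line in src:
        sample = json.loads(line)
        if len(sample["text"]) <= MAX_CHARS:
            dst.write(line)
```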
-
The memory usage guideline is just a really, really rough estimate. It depends a lot on the pipeline and model parameters, so you'll have to figure out what works for your data. If you have enough memory (or you're doing processing that doesn't require much memory, like only tokenization), you can increase nlp.max_length.

While training, it's usually better to have shorter documents that are easier to batch and shuffle. For NER, I think a good recommendation would be anywhere from a paragraph up to a few pages.
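If you do go the max_length route, the adjustment itself is a one-liner. A minimal sketch (the model name and file are placeholders, and note this only helps if the components you actually run are memory-light):

```python
import spacy

# Tokenization only: the memory-hungry components are disabled.
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

text = open("very_long_document.txt").read()
nlp.max_length = max(nlp.max_length, len(text) + 1)

doc = nlp(text)
print(len(doc), "tokens")
```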
-
Thank you!
-
Bumping this with a question about the same issue: when I use a subset of my JSONL it works just fine :) (full description of my data: … and I use python -m spacy train)
-
I am trying to train a NER model on some documents with custom entity types and have the following problem:
At first I trained it with a small bunch of documents (100 in total) just to test my pipeline, and it worked. Later, I used all samples
(~20K) and it failed with an out-of-memory error. Then I truncated all samples to 3,000 characters and it worked with all examples. After some investigation I found that there are some documents in my dataset with > 800,000 characters, and I am thinking of filtering them out because they are not many (0.1%).
Here are my 3 questions:
Q1) Is this a good practice, or should I train the NER on sentences/paragraphs by segmenting the documents? The real use case is to run on complete documents in production, not on sentences/paragraphs. Can I train it on complete documents? Will it work? Are the pretrained models trained on whole documents? Will it behave the same on whole documents as on segmented documents?
Q2) Is it OK if some of my documents are normal in size while some others are much longer? For example, most are ~3,500 characters (web articles) while some others are > 100,000 characters.
Q3) When fine-tuning pretrained models such as de_core_news_lg on custom entity types, can we disable the tagger and parser, or do we have to keep them enabled? I want only the NER functionality. So, does NER depend on the tagger or parser to work? I do it like this:
spacy.load(model_name, disable=["parser", "tagger"])
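In full, my loading code looks roughly like this (model_name stands for e.g. de_core_news_lg, and the sample sentence is only for illustration):

```python
import spacy

model_name = "de_core_news_lg"
# Disable tagger and parser; I only want the NER component.
nlp = spacy.load(model_name, disable=["parser", "tagger"])

doc = nlp("Angela Merkel besuchte gestern Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])
```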
Thank you!