Training with BERT Transformers #12115
Replies: 2 comments
-
Nothing in the config jumps out at me as an error, and I see that you've lowered the default batch size for the eval step. If you have long training texts, they may still be too long for the training batch size in `[training.batcher]`. Does the OOM error appear before the end of the first epoch? The transformer takes up a good chunk of GPU memory, so one test that can be helpful for checking that the rest of your config is okay is to confirm that it runs with a smaller transformer model (see the sketch below). One way we manage this when training the released transformer pipelines is to keep the training texts short, splitting long documents into paragraph-sized units.
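As a sketch of those two suggestions (the model name and batch settings below are illustrative examples, not values from this thread), the config changes might look like:

```ini
# Swap in a smaller Hugging Face model to check that the rest of the
# config is fine (the model name is only an example).
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "distilbert-base-uncased"

# Use smaller padded batches to lower peak GPU memory
# (example values; the stock trf configs use larger ones).
[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 500
buffer = 256
```

If training runs with the smaller model, the OOM is most likely down to the combination of model size, text length, and batch size rather than an error in the config itself.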
-
@alvaromarlo Did you succeed in creating a new BERT model including NER? I'm thinking about researching how to create a new Spanish model with NER, and some guidance would be nice 😊
-
Hi,
I am developing a spaCy model and I want to use BERT transformers. Starting from the bottom, I have an annotations file from Prodigy that we are converting to spaCy data with the `data-to-spacy` command. Once we have the train and dev data, we are following this quick tutorial (https://www.youtube.com/watch?v=Y_N_AO39rRg&t=1s) to train the model on Google Colab's GPU. We are also using a custom tokenizer.
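For reference, a typical `data-to-spacy` call looks roughly like this (the dataset name, output path, and split are placeholders, not the ones from this project):

```
prodigy data-to-spacy ./assets/corpus --ner my_ner_dataset --eval-split 0.2
```

This writes the train and dev `.spacy` files (plus a starter config) into the output directory.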
The config file we are using is this one:
The problem appears when we run the train command:
!python -m spacy train assets/corpus/config.cfg --code app/customize_tokenizer.py --output models/boe-b-section-4 --gpu-id 0
We receive this message at the first iteration:
Aborting and saving the final best model. Encountered exception: OutOfMemoryError('CUDA out of memory. Tried to allocate 42.00 MiB (GPU 0; 14.76 GiB total capacity; 12.43 GiB already allocated; 43.75 MiB free; 13.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF')
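For reference, the allocator setting that the error message itself suggests can be tried by exporting the variable before the same train command (the value 128 is only an example):

```
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python -m spacy train assets/corpus/config.cfg --code app/customize_tokenizer.py --output models/boe-b-section-4 --gpu-id 0
```

Note that this only helps when the failure is caused by fragmentation; it won't fix a batch that is genuinely too large for the GPU.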
We tried splitting the data into sentences, but the problem persists at the first iteration.
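One knob that is independent of sentence splitting is the transformer's span getter, which controls how many wordpieces go through the model per forward pass. A hedged sketch with smaller windows (the values are illustrative; the defaults in the stock trf configs are larger):

```ini
# Shorter strided spans mean fewer wordpieces per forward pass
# (example values, not from this thread).
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 64
stride = 48
```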
Is there something we are doing wrong? Is the config file correct? Can we print extra information during training to understand what's happening?
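On the last point, `spacy train` accepts a `--verbose` flag that prints more detailed messages during training, e.g.:

```
python -m spacy train assets/corpus/config.cfg --code app/customize_tokenizer.py --output models/boe-b-section-4 --gpu-id 0 --verbose
```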
Massive thanks