Adding domain specific vocabulary #9
My recommendation would be to just use the existing wordpiece vocab and run pre-training for more steps on the in-domain text, and it should learn the compositionality "for free". Keep in mind that with a wordpiece vocabulary there are basically no out-of-vocabulary words, and you don't really know which words were seen in the pre-training and which were not. Just because a word was split up into word pieces doesn't mean it's rare; in fact, many words which were split into wordpieces were seen 5,000+ times in the pre-training data. But if you want to add more vocab you can either:
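A quick way to see how the existing WordPiece vocab handles domain terms is to just tokenize them. This is a minimal sketch using this repo's tokenization.py, assuming the vocab.txt from an uncased checkpoint has been downloaded locally; the example words are made up and the exact splits depend on the vocab:

```python
# Inspect how the pre-trained WordPiece vocabulary splits domain-specific words.
# A word that gets split into several pieces is not out-of-vocabulary; it is
# composed from subword units that were seen during pre-training.
import tokenization  # this repo's tokenization.py

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt",  # path to a downloaded checkpoint's vocab
    do_lower_case=True)

for word in ["myocardial", "infarction", "electrophoresis"]:  # hypothetical domain terms
    print(word, "->", tokenizer.tokenize(word))
```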
@jacobdevlin-google huggingface/transformers#1413 (comment)

Some context on my situation, which I believe is quite typical: I have a reasonably large domain-specific text dataset available, all in English, and my end goal is to produce a domain-specific language model that can be used for various downstream NLP tasks (e.g., text classification, sentence similarity) after additional fine-tuning on those tasks.

From my current understanding, to first obtain that domain-specific language model I basically have two options: (1) train a tokenizer from scratch and then use that tokenizer to train an LM from scratch, or (2) modify the vocabulary of the existing pre-trained model and continue training it on my domain corpus. I am starting to explore the second option, but I am confused about how to properly modify the vocabulary in a way that makes sense for the model, and concerned about what other side effects this could cause.

To summarize: I'd really like to know if there is a low-cost way to do option 1 above (training an LM from scratch).
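For option (2), here is a rough sketch of what I have been trying, using the Hugging Face transformers library referenced above; the model name and the domain terms are placeholders, and the new embedding rows end up randomly initialized until further LM training on the in-domain corpus:

```python
# Extend the existing vocab with domain tokens and grow the embedding matrix.
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

domain_terms = ["myocardial", "stent", "angioplasty"]  # hypothetical examples
num_added = tokenizer.add_tokens(domain_terms)

# Resize so the new token ids get (randomly initialized) embedding rows;
# all pre-trained rows are kept unchanged.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, vocab size is now {len(tokenizer)}")
```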
Hi, thanks for the release!
I will need to add some domain-specific vocabulary. Do you have any suggestions on how to do it?
I was thinking of replacing some [unused#] tokens in the vocab file (so, if I'm not mistaken, they already have existing weights in the checkpoint files) to avoid extending the matrices, and then fine-tuning the LM on a domain-specific corpus. Something like the sketch below is what I have in mind (the domain terms and file names are made up):
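```python
# Swap a few [unused#] placeholder lines in vocab.txt for domain-specific words,
# so the vocab size (and therefore the checkpoint's embedding matrix shape) stays
# unchanged; the replaced slots keep their existing, essentially untrained, weights.
domain_terms = ["angioplasty", "stent", "troponin"]  # hypothetical examples

with open("vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

replaced = 0
for i, tok in enumerate(vocab):
    if replaced < len(domain_terms) and tok.startswith("[unused"):
        vocab[i] = domain_terms[replaced]
        replaced += 1

with open("vocab_domain.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```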
If it is feasible, I would also try to do a first LM fine-tuning pass with the existing vocab embeddings frozen, to only learn the new words, and then a second pass with everything unfrozen.
Do you think that's the right way to do it?
How can I freeze a subset of the embeddings? (Gradient masking?)
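For the gradient-masking idea, this is roughly what I am considering, assuming a PyTorch implementation such as Hugging Face transformers (the token ids below are placeholders):

```python
# Freeze the original embedding rows by zeroing their gradients; only the rows
# for the new/[unused] tokens receive updates.
import torch
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
emb = model.get_input_embeddings()            # nn.Embedding(vocab_size, hidden_size)

new_token_ids = [1, 2, 3]                     # hypothetical ids of the replaced/added tokens
trainable_mask = torch.zeros(emb.weight.size(0), 1)
trainable_mask[new_token_ids] = 1.0

# The hook multiplies the gradient row-wise, so frozen rows get a zero gradient.
emb.weight.register_hook(lambda grad: grad * trainable_mask.to(grad.device))
```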