Handling domain specific vocabulary #237
@hoonkai @bradfox2 Can you please tell me how the experiments went with the solutions proposed in issue #9 by @jacobdevlin-google? I want to fine-tune the model for a domain-specific problem. I used the existing WordPiece vocab and ran pre-training for 50000 steps on the in-domain text to learn the compositionality, and manual evaluation suggested the updated embeddings had improved. But I really want to add my domain words to the vocab, as they carry importance for my downstream tasks.
As mentioned in this comment from issue #9:
Keep in mind this doesn't make the model any better at contextualizing those words you've replaced, since all you are doing is assigning the randomly initialized [unusedX] to a word. In order to get the model to learn better embeddings for those words, you would have to pre-train BERT further using data that contains those words that you added to the vocab.
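For anyone attempting this, here is a minimal sketch of the vocab swap described in that quote (an assumption about how to do it, not an official recipe; the file paths and the NEW_WORDS list are placeholders):

```python
# Sketch: overwrite [unusedX] entries in a copy of vocab.txt with domain words.
# Keeping the vocab size and line order unchanged means the released
# checkpoint's embedding table still lines up with the token ids; the vectors
# at the replaced ids stay randomly initialized until further pre-training.
NEW_WORDS = ["ohm", "farad"]  # placeholder list of domain words to add

with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

# Collect the indices of the reserved [unusedX] slots.
unused_slots = [i for i, tok in enumerate(vocab) if tok.startswith("[unused")]

for slot, word in zip(unused_slots, NEW_WORDS):
    vocab[slot] = word

with open("vocab_domain.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```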
Similar to #9, I'm trying to handle words that are not already in vocab.txt, e.g., "ohm", "farad", etc., but these words are not compositions of the wordpieces in vocab.txt. Should they be manually added to vocab.txt, or should I follow the advice @jacobdevlin-google gave, which is to use the existing vocab.txt and fine-tune the model on the in-domain text? Since these words aren't compositions, how can they be learnt?
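As a quick sanity check, something along these lines could compare how the stock vocab and a domain-augmented vocab tokenize these terms (a sketch only; it assumes the repo's tokenization.py is importable and that vocab_domain.txt is a vocab with the domain words swapped into [unusedX] slots, as in the snippet above):

```python
# Sketch: compare stock vs. domain-augmented tokenization. Paths are placeholders.
import tokenization  # the tokenization.py shipped in this repo

stock = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
domain = tokenization.FullTokenizer(vocab_file="vocab_domain.txt", do_lower_case=True)

for term in ["ohm", "farad"]:
    print(term, "stock  ->", stock.tokenize(term))   # expected: multiple pieces, since these aren't whole-word entries
    print(term, "domain ->", domain.tokenize(term))  # expected: a single whole-word token
```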