Handling domain specific vocabulary #237

Closed
hoonkai opened this issue Dec 6, 2018 · 3 comments

hoonkai commented Dec 6, 2018

Similar to #9, I'm trying to handle words that are not already in vocab.txt, e.g. "ohm", "farad", etc., but these words are not compositions of the wordpieces in vocab.txt. Should they be manually added to vocab.txt, or should I follow the advice @jacobdevlin-google gave, which is to use the existing vocab.txt and fine-tune the model on the in-domain text? As these words aren't compositions, how can they be learnt?
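
For context, a quick way to see what the existing vocab does with such terms is to run them through the repo's WordPiece tokenizer. A minimal sketch, assuming tokenization.py from google-research/bert is importable and the released uncased vocab is used (the paths and word list are placeholders; the exact splits depend on your vocab file):

```python
# Minimal sketch: check how out-of-vocab domain terms are split by the
# existing WordPiece vocab. Paths and the word list are placeholders.
import tokenization  # tokenization.py from the google-research/bert repo

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

for word in ["ohm", "farad", "capacitance"]:
    # Prints the wordpieces (or [UNK]) that the existing vocab falls back to.
    print(word, "->", tokenizer.tokenize(word))
```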

hoonkai closed this as completed Dec 6, 2018

bradfox2 commented Jan 4, 2019

@hoonkai,

Any progress on this issue? Based on the reading I've done, I think there is no way to add additional terms to the vocab.txt file, and fine-tuning will only help relate the existing vocabulary as it is used in your domain.

Never mind, see issue #9.

harmanpreet93 commented Mar 29, 2019

@hoonkai @bradfox2 Can you please tell me how the experiments went with the solutions proposed in issue #9 by @jacobdevlin-google?

I want to fine-tune the model for a domain-specific problem. I used the existing WordPiece vocab and ran pre-training for 50,000 steps on the in-domain text to learn the compositionality. By manual evaluation I found that the updated embeddings had improved. But I would really like to add my domain words to the vocab, as they carry importance for my downstream tasks.
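
One rough way to pick which domain words are worth dedicated vocab slots is to scan the in-domain corpus and flag frequent words that the existing vocab maps to [UNK] or fragments into many pieces. A hedged sketch of that idea (corpus.txt, the vocab path, and the thresholds are arbitrary placeholder choices, not anything defined by this repo):

```python
# Rough heuristic sketch: list frequent in-domain words that the existing
# WordPiece vocab handles poorly (mapped to [UNK] or split into 3+ pieces).
# "corpus.txt", the vocab path, and the thresholds are placeholders.
import collections
import tokenization  # tokenization.py from the google-research/bert repo

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

counts = collections.Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.lower().split())

for word, freq in counts.most_common(5000):
    pieces = tokenizer.tokenize(word)
    if pieces == ["[UNK]"] or len(pieces) >= 3:  # crude fragmentation check
        print(word, freq, pieces)
```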

jaymody commented Jul 9, 2019

As mentioned in this comment from issue #9

#9 (comment)

(a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.

Keep in mind this doesn't make the model any better at contextualizing the words you've replaced, since all you are doing is assigning a randomly initialized [unusedX] embedding to a word. In order to get the model to learn better embeddings for those words, you would have to pre-train BERT further using data that contains the words you added to the vocab.
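
A minimal sketch of option (a), i.e. overwriting some "[unusedX]" entries in a copy of vocab.txt with your domain terms. The vocab size, and therefore the shape of the checkpoint's embedding table, stays the same, so the released checkpoint still loads. The file names and word list below are placeholders:

```python
# Sketch of option (a): replace "[unusedX]" slots in a copy of vocab.txt
# with domain terms. Vocab size (and the embedding table shape) is unchanged.
# File names and the domain word list are placeholders.
domain_words = ["ohm", "farad"]

with open("vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

replacements = iter(domain_words)
for i, token in enumerate(vocab):
    if token.startswith("[unused"):
        try:
            vocab[i] = next(replacements)
        except StopIteration:
            break  # all domain words have been assigned a slot

with open("vocab_domain.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```

As noted above, those slots are effectively randomly initialized, so you would still continue pre-training on text that contains the new words (e.g. run_pretraining.py initialized from the released checkpoint) before their embeddings carry any meaning.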
