Handling domain specific vocabulary #237

Closed
hoonkai opened this issue Dec 6, 2018 · 3 comments

hoonkai commented Dec 6, 2018

Similar to #9, I'm trying to handle words that are not already in vocab.txt, e.g. "ohm", "farad", etc., but these words are not compositions of the wordpieces in vocab.txt. Should they be manually added to vocab.txt, or should I follow the advice @jacobdevlin-google gave, which is to use the existing vocab.txt and fine-tune the model on the in-domain text? As these words aren't compositions, how can they be learnt?
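
For context, a quick way to see what the existing vocab does with such terms is to run them through the repo's WordPiece tokenizer. A minimal sketch, assuming tokenization.py from google-research/bert is importable and the released uncased vocab is used (the paths and word list are placeholders; the exact splits depend on your vocab file):

```python
# Minimal sketch: check how out-of-vocab domain terms are split by the
# existing WordPiece vocab. Paths and the word list are placeholders.
import tokenization  # tokenization.py from the google-research/bert repo

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

for word in ["ohm", "farad", "capacitance"]:
    # Prints the wordpieces (or [UNK]) that the existing vocab falls back to.
    print(word, "->", tokenizer.tokenize(word))
```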

hoonkai closed this as completed Dec 6, 2018

bradfox2 commented Jan 4, 2019

@hoonkai,

Any progress on this issue? Based on the reading I've done, I think there is no way to add additional terms to the vocab.txt file, and fine-tuning will only help relate the existing vocabulary as it is used in your domain.

Never mind, see issue #9.

harmanpreet93 commented Mar 29, 2019

@hoonkai @bradfox2 Can you please tell me how the experiments went with the solutions proposed in issue #9 by @jacobdevlin-google?

I want to fine-tune the model for a domain-specific problem. I used the existing WordPiece vocab and ran pre-training for 50,000 steps on the in-domain text to learn the compositionality. By manual evaluation I found that the updated embeddings had improved. But I would really like to add my domain words to the vocab, as they carry importance for my downstream tasks.
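
One rough way to pick which domain words are worth dedicated vocab slots is to scan the in-domain corpus and flag frequent words that the existing vocab maps to [UNK] or fragments into many pieces. A hedged sketch of that idea (corpus.txt, the vocab path, and the thresholds are arbitrary placeholder choices, not anything defined by this repo):

```python
# Rough heuristic sketch: list frequent in-domain words that the existing
# WordPiece vocab handles poorly (mapped to [UNK] or split into 3+ pieces).
# "corpus.txt", the vocab path, and the thresholds are placeholders.
import collections
import tokenization  # tokenization.py from the google-research/bert repo

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

counts = collections.Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.lower().split())

for word, freq in counts.most_common(5000):
    pieces = tokenizer.tokenize(word)
    if pieces == ["[UNK]"] or len(pieces) >= 3:  # crude fragmentation check
        print(word, freq, pieces)
```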

jaymody commented Jul 9, 2019

As mentioned in this comment from issue #9

#9 (comment)

(a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.

Keep in mind this doesn't make the model any better at contextualizing the words you've replaced, since all you are doing is assigning a randomly initialized [unusedX] embedding to a word. In order to get the model to learn better embeddings for those words, you would have to pre-train BERT further using data that contains the words you added to the vocab.
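
A minimal sketch of option (a), i.e. overwriting some "[unusedX]" entries in a copy of vocab.txt with your domain terms. The vocab size, and therefore the shape of the checkpoint's embedding table, stays the same, so the released checkpoint still loads. The file names and word list below are placeholders:

```python
# Sketch of option (a): replace "[unusedX]" slots in a copy of vocab.txt
# with domain terms. Vocab size (and the embedding table shape) is unchanged.
# File names and the domain word list are placeholders.
domain_words = ["ohm", "farad"]

with open("vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

replacements = iter(domain_words)
for i, token in enumerate(vocab):
    if token.startswith("[unused"):
        try:
            vocab[i] = next(replacements)
        except StopIteration:
            break  # all domain words have been assigned a slot

with open("vocab_domain.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```

As noted above, those slots are effectively randomly initialized, so you would still continue pre-training on text that contains the new words (e.g. run_pretraining.py initialized from the released checkpoint) before their embeddings carry any meaning.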
