
Adding domain specific vocabulary #9

Closed

artemisart opened this issue Oct 31, 2018 · 2 comments

Comments

@artemisart
Contributor

Hi, thanks for the release!

I need to add some domain-specific vocabulary; do you have any suggestions on how to do it?
I was thinking of replacing some [unused#] tokens in the vocab file (so, if I'm not mistaken, they already have existing weights in the checkpoint model files) to avoid extending the matrices, and then fine-tuning the LM on a domain-specific corpus.
If it's feasible, I would also try a first LM fine-tuning pass with the existing vocab embeddings frozen, to learn only the new words, and then a second pass with everything unfrozen.

Do you think this is the right way to do it?
How can I freeze a subset of the embeddings? (Gradient masking?)
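One possible way to do the gradient masking: a minimal sketch (not from this repo), with illustrative sizes and assuming, for simplicity, that the trainable rows sit at the end of the table. The forward value of the table is unchanged, but gradients only reach the rows selected by the mask.

```python
import tensorflow as tf

# Illustrative sizes: freeze the first `num_frozen` rows (the original vocab),
# train only the rows used for the new domain-specific tokens.
vocab_size, hidden_size, num_frozen = 30522, 768, 29522

embedding_table = tf.Variable(
    tf.random.truncated_normal([vocab_size, hidden_size], stddev=0.02),
    name="word_embeddings")

# 0.0 for frozen rows, 1.0 for trainable rows.
row_mask = tf.concat(
    [tf.zeros([num_frozen, 1]), tf.ones([vocab_size - num_frozen, 1])], axis=0)

# Forward pass is identical to using `embedding_table` directly, but the
# stop_gradient branch blocks updates to the masked-out (frozen) rows.
effective_table = (tf.stop_gradient(embedding_table) * (1.0 - row_mask)
                   + embedding_table * row_mask)

# Use `effective_table` (instead of `embedding_table`) in the embedding lookup.
```

If the trainable rows are scattered (e.g. the [unused#] slots), the same trick works with a mask that has 1.0 at exactly those indices.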

@jacobdevlin-google
Contributor

My recommendation would be to just use the existing wordpiece vocab and run pre-training for more steps on the in-domain text, and it should learn the compositionality "for free". Keep in mind that with a wordpiece vocabulary there are basically no out-of-vocabulary words, and you don't really know which words were seen in the pre-training data and which were not. Just because a word was split up into word pieces doesn't mean it's rare; in fact, many words which were split into wordpieces were seen 5,000+ times in the pre-training data.
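To see that compositionality in action, the repo's own tokenization module can be used to inspect how an unseen domain word gets split (a quick sketch; the vocab path, example word, and printed split are illustrative only):

```python
import tokenization  # tokenization.py from this repo

# Point vocab_file at the vocab.txt of a released BERT checkpoint.
tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

# An unseen word is not "out of vocabulary": it is decomposed into wordpieces
# that are in the vocab, e.g. something like ['im', '##mun', '##ost', ...].
print(tokenizer.tokenize("immunostaining"))
```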

But if you want to add more vocab you can either:
(a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used, they are effectively randomly initialized.
(b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
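A rough sketch of option (b), assuming a TF1-style, name-based BERT checkpoint; the paths, the number of new tokens, and the handling of other vocab-sized variables are placeholders:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

old_ckpt = "uncased_L-12_H-768_A-12/bert_model.ckpt"           # placeholder path
new_ckpt = "uncased_L-12_H-768_A-12/bert_model_extended.ckpt"  # placeholder path
num_new_tokens = 500                                           # placeholder count

new_vars = []
reader = tf.train.load_checkpoint(old_ckpt)
for name, _ in tf.train.list_variables(old_ckpt):
    tensor = reader.get_tensor(name)
    if name == "bert/embeddings/word_embeddings":
        # Randomly initialize the appended rows the same way the original
        # embeddings were initialized (truncated normal, stddev=0.02).
        extra = tf.truncated_normal(
            [num_new_tokens, tensor.shape[1]], stddev=0.02)
        tensor = tf.concat([tensor, extra], axis=0)
    # Note: other vocab-sized variables (e.g. cls/predictions/output_bias)
    # would need the same treatment if you keep pre-training with the MLM head.
    new_vars.append(tf.Variable(tensor, name=name))

saver = tf.train.Saver(new_vars)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, new_ckpt)
```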

@kumarme072

@jacobdevlin-google
I have come across many similar issues asking how to add new tokens to a vocabulary. For reference, here are a few links to useful comments on doing roughly that:

huggingface/transformers#1413 (comment)
huggingface/transformers#2691 (comment)
huggingface/tokenizers#627 (comment)
However, I am unsure how to first identify which tokens make sense to add to an existing tokenizer's vocabulary, and also whether it makes sense to consider removing tokens from a vocabulary.

Some context on my situation:

My situation, I believe, is quite typical: I have a reasonably large domain-specific text dataset available, all in English, and my end goal is to produce a domain-specific language model that can be used for various downstream NLP tasks (e.g., text classification, sentence similarity, etc.) after additional fine-tuning on those tasks.

But from my current understanding, to first obtain that domain-specific language model, I basically have two options:

1. Train a tokenizer from scratch and then use that tokenizer to train an LM from scratch.
2. Modify the vocabulary of a pretrained tokenizer, adjust the (also pretrained) LM's embedding matrix to work with the new vocab size, and fine-tune the pretrained LM on my domain-specific text dataset with an objective like MLM.
I am struggling with the first option because (as far as I know) training a language model from scratch is quite expensive, and although I do have some budget for this, I do not have on the order of thousands of dollars.

I am starting to explore the second option, but I am confused about how to properly modify the vocabulary in a way that makes sense for the model, and concerned about what other side effects this could cause.
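For the mechanical part of option 2, roughly what the linked comments describe, the Hugging Face API lets you add tokens and resize the embedding matrix; which tokens are worth adding (and whether removal is worthwhile) remains the harder, open question. A minimal sketch with placeholder domain terms:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder domain terms; in practice these would come from an analysis of
# the domain corpus (e.g. frequent words the tokenizer over-fragments).
new_tokens = ["immunostaining", "cholangiocarcinoma"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix (and the tied MLM output layer) to match the new
# vocab size; the added rows are freshly initialized and need MLM fine-tuning
# on the domain corpus to become useful.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```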

To summarize:

I'd really like to know whether there is a low-cost way to train an LM from scratch (option 1 above).
Or, if option 2 makes more sense, how to properly modify a vocabulary (find good new tokens, remove unused ones, etc.) and how to adapt the model to overcome potential negative side effects of messing with the embeddings.
Thanks for the help. Sorry for the long question, but I thought some context might be needed since I might be asking the wrong question in the first place. Cheers. (Someone already asked this, but there was no satisfactory answer.)
