
Adding domain specific vocabulary #9

Closed

artemisart opened this issue Oct 31, 2018 · 2 comments

Comments

@artemisart
Contributor

Hi, thanks for the release!

I need to add some domain-specific vocabulary; do you have any suggestions on how to do it?
I was thinking of replacing some [unused#] tokens in the vocab file (so, if I'm not mistaken, they already have existing weights in the checkpoint model files) to avoid extending the matrices, and then fine-tuning the LM on a domain-specific corpus.
If it's feasible, I would also try a first LM fine-tuning pass with the existing vocab embeddings frozen, to learn only the new words, and then a second pass with everything unfrozen.

Do you think this is the right way to do it?
How can I freeze a subset of the embeddings? (Gradient masking?)
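One possible way to do the gradient masking: a minimal sketch (not from this repo), with illustrative sizes and assuming, for simplicity, that the trainable rows sit at the end of the table. The forward value of the table is unchanged, but gradients only reach the rows selected by the mask.

```python
import tensorflow as tf

# Illustrative sizes: freeze the first `num_frozen` rows (the original vocab),
# train only the rows used for the new domain-specific tokens.
vocab_size, hidden_size, num_frozen = 30522, 768, 29522

embedding_table = tf.Variable(
    tf.random.truncated_normal([vocab_size, hidden_size], stddev=0.02),
    name="word_embeddings")

# 0.0 for frozen rows, 1.0 for trainable rows.
row_mask = tf.concat(
    [tf.zeros([num_frozen, 1]), tf.ones([vocab_size - num_frozen, 1])], axis=0)

# Forward pass is identical to using `embedding_table` directly, but the
# stop_gradient branch blocks updates to the masked-out (frozen) rows.
effective_table = (tf.stop_gradient(embedding_table) * (1.0 - row_mask)
                   + embedding_table * row_mask)

# Use `effective_table` (instead of `embedding_table`) in the embedding lookup.
```

If the trainable rows are scattered (e.g. the [unused#] slots), the same trick works with a mask that has 1.0 at exactly those indices.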

@jacobdevlin-google
Contributor

My recommendation would be to just use the existing wordpiece vocab and run pre-training for more steps on the in-domain text, and it should learn the compositionality "for free". Keep in mind that with a wordpiece vocabulary there are basically no out-of-vocabulary words, and you don't really know which words were seen in the pre-training data and which were not. Just because a word was split up into word pieces doesn't mean it's rare; in fact, many words which were split into wordpieces were seen 5,000+ times in the pre-training data.
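To see that compositionality in action, the repo's own tokenization module can be used to inspect how an unseen domain word gets split (a quick sketch; the vocab path, example word, and printed split are illustrative only):

```python
import tokenization  # tokenization.py from this repo

# Point vocab_file at the vocab.txt of a released BERT checkpoint.
tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

# An unseen word is not "out of vocabulary": it is decomposed into wordpieces
# that are in the vocab, e.g. something like ['im', '##mun', '##ost', ...].
print(tokenizer.tokenize("immunostaining"))
```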

But if you want to add more vocab you can either:
(a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used, they are effectively randomly initialized.
(b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialization we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
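A rough sketch of option (b), assuming a TF1-style, name-based BERT checkpoint; the paths, the number of new tokens, and the handling of other vocab-sized variables are placeholders:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

old_ckpt = "uncased_L-12_H-768_A-12/bert_model.ckpt"           # placeholder path
new_ckpt = "uncased_L-12_H-768_A-12/bert_model_extended.ckpt"  # placeholder path
num_new_tokens = 500                                           # placeholder count

new_vars = []
reader = tf.train.load_checkpoint(old_ckpt)
for name, _ in tf.train.list_variables(old_ckpt):
    tensor = reader.get_tensor(name)
    if name == "bert/embeddings/word_embeddings":
        # Randomly initialize the appended rows the same way the original
        # embeddings were initialized (truncated normal, stddev=0.02).
        extra = tf.truncated_normal(
            [num_new_tokens, tensor.shape[1]], stddev=0.02)
        tensor = tf.concat([tensor, extra], axis=0)
    # Note: other vocab-sized variables (e.g. cls/predictions/output_bias)
    # would need the same treatment if you keep pre-training with the MLM head.
    new_vars.append(tf.Variable(tensor, name=name))

saver = tf.train.Saver(new_vars)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, new_ckpt)
```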

@kumarme072

@jacobdevlin-google
I have come across many similar issues asking how to add new tokens to a vocabulary. For reference, here are a few links to useful comments on doing roughly that:

huggingface/transformers#1413 (comment)
huggingface/transformers#2691 (comment)
huggingface/tokenizers#627 (comment)
However, I am unsure how to first identify which tokens make sense to add to an existing tokenizer's vocabulary, and also whether it makes sense to consider removing tokens from a vocabulary.

Some context on my situation:

My situation, I believe, is quite typical: I have a reasonably large domain-specific text dataset available, all in English, and my end goal is to produce a domain-specific language model that can be used for various downstream NLP tasks (e.g., text classification, sentence similarity, etc.) after additional fine-tuning on those tasks.

But from my current understanding, to first obtain that domain-specific language model, I basically have two options:

1. Train a tokenizer from scratch and then use that tokenizer to train an LM from scratch.
2. Modify the vocabulary of a pretrained tokenizer, adjust the (also pretrained) LM's embedding matrix to work with the new vocab size, and fine-tune the pretrained LM on my domain-specific text dataset with an objective like MLM.
I am struggling with the first option because (as far as I know) training a language model from scratch is quite expensive, and although I do have some budget for this, I do not have on the order of thousands of dollars.

I am starting to explore the second option, but I am confused about how to properly modify the vocabulary in a way that makes sense for the model, and concerned about what other side effects this could cause.
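For the mechanical part of option 2, roughly what the linked comments describe, the Hugging Face API lets you add tokens and resize the embedding matrix; which tokens are worth adding (and whether removal is worthwhile) remains the harder, open question. A minimal sketch with placeholder domain terms:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder domain terms; in practice these would come from an analysis of
# the domain corpus (e.g. frequent words the tokenizer over-fragments).
new_tokens = ["immunostaining", "cholangiocarcinoma"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix (and the tied MLM output layer) to match the new
# vocab size; the added rows are freshly initialized and need MLM fine-tuning
# on the domain corpus to become useful.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```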

To summarize:

I'd really like to know whether there is a low-cost way to train an LM from scratch (option 1 above).
Or, if option 2 makes more sense, how to properly modify a vocabulary (find good new tokens, remove unused ones, etc.) and how to adapt the model to overcome potential negative side effects of messing with the embeddings.
Thanks for the help. Sorry for the long question, but I thought some context might be needed since I might be asking the wrong question in the first place. Cheers. (Someone already asked this, but there was no satisfactory answer.)
