Questions on modifying a vocabulary vs. training a LM from scratch #747
Comments
Hi @brijow, there is no good way to add a token to an existing LM, because you wouldn't know what embedding makes the most sense for it. During fine-tuning the model would probably pick it up, but IMO it will most likely slow down training, and be pretty bad if such new tokens are rare (which I imagine they could very well be). Worth trying if you're willing to put in the effort. Usually the recommended way is to keep the vocabulary as-is and simply fine-tune: relevant tokens will be updated more often, which leads to better overall performance, and rare tokens will still benefit from the pre-training on general language (instead of being random and potentially destroying performance).
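For illustration, the "worth trying if you're willing to put in the effort" route usually looks something like the sketch below with the standard transformers API; the model name and the added token strings are placeholders, not anything from this thread.

```python
# A minimal sketch of adding new tokens to an existing model (illustrative only).
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical domain-specific tokens; add_tokens only adds ones not already in the vocab.
num_added = tokenizer.add_tokens(["myeloablative", "immunophenotyping"])
print(f"added {num_added} tokens")

# The embedding matrix must grow to match the new vocab size; the new rows are
# randomly initialized, which is why rare new tokens can hurt until fine-tuning
# has seen them often enough.
model.resize_token_embeddings(len(tokenizer))
```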
Thanks @Narsil. I'm just a little unclear about a couple of things you mention:
Thanks!
Well, if you remove the token "abc" from the vocab but keep around the merge "a", "bc", you're likely to encounter issues (I am not sure, but it's definitely not intended by the library). Yes, exactly.
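To make the inconsistency concrete, here is a small check one could run (illustrative only, not from the thread): in a BPE tokenizer.json, every merge "a b" implies the merged token "ab" should exist in the vocab, so removing a token while keeping its merge leaves the file in the broken state described above. The file path and the merges format are assumptions (older files store merges as "a b" strings, newer ones as two-element lists).

```python
# Consistency check between BPE merges and vocab entries in tokenizer.json.
import json

with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]             # token -> id
merges = tok["model"].get("merges", [])   # empty for WordPiece-style models

for merge in merges:
    # Handle both "a b" strings and ["a", "b"] pairs.
    left, right = merge.split(" ") if isinstance(merge, str) else merge
    if left + right not in vocab:
        print(f"merge {merge!r} produces {left + right!r}, which is missing from the vocab")
```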
Thank you, makes sense!
I am in the same boat: why can't we repurpose unused tokens in the vocab, such as [UNK]? Is there a way to replace '[UNK]' instead of extending the vocab file?
You should be able to manually update the content of the tokenizer.json to make the id corresponding to [UNK] (or any unused token) point to the new token string you want instead.
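A rough sketch of what such a manual edit could look like, assuming a WordPiece-style tokenizer.json; the token names "[unused17]" and "my_domain_token" are placeholders. The idea is to swap the string of an existing entry while keeping its id, so the embedding matrix does not need to be resized.

```python
# Repurpose an unused vocab entry by editing tokenizer.json in place (illustrative only).
import json

path = "tokenizer.json"
with open(path, encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]           # token -> id mapping
token_id = vocab.pop("[unused17]")      # keep the id, drop the old string
vocab["my_domain_token"] = token_id     # reuse that id for the new token

with open(path, "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)
```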
I have come across many similar issues asking how to add new tokens to a vocabulary. For reference, here are a couple of links to useful comments made for doing roughly that:
However, I am concerned with how to first identify tokens that make sense to add to an existing tokenizer's vocabulary, and also whether it makes sense to consider removing tokens from a vocabulary.
Some context on my situation:
My situation, I believe, is quite typical: I have a reasonably large domain-specific text dataset available, all in English, and my end goal is to produce a domain-specific language model that can be used for various downstream NLP tasks (e.g., text classification, sentence similarity, etc.) after additional fine-tuning on those tasks.
But from my current understanding, to first obtain that domain-specific language model, I basically have two options:
I am struggling with the first option because (as far as I know) training a language model from scratch is quite expensive, and although I do have some budget for this, I do not have on the order of thousands of dollars.
I am starting to explore the second option, but I am confused about how to properly modify the vocabulary in a way that makes sense for the model, and concerned about what other side effects this could cause.
To summarize:
Thanks for the help. Sorry for the long question, but I thought some context might be needed since I might be asking the wrong question in the first place. Cheers.