
Questions on modifying a vocabulary vs. training a LM from scratch #747

Closed
brijow opened this issue Jun 29, 2021 · 7 comments
Comments

brijow commented Jun 29, 2021

I have come across many similar issues asking how to add new tokens to a vocabulary; for reference, here are a couple of links to useful comments on doing roughly that:

However, I am more concerned with how to first identify tokens that make sense to add to an existing tokenizer's vocabulary, and also with whether or not it makes sense to consider removing tokens from a vocabulary.

Some context into my situation:

My situation, I believe, is quite typical: I have a reasonably large domain-specific text dataset available, all in English, and my end goal is to produce a domain-specific language model that can be used for various downstream NLP tasks (e.g., text classification, sentence similarity, etc.) after additional fine-tuning on those tasks.

But from my current understanding, to first obtain that domain-specific language model, I basically have two options:

  1. train a tokenizer from scratch and then use that tokenizer to train a LM from scratch.
  2. modify the vocabulary of a pretrained tokenizer, adjust the embedding matrix of the (also pretrained) LM to match the new vocab size, and fine-tune the pretrained LM on my domain-specific text dataset with something like MLM (a rough sketch of what I mean is just below this list).
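
For concreteness, here is a minimal sketch of what I mean by option 2, using the transformers API (the model name and the new tokens are just placeholders):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Sketch of option 2: extend a pretrained tokenizer's vocabulary and resize
# the model's embedding matrix to match. Names below are placeholders.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

new_tokens = ["angiogenesis", "immunohistochemistry"]  # hypothetical domain terms
tokenizer.add_tokens(new_tokens)

# Appends randomly initialized rows to the embedding matrix for the new tokens.
model.resize_token_embeddings(len(tokenizer))
```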

I am struggling with the first option because (as far as I know) training a language model from scratch is quite expensive, and although I do have some budget for this, I do not have on the order of thousands of dollars.

I am starting to explore the second option, but I am confused about how to properly modify the vocabulary in a way that makes sense for the model, and I am concerned about what other side effects this could cause.

To summarize:

  • I'd really like to know whether there is a low-cost option for training a LM from scratch, to make option 1 above feasible.
  • Or, if option 2 makes more sense, how to properly modify a vocabulary (find good new tokens, remove unused ones, etc.) and adapt the model to overcome potential negative side effects of messing with the embeddings (one idea for finding candidate tokens is sketched right after this list).
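
To make the second bullet more concrete, here is one rough idea I've been considering for surfacing candidate tokens; this is just a sketch, assuming a fast tokenizer and an in-memory list of domain texts, and all names are placeholders:

```python
from transformers import AutoTokenizer

# Placeholder for my domain-specific corpus (an iterable of raw strings).
domain_texts = ["first domain-specific document ...", "second document ..."]

base = AutoTokenizer.from_pretrained("bert-base-uncased")

# Train a throwaway tokenizer of the same type on the domain corpus.
domain_tok = base.train_new_from_iterator(domain_texts, vocab_size=30_000)

# Tokens the domain tokenizer learned that the pretrained vocab lacks are
# candidates to add (after filtering them by frequency / usefulness by hand).
candidates = set(domain_tok.get_vocab()) - set(base.get_vocab())
print(sorted(candidates)[:50])
```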

Thanks for the help. Sorry for a long question but I thought some context may be needed since I might be asking the wrong question in the first place. Cheers.

@brijow brijow changed the title Questions on modifying a vocabulary - finding new tokens, and possibly removing existing tokens? Questions on modifying a vocabulary vs. training a LM from scratch Jun 29, 2021
Narsil (Collaborator) commented Jul 5, 2021

Hi @brijow ,

There is no good way to add an arbitrary token to an existing LM, as you wouldn't know what embedding makes the most sense for it. Fine-tuning would probably pick it up eventually, but IMO it will most likely slow down training and work poorly if such new tokens are rare (which I imagine they could very well be). Worth trying if you're willing to put in the effort.
Removing tokens is just as tricky: ids currently correspond to row indices within the embedding matrix, so removing a row shifts all following ids. It's still doable (by reindexing the whole vocabulary), but I wouldn't even attempt it unless you are removing 75%+ of the vocabulary, as it's unlikely to lead to better performance before that point (the main compute bottleneck is the model itself, and embedding matrices don't require that much RAM for a single language).
Please note that for BPE, for instance, you need to update both the vocab AND the merges; Unigram might be slightly simpler.

Usually the recommended way is to keep the vocabulary as-is and simply fine-tune. Relevant tokens will be updated more often, which leads to better overall performance, and rare tokens will still benefit from the pre-training on general language (instead of starting out random and potentially destroying performance).
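
For reference, a minimal sketch of that recommended route (keep the vocabulary untouched and just continue MLM training on the domain corpus); the model name and the tiny in-memory dataset are placeholders:

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder corpus; in practice this would be the full domain dataset.
texts = ["a domain-specific sentence", "another domain-specific sentence"]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Dynamic masking for MLM; the vocabulary and embedding matrix stay untouched.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-mlm",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```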

brijow (Author) commented Jul 6, 2021

Thanks @Narsil. I'm just a little unclear about a couple of things you mention:

  • When you say "for BPE, you need to update both vocab AND merges", how can I update the merges? I successfully updated the vocabulary (and the embedding matrix dims for the corresponding BERT-based model I am using), but didn't do anything with these merges you're referring to.

  • In your closing sentence, when you say "rare tokens will still benefit from the pre-training on general language (instead of being random and potentially destroying performance)", are you referring to the case of not modifying the vocab at all and simply fine-tuning the pretrained LM on the LM task (e.g. on MLM)?

Thanks!

Narsil (Collaborator) commented Jul 7, 2021

Well, if you remove the token "abc" from the vocab but keep around the merge ("a", "bc"), you're likely to encounter issues (I'm not sure exactly what happens, but it's definitely not intended by the library).

Yes, exactly.
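
To make the vocab/merges coupling concrete, here is a small sketch that just inspects a serialized BPE tokenizer; the tokenizer.json path and the token "abc" are illustrative, and the exact merge serialization can vary between versions:

```python
import json

# Inspect a fast BPE tokenizer that was saved to disk (path is illustrative).
with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]    # token -> id
merges = data["model"]["merges"]  # merge rules, serialized as "a bc" or ["a", "bc"]

token = "abc"  # hypothetical token someone might want to remove
print(token in vocab)

# Any merge that produces this token would become inconsistent if the token
# were dropped from the vocab while the merge rule were left in place.
def merge_result(m):
    left, right = m.split(" ") if isinstance(m, str) else m
    return left + right

print([m for m in merges if merge_result(m) == token])
```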

brijow (Author) commented Jul 7, 2021

Thank you, makes sense!

@brijow brijow closed this as completed Jul 7, 2021
ptheru commented Jul 29, 2021

  1. modify the vocabulary of a pretrained tokenizer, adjust the embedding matrix of the (also pretrained) LM to match the new vocab size, and fine-tune the pretrained LM on my domain-specific text dataset with something like MLM.

I am in the same boat: why can't we repurpose the unused tokens in the vocab, like '[UNK]'? Is there a way to replace '[UNK]' instead of extending the vocab file?
google-research/bert#9 (comment)

kumarme072 commented

@ArthurZucker

ArthurZucker (Collaborator) commented

You should be able to manually update the content of the tokenizer.json to map the id of an [unused] token to your new token. It kind of has to be manual, I'm afraid.
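
As an illustration, a rough sketch of that manual edit, assuming a BERT-style tokenizer.json with an [unused1] entry in its WordPiece vocab; the file paths and the new token are placeholders:

```python
import json

from transformers import PreTrainedTokenizerFast

with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]  # token -> id (WordPiece)

# Hand the id of an [unused] slot to the new domain token. The embedding
# matrix keeps its size, so no resize_token_embeddings() call is needed.
new_token = "mydomainterm"  # hypothetical domain token
vocab[new_token] = vocab.pop("[unused1]")

with open("tokenizer_patched.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

patched = PreTrainedTokenizerFast(tokenizer_file="tokenizer_patched.json")
print(patched.convert_tokens_to_ids(new_token))
```

The reused row still holds the original, effectively untrained [unused1] embedding, so during fine-tuning it behaves much like a freshly added token, just without growing the embedding matrix.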
