Extend tokenizer vocabulary with new words #627
Comments
No, that's not possible; you'll have to add the tokens manually indeed.
Thanks for the reply. Just to clarify, is it a missing feature of the library or a limitation of the tokenization algorithm?
It depends on the specific tokenization algorithm, but the tokenizer doesn't save all the training state that would be needed to pick up the training where it was initially left off.
Most off-the-shelf models have plenty of unused vocabulary entries that you could repurpose (see the sketch below).
If your application needs some unused entries for itself, you must of course leave a sufficient number of them free.
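For instance, a minimal sketch of that idea, assuming a BERT WordPiece vocabulary whose `[unusedN]` placeholder entries are overwritten with new domain words (the paths and word list are just examples):

```python
# Sketch: repurpose BERT's [unusedN] placeholder entries for domain-specific words.
# This keeps the vocabulary size (and thus the model's embedding matrix) unchanged.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("bert-domain")  # writes bert-domain/vocab.txt

domain_words = ["fibroblast", "immunohistochemistry"]  # example new words

with open("bert-domain/vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

# Overwrite placeholder entries in place, one per new word
unused_slots = [i for i, tok in enumerate(vocab) if tok.startswith("[unused")]
for slot, word in zip(unused_slots, domain_words):
    vocab[slot] = word

with open("bert-domain/vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

tokenizer = BertTokenizer.from_pretrained("bert-domain")  # reload with the repurposed entries
print(tokenizer.tokenize("fibroblast culture"))  # "fibroblast" is now a single token
```

Because the token ids already exist, no change to the model's embedding layer is required; the repurposed rows simply start out with the embeddings previously learned for the placeholders.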
Hi @anferico, I don't know if this is what you were looking for, but this could be a possible approach to your problem:
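Roughly, something along these lines; this is a sketch of that kind of approach rather than the exact original snippet, and the domain corpus file and base checkpoint are placeholders:

```python
# Sketch: train a tokenizer on the domain corpus, then add its new tokens
# to the original pre-trained tokenizer's vocabulary.
from tokenizers import BertWordPieceTokenizer
from transformers import AutoModel, AutoTokenizer

# 1. Train a fresh WordPiece tokenizer on the domain-specific text
domain_tokenizer = BertWordPieceTokenizer()
domain_tokenizer.train(files=["domain_corpus.txt"], vocab_size=5000)

# 2. Keep only the tokens the original tokenizer does not already know
original = AutoTokenizer.from_pretrained("bert-base-uncased")
new_tokens = set(domain_tokenizer.get_vocab()) - set(original.get_vocab())

# 3. Extend the original tokenizer's vocabulary with those tokens
original.add_tokens(list(new_tokens))

# 4. Give the new ids embeddings by resizing the model's embedding matrix
model = AutoModel.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(original))
```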
The result would be a tokenizer with your specific domain tokens along with the original tokenizer's vocabulary. Of course, you can just encapsulate this in a function and use it like you do in your pseudo-code. Remember that, for your model to work, you will need to update the embedding layer to match the new, augmented vocabulary (the `resize_token_embeddings` call above). Hope it helps!! :)
I don't think .add_tokens() is implemented. https://github.com/huggingface/transformers/blob/cad61b68396a1a387287a8e2e2fef78a25b79383/src/transformers/tokenization_utils_base.py#L952
You are pointing to a base class, so yes, it's not implemented there.
I think this official tutorial can help with your question: Training a new tokenizer from an old one.
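For reference, a minimal sketch of what that tutorial does with `train_new_from_iterator` (the corpus and vocabulary size here are placeholders, and this only works with fast tokenizers):

```python
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Placeholder corpus: in practice, iterate over your domain-specific texts
corpus = ["first technical document ...", "second technical document ..."]

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# Re-trains the same tokenization pipeline (normalizer, pre-tokenizer, model)
# on the new data, producing a fresh vocabulary of the requested size
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=30000)
new_tokenizer.save_pretrained("domain-tokenizer")
```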
@ArthurZucker @younesbelkada @Narsil @n1t0 I tried to add new vocab to the existing mistral tokenizer vocab using the following code:

```python
import sentencepiece as spm
import transformers

sp = spm.SentencePieceProcessor(model_file='mistral_tok.model')
tokenizer1 = transformers.AutoTokenizer.from_pretrained("mistralai/mistral-7b-v0.1")

vocab = [sp.id_to_piece(idx) for idx in range(sp.get_piece_size())]
new_tokens = set(vocab) - set(tokenizer1.vocab.keys())
tokenizer1.add_tokens(list(new_tokens))
# output: 14756

print("After adding new tokens, length of mistral tokenizer:", len(tokenizer1))
# output: 46756

tel_text = "నేను బాగున్నాను. మీరు ఏలా ఉన్నారు?"  # original Telugu text: "I am fine. How are you?"
mistral_encode_ids = tokenizer1.encode(tel_text)
mistral_decode_text = tokenizer1.decode(mistral_encode_ids, skip_special_tokens=True)
print(mistral_decode_text)
# output: నేనుబాగున్నాను.మీరుఏలాఉన్నారు?  # decoded text with missing spaces
```

To dig further into the problem, I re-initialised the mistral tokenizer from its original checkpoint "mistralai/mistral-7b-v0.1". Then I added 3 manually defined random tokens to the tokenizer using the same `add_tokens` call, and in that case the text was decoded correctly, with the spaces preserved.

Where is the problem? Why is the extended-vocab tokenizer not able to decode properly when using the vocab from a different tokenizer, while it decodes properly when the new tokens are added manually?
Hello @bezir, thanks for your comment. I figured out how to successfully merge two SentencePiece BPE tokenizers without losing tokenization efficiency. Here's the code:
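The gist of that technique, as a sketch (the file names are placeholders, not the exact script): parse both serialized SentencePiece model protos and append the second model's pieces to the first, skipping duplicates.

```python
# Sketch: merge two SentencePiece models by editing the serialized model proto.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

base = sp_pb2.ModelProto()
base.ParseFromString(open("mistral_tok.model", "rb").read())

extra = sp_pb2.ModelProto()
extra.ParseFromString(open("telugu_tok.model", "rb").read())

existing = {p.piece for p in base.pieces}
for p in extra.pieces:
    if p.piece in existing:
        continue  # skip duplicates so no piece is added twice
    new_piece = sp_pb2.ModelProto.SentencePiece()
    new_piece.piece = p.piece
    new_piece.score = 0.0  # score for new pieces, following the linked notebook
    base.pieces.append(new_piece)

with open("merged_tok.model", "wb") as f:
    f.write(base.SerializeToString())

merged = spm.SentencePieceProcessor(model_file="merged_tok.model")
```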
I adapted this code from this source: https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb. However, I have a new problem now. I trained a byte-level tiktoken tokenizer using the code from this repo: https://github.com/gautierdag/tokenizer-bench.
The new tokenizer's training corpus is completely different from that of the pre-trained one. Also, I made sure to remove any duplicate merges. Still, the encoding performance is poor.
A few things here. The extra spaces being removed comes down to the `normalized` flag of the added tokens:

```python
from transformers import AddedToken, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/mistral-7b-v0.1")
tokenizer.add_tokens(AddedToken("<bbb>", normalized=True), True)
tokenizer.decode(tokenizer.encode(". <bbb>"))
# '<s> .<bbb>'
```

vs

```python
from transformers import AddedToken, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/mistral-7b-v0.1")
tokenizer.add_tokens(AddedToken("<bbb>", normalized=False), True)
tokenizer.decode(tokenizer.encode(". <bbb>"))
# '<s> . <bbb>'
```
try setting the
Suppose I have a pre-trained tokenizer, e.g. a `BertWordPieceTokenizer`, with its own vocabulary. My goal is to use it to tokenize some technical text which will likely contain unknown words (represented as "[UNK]" tokens). Is there a way to fine-tune the tokenizer so that unknown words are automatically added to its vocabulary?

I have found similar issues in the `transformers` repository (transformers/issues/2691 and transformers/issues/1413), but what they suggest is to manually add unknown tokens, whereas I would like them to be added automatically. Here's a pseudo-code representation of what I would need:
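Something along these lines, where `fine_tune_on` is a hypothetical method name that does not exist in the library:

```python
# Hypothetical pseudo-code: `fine_tune_on` and automatic vocabulary extension
# do not exist in huggingface/tokenizers; this only illustrates the desired behaviour.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt")

# Desired: scan the technical corpus and automatically add words that would
# otherwise be tokenized as "[UNK]" to the tokenizer's vocabulary.
tokenizer.fine_tune_on(files=["my_technical_corpus.txt"])

# Afterwards, previously unknown domain words should no longer map to "[UNK]".
output = tokenizer.encode("some highly technical sentence")
assert "[UNK]" not in output.tokens
```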
Can I do that with `huggingface/tokenizers` and/or `huggingface/transformers`? I thought it would be an easy thing to do, but I wasn't able to find anything useful.