How to use my own additional vocabulary dictionary? #396
Comments
Hi! I think the information you are looking for is in the readme file: https://github.com/google-research/bert#learning-a-new-wordpiece-vocabulary |
Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add. |
I noticed that there are a LOT of single foreign characters in the vocab.txt file. I'm wondering whether one could also remove these and replace them with words for fine tuning? |
@bsugerman It seems you really can't avoid all the foreign characters when you use a very large corpus to train the models, and you can't replace them (though you could delete them somehow) until you train the model all over again, because the parameters in the embedding layer are part of the pre-trained model. |
I have a domain-specific (medical) English corpus that I want to do some additional pre-training on from the BERT checkpoint. However, there are quite a lot of words in the medical vocabulary that are not present in the vocab.txt file. Let's say I want to add the top 500 words in the corpus not already in the vocabulary. Is this as easy as just replacing the [unused#] entries in vocab.txt? No additional changes to bert_config.json? |
I have a similar question as above, @peregilk: how do you add domain-specific vocabulary in any language other than English? The official repo says: "This repository does not include code for learning a new WordPiece vocabulary. There are a number of open source options available. However, keep in mind that these are not compatible with our tokenization.py library." |
@samreenkazi I ended up using spaCy to make a list of all the words in a portion of the corpus. There are easy built-in functions for listing, for instance, the 10,000 most common words in the text. I then checked this against the BERT vocab file and ended up adding roughly 400 words in the empty spots in the vocab file. I did a few tests, and on my very specific medical language it seemed to have a good effect. However, I noticed that it needs quite a lot of pre-training to outperform the standard vocabulary. I trained for a couple of days on a 2080Ti until it was better (logical, since the weights for the new vocab are initialised from scratch). I am not sure if this answers your question about the Urdu corpus. However, if you would like to have a look at the script I used for building the vocab file, just send me a message. |
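Editor's note: for readers asking about the vocab-building script below, here is a minimal sketch of the general approach described in this comment (counting frequent corpus words and writing them into the [unusedN] slots). It is not the author's actual script; the file paths, the plain regex tokenizer (the author used spaCy), and the cutoff of 400 words are assumptions.

```python
# Sketch: fill the [unusedN] slots in vocab.txt with frequent domain words.
import re
from collections import Counter

corpus_path = "domain_corpus.txt"   # hypothetical domain corpus, plain text
vocab_path = "vocab.txt"            # vocab shipped with the BERT checkpoint
out_path = "vocab_custom.txt"

with open(vocab_path, encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]
known = set(vocab)

# Count word frequencies in the domain corpus (the author used spaCy here).
counts = Counter()
with open(corpus_path, encoding="utf-8") as f:
    for line in f:
        counts.update(re.findall(r"[a-z]+", line.lower()))

# Most common corpus words that are not already whole tokens in the vocab.
new_words = [w for w, _ in counts.most_common(10000) if w not in known][:400]

# Overwrite only the [unusedN] placeholder lines, keeping every other
# line number (and therefore every pretrained embedding row) fixed.
it = iter(new_words)
for i, tok in enumerate(vocab):
    if tok.startswith("[unused"):
        try:
            vocab[i] = next(it)
        except StopIteration:
            break

with open(out_path, "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```

The key constraint, echoed later in the thread, is that the file keeps the same number of lines and only the [unusedN] lines change.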
Yes I would like to have a look at script for building the vocab |
please contact me on "per at capia dot no". I'll send you the code. |
@peregilk Are you able to push that code up to a repo and link back here? It would be useful for many. |
@bradfox2 , @peregilk |
or perhaps post the code on https://gist.github.com/ - it's free of cost |
Hi, I would like to confirm the idea for adding an unseen word. Suppose I have a new word "xyzw". To include this word, the easiest approach is to replace [unused1] with "xyzw" in vocab.txt. Then I will need to run fine-tuning on my specialized data so that the word vector for "xyzw" can be learned. Is this the correct idea? |
That is also available in the BERT repo. The question is more around the use of some already developed, easy to use vocab comparison scripts. |
I also need to add a few thousand new tokens that don't exist in the BERT vocab file. Shouldn't we just modify the tokens that aren't likely to exist in the corpus we use for fine-tuning? |
@irhallac It is the [unusedXXX] tokens that can be replaced with any word you like. I am running some experiments on how effective this really is, but from my understanding you should prioritise words that are frequent in the domain-specific corpus, are not in vocab.txt, and for which the current tokenization is unlikely to be any good. You should also take into account that BERT is very good at tokenizing long words.
Let's say you have an English football-specific corpus. You notice that the word "footballs" is not in vocab.txt. It is, however, meaningless to add it. BERT tokenizes this as "football" + "##s" -> [2375] [2016] (look at the line numbers in vocab.txt), and has already learned a very good representation both for the individual tokens and the combination. But your text is a lot about football stadiums, and you see that "Anfield" is not in vocab.txt. This will be tokenized as "an" + "##field", and there is reason to believe that the current learned embedding is not very useful. If you add "anfield" to one of the unused spots in vocab.txt and then do pre-training from the last checkpoint, this embedding will just start from random (think "0"), and the model might learn the word faster since it will not be confused by other uses of the tokens [an] and [##field]. This is my understanding of how this works.
It is important to remember that the line numbers in vocab.txt matter. For instance, "!" is at line #1000, and it should still be there after you have edited the file. For pre-training from an existing checkpoint you should not change the size of the vocab file. This means you have around 1000 extra words at your disposal. For pre-training from scratch, build a new vocab based only on your corpus.
Since BERT does an excellent job of tokenizing and learning these combinations, do not expect dramatic improvements from adding words to the vocab. In my experience, adding very specific terms, like common long medical Latin words, has some effect. Adding words like "footballs" will likely just have a negative effect, since the current vector is already pretty good. |
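Editor's note: a quick way to check whether a candidate word is already tokenized sensibly, before spending an [unusedN] slot on it, is to run it through the repo's tokenization.py. This is a small sketch; the vocab path and do_lower_case setting are placeholders, and the printed ids depend on the checkpoint you use.

```python
# Inspect how candidate words are currently split by the WordPiece tokenizer.
import tokenization  # tokenization.py from google-research/bert

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

for word in ["footballs", "anfield"]:
    pieces = tokenizer.tokenize(word)
    ids = tokenizer.convert_tokens_to_ids(pieces)
    print(word, "->", pieces, ids)

# If a word splits into pieces with sensible meanings ("football", "##s"),
# adding it probably buys little. After writing "anfield" into an [unusedN]
# slot, re-running this should show it as a single token with that slot's id.
```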
@peregilk Thank you. In the model I downloaded there are only 100 [unusedXXX] tokens in vocab.txt, not 1000. But you say 1000 can be changed? |
@peregilk Btw, I want to use the BERT model on the Turkish language.
|
OK. I did not know. Then it is only the uncased version that has 1000 unused spots. |
Alternatively, can you also remove non-English words and the rare symbols? Would this significantly affect the model? |
@peregilk Can you please share the code to modify the vocabulary and then pre-train it to adapt to the new vocab? Also, can you tell me the metrics on the basis of which you decided that the weights were better. |
@bradfox2 What are we supposed to do after these changes? How is the model retrained? |
@peregilk |
@irhallac Let me post an update on my experiences with using custom vocab files during pretraining on a domain-specific corpus. As far as I know, the only reasonable way to test if this works is to validate it by also fine-tuning the pretrained networks. You will have to do this multiple times before you get reliable results. My initial experiments indicated that adding custom words to the vocab file had some effect. However, at least on my corpus, which can be described as "medical tweets", this effect just disappears after running the domain-specific pretraining for a while. After spending quite some time on this, I have ended up dropping the custom vocab files totally. BERT seems to be able to learn these specialised words by tokenizing them. |
Fine-tune the model on your specific text corpus. Model weights are tuned during initial pretraining with the tokenized vocabulary, so you need to keep the same token mapped to the same input 'node'. The first ~1000 tokens are meaningless and the model learns to essentially ignore them. Give the meaningless vocab some relevance with a custom dataset, continue fine-tuning, and the model will start to give the previously ignored tokens/vocab some weight (pun intended). |
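Editor's note: since several comments stress that the line numbers of existing tokens must not move, a small sanity check along these lines (my own sketch, not from the thread; file names are placeholders) can catch accidental edits to non-placeholder lines:

```python
# Verify that only [unusedN] placeholder lines were modified, so every
# original token keeps its line number and its pretrained embedding row.
def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

orig = load_vocab("vocab.txt")
new = load_vocab("vocab_custom.txt")

assert len(orig) == len(new), "vocab size must not change for checkpoint reuse"
for i, (a, b) in enumerate(zip(orig, new)):
    if not a.startswith("[unused") and a != b:
        raise ValueError(f"line {i}: '{a}' was replaced by '{b}'")
print("only [unused*] slots were modified")
```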
Hey, I would like to have a look at the script. Can you help? |
I did a few more tests on this (as I mentioned in another post). I am no longer convinced about my own results. The challenge is that fine-tuning has a lot of variance. I think the first positive result was mainly a fluke. Even if it gives a marginal improvement, it also adds more complexity (how many words, which words, etc.). Domain-specific pre-training is essential for getting these models to perform well on specialised domains, but the extra words in the dictionary are just a tiny detail that probably is not worth the effort.
|
@peregilk hi |
Hello, |
@peregilk |
@muhammadfahid51 If I understand things correctly, BERT works at the token level. In addition, it learns multi-token embeddings. Let's say we have the word "goodness", and that it does not exist in the vocabulary, but the tokens "good" and "##ness" do. Since this is one word, it will be tokenized as "good" + "##ness". BERT will learn an embedding for the combination ("good"-"##ness") as well as embeddings for both "good" and "##ness". It is not as good as if "goodness" existed directly, but it is reasonable. However, adding extra words is a bit double-edged. If you have a domain-specific vocabulary and add "goodness" to one of the empty spots, you will have to train it from a random weight. Both "good" and "##ness" already have OK embeddings, so even if the model has never seen "goodness" in the training set before, it already has a reasonable embedding to start from. If you train the entire network from scratch, it makes more sense to build a vocabulary that is as efficient as possible, i.e. requires as few tokens as possible. I hope this answers your question. |
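Editor's note: to make the "goodness" example concrete, here is a toy re-implementation of WordPiece's greedy longest-match-first splitting. The real logic lives in tokenization.py; the three-token vocabulary below is invented purely for illustration.

```python
# Toy WordPiece splitting: repeatedly take the longest prefix found in the vocab.
def wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # non-initial pieces carry the ## prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]              # no piece matched at all
        pieces.append(cur)
        start = end
    return pieces

toy_vocab = {"good", "##ness", "goodness"}
print(wordpiece("goodness", toy_vocab - {"goodness"}))  # ['good', '##ness']
print(wordpiece("goodness", toy_vocab))                 # ['goodness']
```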
@peregilk And also, to pre-train BERT from scratch, how much data is required? |
@muhammadfahid51 Don't interpret any of this as "correct" answers. I am just another researcher struggling with the same issues. You can use SentencePiece to build any vocabulary from scratch. SentencePiece will search your corpus and find the most efficient tokens. If you want to start from pre-trained weights you will have to use the same vocabulary. You can manipulate that vocabulary, but it is really only manipulating the open spots that makes sense. In most cases it will be easier (and cheaper) to start with the multilingual pre-trained BERT and train with additional data in your target language than to train from scratch on a separate language (this is if your language is already part of multilingual BERT - 100 languages). My experience: I would say a reasonable corpus is 1B words. Multilingual BERT is trained with a bit less for each of the languages. The size of the training corpus and the training time are really big issues. Fine-tuning the vocabulary is (IMHO) not really something you should spend too much time on. |
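Editor's note: for building a vocabulary from scratch as mentioned above, a minimal SentencePiece sketch might look like the following. The corpus path, vocab size, and model type are assumptions, and remember the readme's caveat that such vocabularies are not directly compatible with tokenization.py.

```python
# Train a SentencePiece vocabulary on a domain corpus (pip install sentencepiece).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="domain_corpus.txt",   # hypothetical corpus, one sentence per line
    model_prefix="domain_sp",    # writes domain_sp.model and domain_sp.vocab
    vocab_size=32000,
    model_type="unigram",        # "bpe" is another common choice
)

sp = spm.SentencePieceProcessor(model_file="domain_sp.model")
print(sp.encode("some domain specific sentence", out_type=str))
```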
@muhammadfahid51 Take a look at this page: |
I'm in the same boat, I need to add a larger vocabulary (not directly present in the original bert vocab) but I want to also use the init checkpoint from the original bert. When you do that, you quickly run into this: ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((new_vocab_size, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader. On some preliminary investigation it seems like a few code changes are required to handle this case in run_pretraining.py where the bert_config vocab size is different from the init checkpoint file. As long as the order of the initial vocab terms is kept the same in the new vocab, it should be possible to initialize the known token weights from the pretrained weights and randomly initialize the rest of the network. |
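Editor's note: one possible way to do what this comment describes, sketched under the assumption that the new vocab simply appends tokens after the original 30522: read the pretrained embedding matrix from the checkpoint, append randomly initialised rows, and use the result to initialise the enlarged model. This is not code from the repo; the checkpoint path and new vocab size are placeholders.

```python
# Extend the pretrained word_embeddings matrix with rows for new tokens.
import numpy as np
import tensorflow as tf  # TF 1.x, as used by this repo

old_ckpt = "uncased_L-12_H-768_A-12/bert_model.ckpt"
new_vocab_size = 32000                      # hypothetical enlarged vocab size

reader = tf.train.load_checkpoint(old_ckpt)
old_emb = reader.get_tensor("bert/embeddings/word_embeddings")  # [30522, 768]

# New rows start from a random init (stddev 0.02 matches BERT's initializer).
extra = np.random.normal(
    0.0, 0.02,
    (new_vocab_size - old_emb.shape[0], old_emb.shape[1])).astype(old_emb.dtype)
new_emb = np.concatenate([old_emb, extra], axis=0)

# From here one would set vocab_size=new_vocab_size in bert_config.json and
# assign new_emb to bert/embeddings/word_embeddings instead of restoring that
# one variable from the checkpoint.
```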
@peregilk Good afternoon, and thank you so much for your comprehensive responses. I would like to ask you a small question. You say: |
@dhruvsakalley I like your idea and wonder if you managed to implement it. Can you share your experience, please? |
@peregilk |
@dhruvsakalley I am doing exactly the same thing you're trying to implement. May I know whether it is implemented or not? Were you able to increase the vocab file? If yes, can you please share the code. Thanks |
@SravaniSegireddy I implemented it. But it is of no use, as the words are split in two and the meanings change. |
@ali4friends71 Could you please share your code, if possible? Maybe I can get some ideas. Thanks. |
@peregilk Does that mean you can pretrain the model with domain-specific data without needing to change the vocab file? May I know what accuracy improvement you achieved after pretraining with domain-specific data? |
Absolutely. Doing additional domain-specific pretraining is very effective. How effective will depend on your task and corpus. There are lots of examples of its efficacy. Here is just one: https://arxiv.org/pdf/2005.07503
Fundamental changes to the vocab will make it impossible to continue from pretrained weights. Unless you are training a completely new language, and have lots of resources, this is probably just a bad idea. Using the open spots is the only alternative. My experience is that on my corpus it has not been a very big thing. BERT learns new composite words very easily.
|
@SravaniSegireddy You could use the code from the Colab notebook. Check out the article for further instructions. |
But do we really need to manually add domain-specific (out-of-vocab) words? Isn't the purpose of word pieces that they can, in theory, construct new words by combining their pieces? And if so, wouldn't the semantics of the word pieces change when doing downstream tasks? |
If I use new vocabulary, |
@peregilk Thanks, that helps. To learn domain-specific word embeddings, any clues on the volume of domain-specific corpus needed, assuming we continue pretraining from the released model as a checkpoint? |
@nagads I understand your question, and have gotten it several times before. I usually answer it with "I'll tell you, if you first tell me what a boat costs!". It really depends on a lot of things, like how good you would like the model to be, how different the domain is from what the original model was trained on, how much data it is possible to get and at what cost, etc. There are a few tricks (like dynamic masking) that you can use to make the most out of your data. In general, transformer models require A LOT of text. Always. However, domain-specific pre-training is probably the part where you are able to get reasonable results with moderate amounts of data. |
@peregilk thanks for the insightful response. |
In addition to these methods, we can add our own additional vocabulary by adding an embedding tensor for the additional vocabulary and concatenating it with the original embedding tensor for the tokens in the original vocab file. Specific details in #82 (comment) |
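Editor's note: a rough sketch of that concatenation idea in TF 1.x style code (not the actual code referenced in #82): keep the pretrained table restorable under its checkpoint name and append a second, freshly initialised table for the added tokens. Sizes and the extra variable name are assumptions.

```python
# Concatenate a trainable embedding table for new tokens onto the original one.
import tensorflow as tf  # TF 1.x style, matching this repo's modeling.py

orig_vocab_size, extra_vocab_size, width = 30522, 500, 768

with tf.variable_scope("bert"):
    with tf.variable_scope("embeddings"):
        # Restored from the released checkpoint (bert/embeddings/word_embeddings).
        orig_embeddings = tf.get_variable(
            "word_embeddings", [orig_vocab_size, width])

# New table, trained from scratch for the added vocabulary.
extra_embeddings = tf.get_variable(
    "extra_word_embeddings", [extra_vocab_size, width],
    initializer=tf.truncated_normal_initializer(stddev=0.02))

# Ids below orig_vocab_size hit pretrained rows; new ids hit the fresh rows.
full_embeddings = tf.concat([orig_embeddings, extra_embeddings], axis=0)
input_ids = tf.placeholder(tf.int32, shape=[None, None])
embedding_output = tf.gather(full_embeddings, input_ids)
```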
Hello!
We are Korean students.
We would like to implement a Korean slang filtering system using your BERT model.
A test is in progress by fine-tuning the CoLA task on run_classifier.py from the existing multilingual model. However, we feel the lack of a dictionary of existing words and want to use a BERT model with pre-trained weights after adding words to vocab.txt. However, modifications to the vocab.txt and bert_config.json files do not match the shapes stored in the initial bert_model.ckpt.
We'd like to use the pre-trained weights with our own additional vocabulary dictionary. Is there a way to modify this specification? Or are we forced to pre-train from scratch?
Thank you.
cf. we have this error message!
INFO:tensorflow:Error recorded from evaluation_loop: indices[6,33,0] = 105932 is not in [0, 105879)
[[node bert/embeddings/embedding_lookup (defined at /home/user01/code/bert/modeling.py:421) = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bert/embeddings/word_embeddings/read, bert/embeddings/ExpandDims, bert/embeddings/embedding_lookup/axis)]]
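Editor's note: this kind of "indices ... is not in [0, N)" error usually means the tokenizer's vocabulary, bert_config.json's vocab_size, and the checkpoint's embedding rows disagree. A quick consistency check (my own sketch; the paths are placeholders) is shown below.

```python
# Check that vocab.txt, bert_config.json and the checkpoint agree on vocab size.
import json
import tensorflow as tf

with open("vocab.txt", encoding="utf-8") as f:
    n_vocab = sum(1 for _ in f)
with open("bert_config.json", encoding="utf-8") as f:
    n_config = json.load(f)["vocab_size"]
reader = tf.train.load_checkpoint("bert_model.ckpt")
n_ckpt = reader.get_tensor("bert/embeddings/word_embeddings").shape[0]

# Token id 105932 with only 105879 embedding rows means these three disagree.
print(n_vocab, n_config, n_ckpt)
```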