
How to use my own additional vocabulary dictionary? #396

Open
EnteLee opened this issue Jan 24, 2019 · 52 comments

@EnteLee

EnteLee commented Jan 24, 2019

Hello!
We are Korean students.
We would like to implement a Korean slang filtering system using your BERT model.

We are testing by fine-tuning on the CoLA task with run_classifier.py from the existing multilingual model. However, the existing vocabulary is missing many words we need, so we want to use the BERT model with its pre-trained weights after adding words to vocab.txt. Unfortunately, after modifying vocab.txt and bert_config.json, the shapes no longer match those stored in the initial bert_model.ckpt.

We'd like to use the pre-trained weights together with our own additional vocabulary. Is there a way to modify this specification, or are we forced to pre-train from scratch?

Thank you.

For reference, this is the error message we get:
INFO:tensorflow:Error recorded from evaluation_loop: indices[6,33,0] = 105932 is not in [0, 105879)
[[node bert/embeddings/embedding_lookup (defined at /home/user01/code/bert/modeling.py:421) = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bert/embeddings/word_embeddings/read, bert/embeddings/ExpandDims, bert/embeddings/embedding_lookup/axis)]]
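
The error above says the checkpoint's embedding table has 105,879 rows while the modified vocab.txt is producing token ids up to at least 105,932. A quick sanity check (a minimal sketch of my own, not from the original post; the file paths are placeholders) is to compare the line count of vocab.txt against vocab_size in bert_config.json:

```python
import json

vocab_path = "multi_cased_L-12_H-768_A-12/vocab.txt"          # placeholder path
config_path = "multi_cased_L-12_H-768_A-12/bert_config.json"  # placeholder path

with open(vocab_path, encoding="utf-8") as f:
    vocab_lines = sum(1 for _ in f)  # one token per line

with open(config_path, encoding="utf-8") as f:
    config_vocab_size = json.load(f)["vocab_size"]

print(f"vocab.txt tokens: {vocab_lines}, bert_config vocab_size: {config_vocab_size}")
# The checkpoint's word_embeddings table has config_vocab_size rows, so any
# token id >= config_vocab_size triggers the GatherV2 "indices not in [0, N)"
# error shown above.
```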

@rodgzilla
Contributor

Hi!

I think the information you are looking for is in the readme file: https://github.com/google-research/bert#learning-a-new-wordpiece-vocabulary

@bradfox2

bradfox2 commented Feb 2, 2019

Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.
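
For reference, a minimal sketch of what "modify the vocab.txt" means in practice: overwrite some of the [unusedN] placeholder lines with your own tokens, keeping every other line (and the total line count) unchanged. The word list and file names here are placeholders, not from the thread.

```python
# Minimal sketch: fill [unusedN] slots with custom tokens.
# `new_words` and the paths are assumptions; the rest of vocab.txt
# (and its total length) must stay exactly the same.
new_words = ["anfield", "epistaxis", "tachycardia"]  # example tokens

with open("vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

it = iter(new_words)
for i, token in enumerate(vocab):
    if token.startswith("[unused"):
        try:
            vocab[i] = next(it)
        except StopIteration:
            break

with open("vocab_custom.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```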

@bsugerman

I noticed that there are a LOT of single foreign characters in the vocab.txt file. I'm wondering whether one could also remove these and replace them with words for fine tuning?

@yzho0907

@bsugerman it seems you really can't avoid all the foreign characters when you use a very large corpus to train the model, and you can't replace them (you could delete them somehow) until you train the model all over again, because the parameters in the embedding layer are part of the pre-trained model.

@peregilk

peregilk commented Apr 2, 2019

I have a domain specific (medical) English corpus that I want to do some additional pre-training on from the Bert checkpoint. However, there are quite a lot of words in the medical vocabulary not present in the vocab.txt-file.

Let's say I want to add the top 500 words in the corpus that are not already in the vocabulary. Is this as easy as just replacing the [unused#] entries in vocab.txt? No additional changes to bert_config.json?

@samreenkazi

samreenkazi commented Apr 14, 2019

I have a similar question to @peregilk's: how do we add a domain-specific vocab.txt in a language other than English? The official repo says: "This repository does not include code for learning a new WordPiece vocabulary; there are a number of open source options available. However, keep in mind that these are not compatible with our tokenization.py library."
So how do we learn a domain-specific vocabulary? With the available multilingual pre-trained weights, BERT did not perform well on a downstream classification task on an Urdu corpus.
Or can we work around this for a domain-specific vocab by modifying the first ~1000 lines of vocab.txt, as suggested by @bradfox2?

@peregilk

@samreenkazi. I ended up using spaCy to make a list of all the words in a portion of the corpus. There are easy built-in functions for listing, for instance, the 10,000 most common words in the text. I then checked this against the BERT vocab file, and ended up adding roughly 400 words in the empty spots in the vocab file.

I did a few tests, and on my very specific medical language it seemed to have a good effect. However, I noticed that it needs quite a lot of pre-training to outperform the standard vocabulary. I trained a couple of days on a 2080Ti until it was better (logical, since the weights for the new vocab are initialised from scratch).

I am not sure if this answers your question about the Urdu corpus. However, if you would like to have a look at the script I used for building the vocab file, just send me a message.
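
A rough sketch of the approach described above (not @peregilk's actual script; the file paths, the 400-word cut-off and the blank spaCy pipeline are assumptions): count word frequencies in the corpus, drop anything already in vocab.txt, and keep the most frequent leftovers as candidates for the [unusedN] slots.

```python
from collections import Counter

import spacy

# spacy.blank("en") gives a tokenizer-only pipeline; swap in your own language code.
nlp = spacy.blank("en")

# For a large corpus, stream line by line instead of reading the whole file.
with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()

freq = Counter(tok.text.lower() for tok in nlp(text) if tok.is_alpha)

with open("vocab.txt", encoding="utf-8") as f:
    bert_vocab = set(line.strip() for line in f)

# Most frequent corpus words that BERT's vocab does not contain as whole tokens.
candidates = [w for w, _ in freq.most_common(10_000) if w not in bert_vocab][:400]
print(candidates[:20])
```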

@samreenkazi

Yes, I would like to have a look at the script for building the vocab.

@peregilk

peregilk commented Apr 30, 2019 via email

@bradfox2

bradfox2 commented May 2, 2019

@peregilk Are you able to push that code up to a repo and link back here? It would be useful for many.

@Dhanachandra

@bradfox2 , @peregilk
You can use a modified version of Tensor2Tensor/text_encoder_build_subword.py code to generate BERT compatible vocab.
https://github.com/kwonmha/bert-vocab-builder

@techmattersinc

> @peregilk Are you able to push that code up to a repo and link back here? It would be useful for many.

Or perhaps post the code on https://gist.github.com/ – it's free of charge.

@datduong

datduong commented May 4, 2019

Hi, I would like to confirm the idea for adding an unseen word. Suppose I have a new word "xyzw". To include this word, the easiest approach is to replace [unused1] with "xyzw" in vocab.txt. Then I will need to run fine-tuning on my specialized data so that the word vector for "xyzw" can be learned. Is this the correct idea?

@bradfox2

bradfox2 commented May 4, 2019

> @bradfox2 , @peregilk
> You can use a modified version of Tensor2Tensor/text_encoder_build_subword.py code to generate BERT compatible vocab.
> https://github.com/kwonmha/bert-vocab-builder

That is also available in the BERT repo. The question is more about the availability of some already developed, easy-to-use vocab comparison scripts.

@irhallac

irhallac commented Jun 21, 2019

> Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.

I also need to add a few thousand new tokens that don't exist in the BERT vocab file.
When I check the vocab file of the model (multi_cased_L-12_H-768_A-12), the first 100 tokens are "unused" tokens (unused0–unused99), and they are followed by the [UNK], [CLS], [SEP] and [MASK] tokens.
I assume you wouldn't suggest modifying those special tokens, or the numbers and letters that come right after them, even though they are within the first ~1000 lines. Can you help me see what I am missing here?

Shouldn't we just modify the tokens that aren't likely to appear in the corpus we use for fine-tuning?

@peregilk

@irhallac it is the [unusedXXX] tokens that can be replaced with any word you like. I am running some experiments on how effective this really is, but from my understanding you should prioritise words that are frequent in the domain-specific corpus, are not in vocab.txt, and where the current tokenization is unlikely to be any good. You should also take into account that BERT is very good at tokenizing long words.

Let's say you have an English football-specific corpus. You notice that the word "footballs" is not in vocab.txt. It is, however, meaningless to add it. BERT tokenizes this as "football"+"##s" -> [2375] [2016] (look at the line numbers in vocab.txt), and has already learned a very good representation both for the individual tokens and for the combination. However, your text is a lot about football stadiums, and you see that "Anfield" is not in vocab.txt. This will be tokenized as "an"+"##field", and there is reason to believe that the current learned embedding for it is not very useful.

If you add "anfield" to one of the unused spots in vocab.txt and then do pre-training from the last checkpoint, its embedding will just start from random (think "0"), and the model might learn the word faster since it will not be confused by other uses of the tokens [an] and [##field]. This is my understanding of how this works.

It is important to remember that the line numbers in vocab.txt matter. For instance, "!" is at line #1000, and it should still be there after you edit the file. For pre-training from an existing checkpoint you should not change the size of the vocab file. This means you have around 1000 extra words at your disposal. For pre-training from scratch, build a new vocab based only on your corpus.

Since BERT does an excellent job of tokenising and learning these combinations, do not expect dramatic improvements from adding words to the vocab. In my experience, adding very specific terms, like common long medical Latin words, has some effect. Adding words like "footballs" will likely just have a negative effect, since the current vector is already pretty good.
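
To see what the repo's tokenizer actually does with an out-of-vocabulary word, a minimal sketch using tokenization.py from this repo (the vocab path is a placeholder, and the exact subword split and ids depend on which vocab file you use):

```python
import tokenization  # tokenization.py from the google-research/bert repo

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

for word in ["footballs", "anfield"]:
    pieces = tokenizer.tokenize(word)
    ids = tokenizer.convert_tokens_to_ids(pieces)
    print(word, pieces, ids)

# A word that splits into pieces with well-trained embeddings ("football" + "##s")
# usually needs no vocab entry; a name that splits into unrelated pieces
# ("an" + "##field") is a better candidate for an [unusedN] slot.
```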

@irhallac

@peregilk thank you. In the model I downloaded there are only 100 [unusedXXX] tokens in vocab.txt, not 1000. But you say 1000 can be changed?

@peregilk

They are in two blocks: from line #2 [unused0] to line #100 [unused98]. Then there are 4 tokens that absolutely should not be changed: [UNK] [CLS] [SEP] [MASK].

Then they continue from line #105 to line #999. In total, roughly 1000 unused tokens.

@irhallac

irhallac commented Jun 21, 2019

@peregilk btw I want to use the BERT model on Turkish.
I downloaded it from
download_url = 'https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip'
and the vocab file looks like this:

.
[unused97]
[unused98]
[unused99]
[UNK]
[CLS]
[SEP]
[MASK]
<S>
<T>
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
.

@peregilk

OK. I did not know. Then it is only the uncased version that has 1000 unused spots.

@datduong

Alternatively, can you also remove non-English words and the rare symbols? Would this significantly affect the model?

@bhoomit

bhoomit commented Jun 24, 2019

> They are in two blocks: from line #2 [unused0] to line #100 [unused98]. Then there are 4 tokens that absolutely should not be changed: [UNK] [CLS] [SEP] [MASK].
>
> Then they continue from line #105 to line #999. In total, roughly 1000 unused tokens.

Does that mean I can't add more than 1000 words?

@jinamshah

> @samreenkazi. I ended up using spaCy to make a list of all the words in a portion of the corpus. […]

@peregilk can you please share the code to modify the vocabulary and then pre-train so the model adapts to the new vocab? Also, can you tell me the metrics on the basis of which you decided that the new weights were better?

@ivanacorovic

> Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.

@bradfox2 What are we supposed to do after these changes? How is the model retrained?

@Dhanachandra

> @irhallac it is the [unusedXXX] tokens that can be replaced with any word you like. […]

@peregilk
I am also working in the medical domain. Can you please share the common long medical Latin words that you added to the vocab?

@peregilk

@irhallac Let me post an update on my experiences with using custom vocab files during pre-training on a domain-specific corpus. As far as I know, the only reasonable way to test whether this works is to validate it by also fine-tuning the pre-trained networks. You will have to do this multiple times before you get reliable results.

My initial experiments indicated that adding custom words to the vocab file had some effect. However, at least on my corpus, which can be described as "medical tweets", this effect just disappears after running the domain-specific pre-training for a while.

After spending quite some time on this, I have ended up dropping the custom vocab files entirely. BERT seems to be able to learn these specialised words by tokenizing them.

@bradfox2

bradfox2 commented Sep 7, 2019

> Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.
>
> @bradfox2 What are we supposed to do after these changes? How is the model retrained?

Fine-tune the model on your specific text corpus. Model weights are tuned during initial pre-training with the tokenized vocabulary, so you need to keep the same token mapped to the same input 'node'. The first ~1000 tokens are meaningless and the model learns to essentially ignore them. Give the meaningless vocab some relevance with a custom dataset, continue fine-tuning, and the model will start to give the previously ignored tokens/vocab some weight (pun intended).

@mahanswaray

> @samreenkazi. I ended up using spaCy to make a list of all the words in a portion of the corpus. […]

Hey, I would like to have a look at the script. Can you help?

@peregilk

peregilk commented Nov 6, 2019 via email

@flaviofafe1414

@peregilk hi,
I also have an interest in the medical field and I have the same problem. Did you make any progress?

@muhammadfahid51

> I have a similar question to @peregilk's: how do we add a domain-specific vocab.txt in a language other than English? […]

Hello,
Did you come up with any solution for this ? I have my own custom tokenizer and it has a lot of new words.

@muhammadfahid51

> I did a few more tests on this (as I mentioned in another post). I am no longer convinced about my own results. The challenge is that fine-tuning has a lot of variance. I think the first positive result was mainly a fluke. Even if it gives a marginal improvement, it also adds more complexity (how many words, which words, etc.). Domain-specific pre-training is essential for getting these models to perform well on specialised domains, but the extra words in the dictionary are just a tiny detail that is probably not worth the effort.


@peregilk
One thing that I am confused about: does BERT work on the character level or the word level?
What I mean is: does BERT break a word token further into characters during training and learn embeddings accordingly, or does it only consider the word tokens produced by the tokenizer? I am asking because vocab.txt contains all the basic characters of my language, Urdu; by basic characters I mean the individual letters of the language. Can anyone enlighten me on this?
Let's say we have our new data, but that data is also made of those basic characters, right?
If it were only about whole-word tokens, then the English vocabulary has more than 120k words and the model vocab doesn't contain all of them.

@peregilk

@muhammadfahid51 If I understand things correctly, BERT works on the token level. In addition, it learns multi-token embeddings.

Let's say we have the word "goodness", and that it does not exist in the vocabulary. However, the following tokens exist:
"good"
"ness"
"##ness"

Since this is one word, it will be tokenized as "good"+"##ness". BERT will learn an embedding for ("good"-"##ness") as well as embeddings for both "good" and "##ness". It is not as good as if "goodness" existed directly, but it is reasonable.

However: adding extra words is a bit double-edged. If you have a domain-specific vocabulary and add "goodness" to one of the empty spots, you will have to train it from a random weight. Both "good" and "##ness" already have OK embeddings, so even if the model has never seen "goodness" in the training set before, it already has a reasonable embedding to start from.

If you are training the entire network from scratch, it makes more sense to build a vocabulary that is as efficient as possible, i.e. one that requires as few tokens as possible.

I hope this answers your question.

@muhammadfahid51

> @muhammadfahid51 If I understand things correctly, BERT works on the token level. In addition, it learns multi-token embeddings. […]

@peregilk
What if I want to pre-train on a custom language, say Turkish?
Can I replace some other language's characters in vocab.txt with my own tokens?

And also, to pre-train BERT from scratch, how much data is required?

@peregilk

@muhammadfahid51 Don't interpret any of this as "correct" answers. I am just another researcher struggling with the same issues.

You can use SentencePiece to build a vocabulary from scratch. SentencePiece will search your corpus and find the most efficient tokens.

If you want to start from pre-trained weights you will have to use the same vocabulary. You can manipulate that vocabulary, but really only manipulating the open spots makes sense. In most cases it will be easier (and cheaper) to start with the multilingual pre-trained BERT and train with additional data in your target language than to train from scratch on a separate language (that is, if your language is already part of multilingual BERT, which covers about 100 languages). That is my experience.

I would say a reasonable corpus is 1B words. Multilingual BERT is trained with a bit less than that for each of the languages. The size of the training corpus and the training time are really the big issues. Fine-tuning the vocabulary is (IMHO) not really something you should spend too much time on.
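
A minimal SentencePiece sketch for building a vocabulary from your own corpus (paths and sizes are placeholders; note the README's caveat that externally learned vocabularies are not automatically compatible with this repo's tokenization.py, so the output usually needs post-processing into BERT's vocab.txt format):

```python
import sentencepiece as spm  # requires a recent sentencepiece release

# Train a subword model on your own corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="my_corpus.txt",        # placeholder path
    model_prefix="my_subwords",   # writes my_subwords.model / my_subwords.vocab
    vocab_size=32000,
    model_type="unigram",         # or "bpe"
    character_coverage=0.9995,    # useful for languages with large character sets
)

sp = spm.SentencePieceProcessor(model_file="my_subwords.model")
print(sp.encode("Bu bir deneme cümlesidir.", out_type=str))
```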


@dhruvsakalley

I'm in the same boat: I need to add a larger vocabulary (not directly present in the original BERT vocab) but I also want to use the init checkpoint from the original BERT. When you do that, you quickly run into this: ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((new_vocab_size, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader. On some preliminary investigation, it seems a few code changes are required in run_pretraining.py to handle the case where the bert_config vocab size differs from the init checkpoint. As long as the order of the initial vocab terms is kept the same in the new vocab, it should be possible to initialize the known token weights from the pretrained weights and randomly initialize the rest of the network.
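
A sketch of the idea in the comment above (an assumption-laden outline, not code from this repo): read the pretrained word_embeddings tensor, keep its rows in the same order, append randomly initialized rows for the new tokens, and use the result to initialize the enlarged model. Writing the padded matrix back into a TF checkpoint still needs graph/assign plumbing that is omitted here.

```python
import numpy as np
import tensorflow as tf  # tf.train.load_checkpoint works in TF1 and TF2

old_ckpt = "uncased_L-12_H-768_A-12/bert_model.ckpt"  # placeholder path
new_vocab_size = 32000                                # placeholder size

reader = tf.train.load_checkpoint(old_ckpt)
old_emb = reader.get_tensor("bert/embeddings/word_embeddings")  # [30522, 768]

extra = new_vocab_size - old_emb.shape[0]
# New rows get a small random init (roughly matching BERT's 0.02-stddev init).
new_rows = np.random.normal(0.0, 0.02, size=(extra, old_emb.shape[1]))
new_emb = np.concatenate([old_emb, new_rows.astype(old_emb.dtype)], axis=0)

np.save("word_embeddings_padded.npy", new_emb)
# The padded matrix can then be assigned to the enlarged model's
# word_embeddings variable before continuing pre-training.
```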

@Aktsvigun

@peregilk Good afternoon, and thank you so much for your comprehensive responses. I would like to ask you a small question. You say:
"BERT will learn an embedding for ("good"-"##ness") as well as embeddings for both "good" and "##ness"."
What do you mean by an embedding for ("good"-"##ness")? Perhaps I am mistaken, but I thought that, like any NLP model, BERT only has embeddings for single tokens. Do you mean BERT will learn the interrelation between these two tokens in the attention layers, or does it have special embeddings for such cases? Thanks in advance!

@boggis30

boggis30 commented Jun 10, 2020

> I'm in the same boat: I need to add a larger vocabulary (not directly present in the original BERT vocab) but I also want to use the init checkpoint from the original BERT. […]

@dhruvsakalley I like your idea and wonder if you managed to implement it. Can you share your experience, please?

@ali4friends71

@peregilk
Can you tell me how to train the model after adding our words to vocab.txt?
Is there code for how to train BERT with an additional vocabulary?

@SravaniSegireddy

@dhruvsakalley I am doing exactly what you're trying to implement. May I know whether it has been implemented or not? Were you able to increase the vocab file? If yes, can you please share the code? Thanks.

@ali4friends71

@SravaniSegireddy I implemented it, but it was of no use, as the words get split in two and their meanings change.

@SravaniSegireddy

@ali4friends71 could you please share your code, if possible? Maybe I can get some ideas. Thanks.

@SravaniSegireddy

> @irhallac Let me post an update on my experiences with using custom vocab files during pre-training on a domain-specific corpus. […]

@peregilk Does that mean you can pre-train the model with domain-specific data, but it is not necessary to change the vocab file?
May I know what accuracy improvement you achieved after pre-training with domain-specific data?

@peregilk

peregilk commented Jul 20, 2020 via email

@ali4friends71

@SravaniSegireddy you could use the code from the Colab notebook. Check out the article for further instructions.

@timpal0l

timpal0l commented Nov 2, 2020

But do we really need to manually add domain-specific (out-of-vocabulary) words? Isn't the purpose of word pieces that they can, in theory, construct new words by combining their pieces? And if so, would the semantics of the wordpieces change when doing downstream tasks?

@joancf

joancf commented Dec 17, 2020

If I use new vocabulary entries,
can I initialize their embeddings using the average of their subword parts?
And how can I introduce that into the model?
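
Nobody in the thread posted code for this, but a rough sketch of the averaging idea (placeholders throughout; it reuses tokenization.py from this repo and the checkpoint reader shown earlier):

```python
import numpy as np
import tensorflow as tf
import tokenization  # from google-research/bert

reader = tf.train.load_checkpoint("uncased_L-12_H-768_A-12/bert_model.ckpt")
emb = reader.get_tensor("bert/embeddings/word_embeddings")  # [vocab_size, hidden]

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

def init_from_subwords(word):
    # Average the embeddings of the wordpieces the existing vocab splits `word` into.
    piece_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))
    return emb[piece_ids].mean(axis=0)

# E.g. put "anfield" into an [unusedN] slot by overwriting that row before
# continuing pre-training (the row index is the slot's line number - 1).
unused_slot_id = 1  # placeholder: whichever [unusedN] row you repurposed
emb[unused_slot_id] = init_from_subwords("anfield")
```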

@nagads

nagads commented Feb 1, 2021

> @irhallac it is the [unusedXXX] tokens that can be replaced with any word you like. […]

@peregilk thanks, that helps. To learn domain-specific word embeddings, any clues on the volume of domain-specific corpus needed, assuming we continue pre-training from the released model as a checkpoint?

@peregilk

peregilk commented Feb 3, 2021

@nagads I understand your question, and have gotten it several times before. I usually answer it with "I'll tell you, if you first can tell me what a boat costs!".

It really depends on a lot of things: how good you would like the model to be, how different the domain is from what the original model is trained on, how much data it is possible to get and at what cost, etc. There are a few tricks (like dynamic masking) that you can use to make the most out of your data.

In general, transformer models require A LOT of text. Always. However, domain-specific pre-training is probably the setting where you are able to get reasonable results with moderate amounts of data.

@nagads

nagads commented Feb 4, 2021

> @nagads I understand your question, and have gotten it several times before. I usually answer it with "I'll tell you, if you first can tell me what a boat costs!". […]

@peregilk thanks for the insightful response.

@Yiwen-Yang-666

> Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.

In addition to these methods, we can add our own additional vocabulary by creating an embedding tensor for the additional tokens and concatenating it with the original embedding tensor for the tokens in the original vocab file. Specific details in #82 (comment).
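
A minimal TF1-style sketch of that concatenation idea (my own illustration, not the code from #82; sizes and variable names are placeholders): keep the pretrained embedding table as one variable, create a second, freshly initialized table for the extra tokens, and concatenate them before the lookup.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

hidden_size = 768
orig_vocab_size = 30522   # matches the pretrained checkpoint
extra_vocab_size = 500    # placeholder: number of added tokens

# This variable would be loaded from the checkpoint (e.g. via an assignment map).
orig_embeddings = tf.get_variable(
    "bert/embeddings/word_embeddings", [orig_vocab_size, hidden_size])
# Freshly initialized rows for the new tokens.
extra_embeddings = tf.get_variable(
    "bert/embeddings/extra_word_embeddings", [extra_vocab_size, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))

full_table = tf.concat([orig_embeddings, extra_embeddings], axis=0)
input_ids = tf.placeholder(tf.int32, [None, None])
embedded = tf.nn.embedding_lookup(full_table, input_ids)
```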
