
How to use my own additional vocabulary dictionary? #396

Open
EnteLee opened this issue Jan 24, 2019 · 52 comments

@EnteLee

EnteLee commented Jan 24, 2019

Hello!
We are Korean students.
We would like to implement a Korean slang filtering system using your BERT model.

We are testing by fine-tuning on the CoLA task with run_classifier.py from the existing multilingual model. However, the existing vocabulary is missing many words we need, so we want to use the BERT model with its pre-trained weights after adding words to vocab.txt. Unfortunately, after modifying vocab.txt and bert_config.json, the shapes no longer match those stored in the initial bert_model.ckpt.

We'd like to use the pre-trained weights together with our own additional vocabulary. Is there a way to modify this specification, or are we forced to pre-train from scratch?

Thank you.

For reference, this is the error message we get:
INFO:tensorflow:Error recorded from evaluation_loop: indices[6,33,0] = 105932 is not in [0, 105879)
[[node bert/embeddings/embedding_lookup (defined at /home/user01/code/bert/modeling.py:421) = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bert/embeddings/word_embeddings/read, bert/embeddings/ExpandDims, bert/embeddings/embedding_lookup/axis)]]
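
The error above says the checkpoint's embedding table has 105,879 rows while the modified vocab.txt is producing token ids up to at least 105,932. A quick sanity check (a minimal sketch of my own, not from the original post; the file paths are placeholders) is to compare the line count of vocab.txt against vocab_size in bert_config.json:

```python
import json

vocab_path = "multi_cased_L-12_H-768_A-12/vocab.txt"          # placeholder path
config_path = "multi_cased_L-12_H-768_A-12/bert_config.json"  # placeholder path

with open(vocab_path, encoding="utf-8") as f:
    vocab_lines = sum(1 for _ in f)  # one token per line

with open(config_path, encoding="utf-8") as f:
    config_vocab_size = json.load(f)["vocab_size"]

print(f"vocab.txt tokens: {vocab_lines}, bert_config vocab_size: {config_vocab_size}")
# The checkpoint's word_embeddings table has config_vocab_size rows, so any
# token id >= config_vocab_size triggers the GatherV2 "indices not in [0, N)"
# error shown above.
```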

@rodgzilla
Contributor

Hi!

I think the information you are looking for is in the readme file: https://github.com/google-research/bert#learning-a-new-wordpiece-vocabulary

@bradfox2

bradfox2 commented Feb 2, 2019

Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.
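
For reference, a minimal sketch of what "modify the vocab.txt" means in practice: overwrite some of the [unusedN] placeholder lines with your own tokens, keeping every other line (and the total line count) unchanged. The word list and file names here are placeholders, not from the thread.

```python
# Minimal sketch: fill [unusedN] slots with custom tokens.
# `new_words` and the paths are assumptions; the rest of vocab.txt
# (and its total length) must stay exactly the same.
new_words = ["anfield", "epistaxis", "tachycardia"]  # example tokens

with open("vocab.txt", encoding="utf-8") as f:
    vocab = f.read().splitlines()

it = iter(new_words)
for i, token in enumerate(vocab):
    if token.startswith("[unused"):
        try:
            vocab[i] = next(it)
        except StopIteration:
            break

with open("vocab_custom.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```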

@bsugerman

I noticed that there are a LOT of single foreign characters in the vocab.txt file. I'm wondering whether one could also remove these and replace them with words for fine tuning?

@yzho0907

@bsugerman it seems you really can't avoid all the foreign characters when you use a very large corpus to train the model, and you can't replace them (you could delete them somehow) until you train the model all over again, because the parameters in the embedding layer are part of the pre-trained model.

@peregilk

peregilk commented Apr 2, 2019

I have a domain specific (medical) English corpus that I want to do some additional pre-training on from the Bert checkpoint. However, there are quite a lot of words in the medical vocabulary not present in the vocab.txt-file.

Let's say I want to add the top 500 words in the corpus that are not already in the vocabulary. Is this as easy as just replacing the [unused#] entries in vocab.txt? No additional changes to bert_config.json?

@samreenkazi

samreenkazi commented Apr 14, 2019

I have a similar question to @peregilk's: how do we add a domain-specific vocab.txt in a language other than English? The official repo says: "This repository does not include code for learning a new WordPiece vocabulary; there are a number of open source options available. However, keep in mind that these are not compatible with our tokenization.py library."
So how do we learn a domain-specific vocabulary? With the available multilingual pre-trained weights, BERT did not perform well on a downstream classification task on an Urdu corpus.
Or can we work around this for a domain-specific vocab by modifying the first ~1000 lines of vocab.txt, as suggested by @bradfox2?

@peregilk

@samreenkazi. I ended up using spaCy to make a list of all the words in a portion of the corpus. There are easy built-in functions for listing, for instance, the 10,000 most common words in the text. I then checked this against the BERT vocab file, and ended up adding roughly 400 words in the empty spots in the vocab file.

I did a few tests, and on my very specific medical language it seemed to have a good effect. However, I noticed that it needs quite a lot of pre-training to outperform the standard vocabulary. I trained a couple of days on a 2080Ti until it was better (logical, since the weights for the new vocab are initialised from scratch).

I am not sure if this answers your question about the Urdu corpus. However, if you would like to have a look at the script I used for building the vocab file, just send me a message.
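
A rough sketch of the approach described above (not @peregilk's actual script; the file paths, the 400-word cut-off and the blank spaCy pipeline are assumptions): count word frequencies in the corpus, drop anything already in vocab.txt, and keep the most frequent leftovers as candidates for the [unusedN] slots.

```python
from collections import Counter

import spacy

# spacy.blank("en") gives a tokenizer-only pipeline; swap in your own language code.
nlp = spacy.blank("en")

# For a large corpus, stream line by line instead of reading the whole file.
with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()

freq = Counter(tok.text.lower() for tok in nlp(text) if tok.is_alpha)

with open("vocab.txt", encoding="utf-8") as f:
    bert_vocab = set(line.strip() for line in f)

# Most frequent corpus words that BERT's vocab does not contain as whole tokens.
candidates = [w for w, _ in freq.most_common(10_000) if w not in bert_vocab][:400]
print(candidates[:20])
```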

@samreenkazi

Yes, I would like to have a look at the script for building the vocab.

@peregilk

peregilk commented Apr 30, 2019 via email

@bradfox2

bradfox2 commented May 2, 2019

@peregilk Are you able to push that code up to a repo and link back here? It would be useful for many.

@Dhanachandra

@bradfox2 , @peregilk
You can use a modified version of Tensor2Tensor/text_encoder_build_subword.py code to generate BERT compatible vocab.
https://github.com/kwonmha/bert-vocab-builder

@techmattersinc

> @peregilk Are you able to push that code up to a repo and link back here? It would be useful for many.

Or perhaps post the code on https://gist.github.com/ – it's free of charge.

@datduong

datduong commented May 4, 2019

Hi, I would like to confirm the idea for adding an unseen word. Suppose I have a new word "xyzw". To include this word, the easiest approach is to replace [unused1] with "xyzw" in vocab.txt. Then I will need to run fine-tuning on my specialized data so that the word vector for "xyzw" can be learned. Is this the correct idea?

@bradfox2

bradfox2 commented May 4, 2019

> @bradfox2 , @peregilk
> You can use a modified version of Tensor2Tensor/text_encoder_build_subword.py code to generate BERT compatible vocab.
> https://github.com/kwonmha/bert-vocab-builder

That is also available in the BERT repo. The question is more about the availability of some already developed, easy-to-use vocab comparison scripts.

@irhallac

irhallac commented Jun 21, 2019

> Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.

I also need to add a few thousand new tokens that don't exist in the BERT vocab file.
When I check the vocab file of the model (multi_cased_L-12_H-768_A-12), the first 100 tokens are "unused" tokens (unused0–unused99), and they are followed by the [UNK], [CLS], [SEP] and [MASK] tokens.
I assume you wouldn't suggest modifying those special tokens, or the numbers and letters that come right after them, even though they are within the first ~1000 lines. Can you help me see what I am missing here?

Shouldn't we just modify the tokens that aren't likely to appear in the corpus we use for fine-tuning?

@peregilk

@irhallac it is the [unusedXXX] tokens that can be replaced with any word you like. I am running some experiments on how effective this really is, but from my understanding you should prioritise words that are frequent in the domain-specific corpus, are not in vocab.txt, and where the current tokenization is unlikely to be any good. You should also take into account that BERT is very good at tokenizing long words.

Let's say you have an English football-specific corpus. You notice that the word "footballs" is not in vocab.txt. It is, however, meaningless to add it. BERT tokenizes this as "football"+"##s" -> [2375] [2016] (look at the line numbers in vocab.txt), and has already learned a very good representation both for the individual tokens and for the combination. However, your text is a lot about football stadiums, and you see that "Anfield" is not in vocab.txt. This will be tokenized as "an"+"##field", and there is reason to believe that the current learned embedding for it is not very useful.

If you add "anfield" to one of the unused spots in vocab.txt and then do pre-training from the last checkpoint, its embedding will just start from random (think "0"), and the model might learn the word faster since it will not be confused by other uses of the tokens [an] and [##field]. This is my understanding of how this works.

It is important to remember that the line numbers in vocab.txt matter. For instance, "!" is at line #1000, and it should still be there after you edit the file. For pre-training from an existing checkpoint you should not change the size of the vocab file. This means you have around 1000 extra words at your disposal. For pre-training from scratch, build a new vocab based only on your corpus.

Since BERT does an excellent job of tokenising and learning these combinations, do not expect dramatic improvements from adding words to the vocab. In my experience, adding very specific terms, like common long medical Latin words, has some effect. Adding words like "footballs" will likely just have a negative effect, since the current vector is already pretty good.
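
To see what the repo's tokenizer actually does with an out-of-vocabulary word, a minimal sketch using tokenization.py from this repo (the vocab path is a placeholder, and the exact subword split and ids depend on which vocab file you use):

```python
import tokenization  # tokenization.py from the google-research/bert repo

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

for word in ["footballs", "anfield"]:
    pieces = tokenizer.tokenize(word)
    ids = tokenizer.convert_tokens_to_ids(pieces)
    print(word, pieces, ids)

# A word that splits into pieces with well-trained embeddings ("football" + "##s")
# usually needs no vocab entry; a name that splits into unrelated pieces
# ("an" + "##field") is a better candidate for an [unusedN] slot.
```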

@irhallac

@peregilk thank you. In the model I downloaded there are only 100 [unusedXXX] tokens in vocab.txt, not 1000. But you say 1000 can be changed?

@peregilk

They are in two blocks: from line #2 [unused0] to line #100 [unused98]. Then there are 4 tokens that absolutely should not be changed: [UNK] [CLS] [SEP] [MASK].

Then they continue from line #105 to line #999. In total, roughly 1000 unused tokens.

@irhallac

irhallac commented Jun 21, 2019

@peregilk btw I want to use the BERT model on Turkish.
I downloaded it from
download_url = 'https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip'
and the vocab file looks like this:

.
[unused97]
[unused98]
[unused99]
[UNK]
[CLS]
[SEP]
[MASK]
<S>
<T>
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
.

@peregilk

OK. I did not know. Then it is only the uncased version that has 1000 unused spots.

@datduong

Alternatively, can you also remove non-English words and the rare symbols? Would this significantly affect the model?

@bhoomit

bhoomit commented Jun 24, 2019

> They are in two blocks: from line #2 [unused0] to line #100 [unused98]. Then there are 4 tokens that absolutely should not be changed: [UNK] [CLS] [SEP] [MASK].
>
> Then they continue from line #105 to line #999. In total, roughly 1000 unused tokens.

Does that mean I can't add more than 1000 words?

@jinamshah

> @samreenkazi. I ended up using spaCy to make a list of all the words in a portion of the corpus. […]

@peregilk can you please share the code to modify the vocabulary and then pre-train so the model adapts to the new vocab? Also, can you tell me the metrics on the basis of which you decided that the new weights were better?

@ivanacorovic

> Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.

@bradfox2 What are we supposed to do after these changes? How is the model retrained?

@Dhanachandra

> @irhallac it is the [unusedXXX] tokens that can be replaced with any word you like. […]

@peregilk
I am also working in the medical domain. Can you please share the common long medical Latin words that you added to the vocab?

@peregilk

@irhallac Let me post an update on my experiences with using custom vocab files during pre-training on a domain-specific corpus. As far as I know, the only reasonable way to test whether this works is to validate it by also fine-tuning the pre-trained networks. You will have to do this multiple times before you get reliable results.

My initial experiments indicated that adding custom words to the vocab file had some effect. However, at least on my corpus, which can be described as "medical tweets", this effect just disappears after running the domain-specific pre-training for a while.

After spending quite some time on this, I have ended up dropping the custom vocab files entirely. BERT seems to be able to learn these specialised words by tokenizing them.

@bradfox2

bradfox2 commented Sep 7, 2019

> Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.
>
> @bradfox2 What are we supposed to do after these changes? How is the model retrained?

Fine-tune the model on your specific text corpus. Model weights are tuned during initial pre-training with the tokenized vocabulary, so you need to keep the same token mapped to the same input 'node'. The first ~1000 tokens are meaningless and the model learns to essentially ignore them. Give the meaningless vocab some relevance with a custom dataset, continue fine-tuning, and the model will start to give the previously ignored tokens/vocab some weight (pun intended).

@mahanswaray

> @samreenkazi. I ended up using spaCy to make a list of all the words in a portion of the corpus. […]

Hey, I would like to have a look at the script. Can you help?

@peregilk

peregilk commented Nov 6, 2019 via email

@flaviofafe1414

@peregilk hi,
I also have an interest in the medical field and I have the same problem. Did you make any progress?

@muhammadfahid51

> I have a similar question to @peregilk's: how do we add a domain-specific vocab.txt in a language other than English? […]

Hello,
Did you come up with any solution for this ? I have my own custom tokenizer and it has a lot of new words.

@muhammadfahid51

> I did a few more tests on this (as I mentioned in another post). I am no longer convinced about my own results. The challenge is that fine-tuning has a lot of variance. I think the first positive result was mainly a fluke. Even if it gives a marginal improvement, it also adds more complexity (how many words, which words, etc.). Domain-specific pre-training is essential for getting these models to perform well on specialised domains, but the extra words in the dictionary are just a tiny detail that is probably not worth the effort.


@peregilk
One thing that I am confused about: does BERT work on the character level or the word level?
What I mean is: does BERT break a word token further into characters during training and learn embeddings accordingly, or does it only consider the word tokens produced by the tokenizer? I am asking because vocab.txt contains all the basic characters of my language, Urdu; by basic characters I mean the individual letters of the language. Can anyone enlighten me on this?
Let's say we have our new data, but that data is also made of those basic characters, right?
If it were only about whole-word tokens, then the English vocabulary has more than 120k words and the model vocab doesn't contain all of them.

@peregilk

@muhammadfahid51 If I understand things correctly, BERT works on the token level. In addition, it learns multi-token embeddings.

Let's say we have the word "goodness", and that it does not exist in the vocabulary. However, the following tokens exist:
"good"
"ness"
"##ness"

Since this is one word, it will be tokenized as "good"+"##ness". BERT will learn an embedding for ("good"-"##ness") as well as embeddings for both "good" and "##ness". It is not as good as if "goodness" existed directly, but it is reasonable.

However: adding extra words is a bit double-edged. If you have a domain-specific vocabulary and add "goodness" to one of the empty spots, you will have to train it from a random weight. Both "good" and "##ness" already have OK embeddings, so even if the model has never seen "goodness" in the training set before, it already has a reasonable embedding to start from.

If you are training the entire network from scratch, it makes more sense to build a vocabulary that is as efficient as possible, i.e. one that requires as few tokens as possible.

I hope this answers your question.

@muhammadfahid51

> @muhammadfahid51 If I understand things correctly, BERT works on the token level. In addition, it learns multi-token embeddings. […]

@peregilk
What if I want to pre-train on a custom language, say Turkish?
Can I replace some other language's characters in vocab.txt with my own tokens?

And also, to pre-train BERT from scratch, how much data is required?

@peregilk

@muhammadfahid51 Don't interpret any of this as "correct" answers. I am just another researcher struggling with the same issues.

You can use SentencePiece to build a vocabulary from scratch. SentencePiece will search your corpus and find the most efficient tokens.

If you want to start from pre-trained weights you will have to use the same vocabulary. You can manipulate that vocabulary, but really only manipulating the open spots makes sense. In most cases it will be easier (and cheaper) to start with the multilingual pre-trained BERT and train with additional data in your target language than to train from scratch on a separate language (that is, if your language is already part of multilingual BERT, which covers about 100 languages). That is my experience.

I would say a reasonable corpus is 1B words. Multilingual BERT is trained with a bit less than that for each of the languages. The size of the training corpus and the training time are really the big issues. Fine-tuning the vocabulary is (IMHO) not really something you should spend too much time on.
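
A minimal SentencePiece sketch for building a vocabulary from your own corpus (paths and sizes are placeholders; note the README's caveat that externally learned vocabularies are not automatically compatible with this repo's tokenization.py, so the output usually needs post-processing into BERT's vocab.txt format):

```python
import sentencepiece as spm  # requires a recent sentencepiece release

# Train a subword model on your own corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="my_corpus.txt",        # placeholder path
    model_prefix="my_subwords",   # writes my_subwords.model / my_subwords.vocab
    vocab_size=32000,
    model_type="unigram",         # or "bpe"
    character_coverage=0.9995,    # useful for languages with large character sets
)

sp = spm.SentencePieceProcessor(model_file="my_subwords.model")
print(sp.encode("Bu bir deneme cümlesidir.", out_type=str))
```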


@dhruvsakalley

I'm in the same boat: I need to add a larger vocabulary (not directly present in the original BERT vocab) but I also want to use the init checkpoint from the original BERT. When you do that, you quickly run into this: ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((new_vocab_size, 768)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([30522, 768]) from checkpoint reader. On some preliminary investigation, it seems a few code changes are required in run_pretraining.py to handle the case where the bert_config vocab size differs from the init checkpoint. As long as the order of the initial vocab terms is kept the same in the new vocab, it should be possible to initialize the known token weights from the pretrained weights and randomly initialize the rest of the network.
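
A sketch of the idea in the comment above (an assumption-laden outline, not code from this repo): read the pretrained word_embeddings tensor, keep its rows in the same order, append randomly initialized rows for the new tokens, and use the result to initialize the enlarged model. Writing the padded matrix back into a TF checkpoint still needs graph/assign plumbing that is omitted here.

```python
import numpy as np
import tensorflow as tf  # tf.train.load_checkpoint works in TF1 and TF2

old_ckpt = "uncased_L-12_H-768_A-12/bert_model.ckpt"  # placeholder path
new_vocab_size = 32000                                # placeholder size

reader = tf.train.load_checkpoint(old_ckpt)
old_emb = reader.get_tensor("bert/embeddings/word_embeddings")  # [30522, 768]

extra = new_vocab_size - old_emb.shape[0]
# New rows get a small random init (roughly matching BERT's 0.02-stddev init).
new_rows = np.random.normal(0.0, 0.02, size=(extra, old_emb.shape[1]))
new_emb = np.concatenate([old_emb, new_rows.astype(old_emb.dtype)], axis=0)

np.save("word_embeddings_padded.npy", new_emb)
# The padded matrix can then be assigned to the enlarged model's
# word_embeddings variable before continuing pre-training.
```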

@Aktsvigun

@peregilk Good afternoon, and thank you so much for your comprehensive responses. I would like to ask you a small question. You say:
"BERT will learn an embedding for ("good"-"##ness") as well as embeddings for both "good" and "##ness"."
What do you mean by an embedding for ("good"-"##ness")? Perhaps I am mistaken, but I thought that, like any NLP model, BERT only has embeddings for single tokens. Do you mean BERT will learn the interrelation between these two tokens in the attention layers, or does it have special embeddings for such cases? Thanks in advance!

@boggis30

boggis30 commented Jun 10, 2020

> I'm in the same boat: I need to add a larger vocabulary (not directly present in the original BERT vocab) but I also want to use the init checkpoint from the original BERT. […]

@dhruvsakalley I like your idea and wonder if you managed to implement it. Can you share your experience, please?

@ali4friends71

@peregilk
Can you tell me how to train the model after adding our words to vocab.txt?
Is there code for how to train BERT with an additional vocabulary?

@SravaniSegireddy

@dhruvsakalley I am doing exactly what you're trying to implement. May I know whether it has been implemented or not? Were you able to increase the vocab file? If yes, can you please share the code? Thanks.

@ali4friends71

@SravaniSegireddy I implemented it, but it was of no use, as the words get split in two and their meanings change.

@SravaniSegireddy

@ali4friends71 could you please share your code, if possible? Maybe I can get some ideas. Thanks.

@SravaniSegireddy

> @irhallac Let me post an update on my experiences with using custom vocab files during pre-training on a domain-specific corpus. […]

@peregilk Does that mean you can pre-train the model with domain-specific data, but it is not necessary to change the vocab file?
May I know what accuracy improvement you achieved after pre-training with domain-specific data?

@peregilk

peregilk commented Jul 20, 2020 via email

@ali4friends71

@SravaniSegireddy you could use the code from the Colab notebook. Check out the article for further instructions.

@timpal0l

timpal0l commented Nov 2, 2020

But do we really need to manually add domain-specific (out-of-vocabulary) words? Isn't the purpose of word pieces that they can, in theory, construct new words by combining their pieces? And if so, would the semantics of the wordpieces change when doing downstream tasks?

@joancf

joancf commented Dec 17, 2020

If I use new vocabulary entries,
can I initialize their embeddings using the average of their subword parts?
And how can I introduce that into the model?
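
Nobody in the thread posted code for this, but a rough sketch of the averaging idea (placeholders throughout; it reuses tokenization.py from this repo and the checkpoint reader shown earlier):

```python
import numpy as np
import tensorflow as tf
import tokenization  # from google-research/bert

reader = tf.train.load_checkpoint("uncased_L-12_H-768_A-12/bert_model.ckpt")
emb = reader.get_tensor("bert/embeddings/word_embeddings")  # [vocab_size, hidden]

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)

def init_from_subwords(word):
    # Average the embeddings of the wordpieces the existing vocab splits `word` into.
    piece_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))
    return emb[piece_ids].mean(axis=0)

# E.g. put "anfield" into an [unusedN] slot by overwriting that row before
# continuing pre-training (the row index is the slot's line number - 1).
unused_slot_id = 1  # placeholder: whichever [unusedN] row you repurposed
emb[unused_slot_id] = init_from_subwords("anfield")
```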

@nagads

nagads commented Feb 1, 2021

> @irhallac it is the [unusedXXX] tokens that can be replaced with any word you like. […]

@peregilk thanks, that helps. To learn domain-specific word embeddings, any clues on the volume of domain-specific corpus needed, assuming we continue pre-training from the released model as a checkpoint?

@peregilk

peregilk commented Feb 3, 2021

@nagads I understand your question, and have gotten it several times before. I usually answer it with "I'll tell you, if you first can tell me what a boat costs!".

It really depends on a lot of things: how good you would like the model to be, how different the domain is from what the original model is trained on, how much data it is possible to get and at what cost, etc. There are a few tricks (like dynamic masking) that you can use to make the most out of your data.

In general, transformer models require A LOT of text. Always. However, domain-specific pre-training is probably the setting where you are able to get reasonable results with moderate amounts of data.

@nagads

nagads commented Feb 4, 2021

> @nagads I understand your question, and have gotten it several times before. I usually answer it with "I'll tell you, if you first can tell me what a boat costs!". […]

@peregilk thanks for the insightful response.

@Yiwen-Yang-666

> Pretrain from scratch or modify the first ~1000 lines of the vocab.txt file with the vocab you'd like to add.

In addition to these methods, we can add our own additional vocabulary by creating an embedding tensor for the additional tokens and concatenating it with the original embedding tensor for the tokens in the original vocab file. Specific details in #82 (comment).
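
A minimal TF1-style sketch of that concatenation idea (my own illustration, not the code from #82; sizes and variable names are placeholders): keep the pretrained embedding table as one variable, create a second, freshly initialized table for the extra tokens, and concatenate them before the lookup.

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

hidden_size = 768
orig_vocab_size = 30522   # matches the pretrained checkpoint
extra_vocab_size = 500    # placeholder: number of added tokens

# This variable would be loaded from the checkpoint (e.g. via an assignment map).
orig_embeddings = tf.get_variable(
    "bert/embeddings/word_embeddings", [orig_vocab_size, hidden_size])
# Freshly initialized rows for the new tokens.
extra_embeddings = tf.get_variable(
    "bert/embeddings/extra_word_embeddings", [extra_vocab_size, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))

full_table = tf.concat([orig_embeddings, extra_embeddings], axis=0)
input_ids = tf.placeholder(tf.int32, [None, None])
embedded = tf.nn.embedding_lookup(full_table, input_ids)
```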
