Is the vocab.txt correct? #1
Comments
Yes, the WordPiece vocab is exactly the same as the original BERT's, for two reasons. First, we wanted to start from the pre-trained BERT released by Google, which requires us to use the same WordPiece vocab. Second, because the WordPiece vocab is based on subword units, any new word in a biomedical corpus can still be split into known pieces and mapped to proper embeddings (which may be tuned during fine-tuning). We could try building our own vocab from biomedical corpora, but that would lose compatibility with the original pre-trained BERT.
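To illustrate the subword point, here is a minimal sketch of greedy longest-match-first WordPiece splitting. The `TOY_VOCAB` set and the `wordpiece_tokenize` helper are hypothetical and only for illustration; the real vocab.txt is far larger and will split words differently, and the real tokenizer also handles casing, punctuation, and a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece units by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the contents of the real vocab.txt.
TOY_VOCAB = {"[UNK]", "the", "ace", "##ta", "##min", "##op", "##hen"}

print(wordpiece_tokenize("acetaminophen", TOY_VOCAB))
# ['ace', '##ta', '##min', '##op', '##hen']
print(wordpiece_tokenize("xyz", TOY_VOCAB))
# ['[UNK]']
```

A drug name that is not in the vocab as a whole word still maps to a sequence of known pieces, each with its own embedding, which is why a general-domain vocab can remain usable on biomedical text.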
Got it! Thanks for the quick and helpful reply 👍 I can understand why keeping compatibility with the original BERT is important. Personally, I would like to have a custom dictionary, since I think there might be an interesting opportunity for fine-tuning: a lot of medical jargon (like drug names and chemicals) has a somewhat unique internal structure that is currently lost during subword tokenization. But it'd be rude to ask you to train that! Feel free to close, and thank you for this great contribution!
We have a plan for using a custom dictionary, but pre-training such a model will require many more GPU hours than starting from the pre-trained BERT. We'll share it if it works. Thank you for your interest, and I'll close the issue.
I wonder if there's any update on using the custom dictionary. Is it a work in progress, or on your TODO list?
Hi @phosseini, Thanks.
Hi @jhyuklee: A random idea I had: would it be possible to use a custom vocabulary without redoing the BERT pretraining? One way to transfer the model onto a different vocabulary might proceed as follows:
The benefit is that this avoids most of the expensive BERT pre-training: only the first layer would be trained from scratch, rather than the whole model. Thoughts?
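As a rough illustration of the "only the first layer is trained" idea (not a real BERT setup), here is a toy sketch in plain Python. Everything in it is hypothetical: `FROZEN_WEIGHTS` stands in for the pre-trained layers, `new_vocab_embeddings` for a fresh embedding table over a custom vocabulary, and the squared-error targets are made up.

```python
# Toy illustration only: the "rest of the model" is a fixed weight vector,
# and the embedding table for the new vocabulary is the only trainable part.
FROZEN_WEIGHTS = (1.0, -1.0)  # stands in for the pre-trained layers; never updated


def predict(embedding):
    """Frozen 'model': a dot product between fixed weights and an embedding."""
    return sum(w * e for w, e in zip(FROZEN_WEIGHTS, embedding))


def train_embeddings(embeddings, examples, lr=0.1, epochs=50):
    """Gradient descent on squared error, updating only the embedding table."""
    for _ in range(epochs):
        for token, target in examples:
            error = predict(embeddings[token]) - target
            # d(error^2)/d(embedding_i) = 2 * error * w_i; the weights get no update.
            embeddings[token] = [
                e - lr * 2.0 * error * w
                for e, w in zip(embeddings[token], FROZEN_WEIGHTS)
            ]
    return embeddings


# Hypothetical new-vocabulary entries, initialised from scratch.
new_vocab_embeddings = {"acetaminophen": [0.0, 0.0], "ibuprofen": [0.0, 0.0]}
examples = [("acetaminophen", 1.0), ("ibuprofen", -1.0)]  # made-up targets
train_embeddings(new_vocab_embeddings, examples)
```

In a real framework the same effect would come from disabling gradients for every parameter except the new embedding matrix, so the cost is dominated by forward and backward passes rather than by learning the whole network again.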
Hi @jhyuklee,
Thank you!
Just a general question, I guess: after inspecting the vocab.txt, it doesn't seem to be particularly biomedical (it looks like the original BERT one). Is this correct?
I'm trying to use these pretrained models in an experiment for NER, and I'd like to be able to obtain a distributional vector for a given sequence of tokens (ideally bolting it into an existing Keras model, but I'm not set on that idea).