How do I pre-train the T5 model in HuggingFace library using my own text corpus? #5079
Comments
Hi @abhisheknovoic, this might help: https://huggingface.co/transformers/model_doc/t5.html#training
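For reference, the linked docs describe the unsupervised denoising objective. A minimal sketch of that setup, assuming a recent transformers release where the target argument is called labels (older 2.x releases used a different keyword, see below):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Corrupted input: masked spans are replaced by sentinel tokens <extra_id_0>, <extra_id_1>, ...
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
# Target: each sentinel followed by the span it replaced, plus a closing sentinel
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
```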
@patil-suraj, do you mean this class: T5ForConditionalGeneration? Also, at the top of the page there is the following code:
Any idea which class the model is instantiated from? I could not find any class with an lm_labels parameter. Thanks
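If I'm reading the versions right, the 2.x releases discussed in this thread passed the language-modeling target to T5ForConditionalGeneration.forward as lm_labels, and later releases renamed it to labels. A rough sketch of a 2.x-style call (on a newer install, pass labels instead):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = torch.tensor([tokenizer.encode("The <extra_id_0> walks in <extra_id_1> park")])
targets = torch.tensor([tokenizer.encode("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>")])

# transformers 2.x returns tuples; the loss is the first element when lm_labels is given
outputs = model(input_ids=input_ids, lm_labels=targets)
loss = outputs[0]
```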
Yes, it's T5ForConditionalGeneration. Pinging @patrickvonplaten for more details.
@patil-suraj, I tried the following code, which throws an error. Any idea why? Thanks
My versions are:
If you are using 2.11.0, then use
@patil-suraj, thanks. I have installed the master version, but it still fails with the same error. It seems like I need to specify something for decoder_start_token_id.
OK, I got it working. I initialized the config as follows:
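The exact config snippet isn't shown above, but one way to avoid the decoder_start_token_id error when building a model from a fresh config is roughly the following (T5 conventionally uses the pad token as the decoder start token):

```python
from transformers import T5Config, T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# T5 uses the pad token id as the decoder start token id
config = T5Config(
    vocab_size=tokenizer.vocab_size,
    decoder_start_token_id=tokenizer.pad_token_id,
)
model = T5ForConditionalGeneration(config)
```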
@patil-suraj, however, if we use the master branch, it seems like the tokenizers are broken. The T5 tokenizer doesn't tokenize the sentinel tokens correctly.
Feel free to also open a PR to correct
Just saw that @patil-suraj already did this; awesome, thanks :-) @abhisheknovoic, regarding the T5 tokenizer, can you post some code here that shows that T5 tokenization is broken? (It would be great if we can easily reproduce the error.)
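Not a confirmed reproduction, but a minimal check along these lines should show whether the sentinel tokens survive as single tokens (on the broken master install they reportedly get split into sub-pieces):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

tokens = tokenizer.tokenize("The <extra_id_0> walks in <extra_id_1> park")
print(tokens)  # expect '<extra_id_0>' and '<extra_id_1>' to appear as single tokens
print(tokenizer.convert_tokens_to_ids(["<extra_id_0>", "<extra_id_1>"]))
```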
@patrickvonplaten, it would be nice if we also added seq2seq (T5, BART) model pre-training examples to the official examples. cc @sshleifer
Definitely!
Not sure if this should be a separate issue or not, but I am having difficulty training my own T5 tokenizer. When I train a BPE tokenizer using the amazing huggingface tokenizers library and attempt to load it via tokenizer = T5Tokenizer.from_pretrained('./tokenizer'), I get the following error:
I also attempted to train a SentencePiece model instead, again using the amazing huggingface tokenizers library, and I get the same error. Am I doing something wrong? Transformers version: 2.11.0. Here is a Colab to reproduce the error: https://colab.research.google.com/drive/1WX1Q2Ze9k0SxFMLLv1aFgVGBFMEVTyDe?usp=sharing
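For what it's worth, T5Tokenizer wraps a SentencePiece .model file rather than a tokenizers-library vocabulary, so one workaround is to train the vocabulary with the sentencepiece package directly and point T5Tokenizer at the resulting model file. A rough sketch (corpus.txt and t5_spm are placeholder names):

```python
import sentencepiece as spm
from transformers import T5Tokenizer

# Train a unigram SentencePiece model on a plain-text corpus (placeholder path)
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="t5_spm",
    vocab_size=32000,
    model_type="unigram",
)

# T5Tokenizer takes the SentencePiece .model file and adds the <extra_id_*> sentinels
tokenizer = T5Tokenizer("t5_spm.model", extra_ids=100)
print(tokenizer.tokenize("pre-training T5 on a custom corpus <extra_id_0>"))
```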
@mfuntowicz @n1t0, maybe you can help here.
The pre-training scripts would really help. The original Mesh TensorFlow implementation is very complicated to understand.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
We've released nanoT5, which reproduces T5-style (similar to BART) pre-training in PyTorch (not Flax). You can take a look! Any suggestions are more than welcome.
Hello,
I understand how the T5 architecture works, and I have my own large corpus where I mask spans of tokens and replace them with sentinel tokens.
I also understand the tokenizers in HuggingFace, especially the T5 tokenizer.
Can someone point me to a document, or refer me to the class I need, to pretrain the T5 model on my corpus using the masked language modeling approach?
Thanks
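A toy sketch of the span corruption described above, for context (illustrative only; it does not reproduce the exact noise schedule from the T5 paper):

```python
import random

def span_corrupt(words, mask_prob=0.15, max_span=3):
    """Replace random word spans with sentinel tokens and build the matching
    target string, in the format T5's denoising loss expects. Toy version only."""
    source, target, i, sentinel_id = [], [], 0, 0
    while i < len(words):
        if random.random() < mask_prob and sentinel_id < 100:
            span_len = random.randint(1, max_span)
            sentinel = f"<extra_id_{sentinel_id}>"
            source.append(sentinel)
            target.append(sentinel)
            target.extend(words[i:i + span_len])
            i += span_len
            sentinel_id += 1
        else:
            source.append(words[i])
            i += 1
    target.append(f"<extra_id_{sentinel_id}>")  # closing sentinel
    return " ".join(source), " ".join(target)

src, tgt = span_corrupt("the cute dog walks in the green park".split())
print(src)  # e.g. "the <extra_id_0> walks in the <extra_id_1> park"
print(tgt)  # e.g. "<extra_id_0> cute dog <extra_id_1> green <extra_id_2>"
```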