May I use my own vocabulary? #89
Comments
Hello @520jefferson, there are examples at https://github.com/google/sentencepiece showing how to create custom vocabularies for T5. Once the custom vocab has been created as an spm file, you can use HF to load it as a tokenizer. Please let us know exactly what you want to achieve, and we can provide further guidance.
Thanks for the reply. My code and vocab are as follows. I uploaded vocab.txt (vocab.json is manually constructed from vocab.txt) and merges.txt to Google Drive. I want to use the tokenizer to load the vocab, tokenize my sentences, and feed them to the T5 model. When I load the tokenizer I get this error:

AttributeError: 'tokenizers.models.WordLevel' object has no attribute 'truncation'

So I want to load the vocab into a tokenizer and use it like this:

source = tokenizer.batch_encode_plus(
    [source_text],
    max_length=75,
    pad_to_max_length=True,
    truncation=True,
    padding="max_length",
    return_tensors='pt',
)
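That AttributeError typically means `batch_encode_plus` was called on a raw `tokenizers` object; truncation, padding, and `batch_encode_plus` live on the `transformers` tokenizer classes. One way around it is to wrap the word-level tokenizer in `PreTrainedTokenizerFast`. A minimal sketch with a tiny inline placeholder vocab (the tokens and ids here are illustrative, not the real vocab.json):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Placeholder vocab; in practice this would come from your vocab.json.
vocab = {"<pad>": 0, "<unk>": 1, "hello": 2, "world": 3}

core = Tokenizer(WordLevel(vocab, unk_token="<unk>"))
core.pre_tokenizer = Whitespace()  # split on whitespace before the vocab lookup

# The wrapper provides truncation/padding and batch_encode_plus.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=core,
    unk_token="<unk>",
    pad_token="<pad>",
)

source = tokenizer.batch_encode_plus(
    ["hello world"],
    max_length=8,
    truncation=True,
    padding="max_length",
)
# source["input_ids"][0] is [2, 3] padded with 0s to length 8
```

With torch installed you can also pass `return_tensors='pt'` as in your snippet; note that `pad_to_max_length` is deprecated in favour of `padding="max_length"`, so passing both is redundant.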
I want to train T5 from scratch and use my own vocabulary.
I can load the model like this:
config = T5Config.from_json_file(config_file)
model = T5ForConditionalGeneration(config)
The vocabulary is shown below; it seems the tokenizer cannot load this vocab. How should I load it into a proper tokenizer?
{
"": 0,
"": 1,
"": 2,
"": 3,
"": 4,
",": 5,
"的": 6,
"?": 7,
"了": 8,
.....
.....
.....
"<s_181>": 33786,
"<s_182>": 33787,
"<s_183>": 33788,
"<s_184>": 33789,
"<s_185>": 33790,
"<s_186>": 33791,
"<s_187>": 33792,
"<s_188>": 33793,
"<s_189>": 33794
}
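A `{token: id}` mapping in exactly this shape is what `tokenizers.models.WordLevel` accepts, so the vocab.json could be loaded directly and wrapped for use with T5. The first five keys render as empty strings above, most likely because angle-bracket token names (e.g. `<pad>`) were swallowed by the page's HTML rendering; the special-token names in this sketch are therefore assumptions, and only a few entries from the dump are reproduced:

```python
import json
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Stand-in for the real vocab.json: the special-token names are assumed,
# and only a handful of entries from the issue are included.
vocab = {"<pad>": 0, "</s>": 1, "<unk>": 2,
         ",": 5, "的": 6, "?": 7, "了": 8, "<s_181>": 33786}
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

# Load the mapping back and build a word-level tokenizer from it.
with open("vocab.json", encoding="utf-8") as f:
    word_vocab = json.load(f)

core = Tokenizer(WordLevel(word_vocab, unk_token="<unk>"))
core.pre_tokenizer = Whitespace()

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=core,
    unk_token="<unk>",
    pad_token="<pad>",
    eos_token="</s>",
)

ids = tokenizer.batch_encode_plus(
    ["的 了"], max_length=4, truncation=True, padding="max_length"
)["input_ids"][0]
# "的" and "了" map to 6 and 8, then the sequence is padded with the pad id 0
```

Note that `WordLevel` tokenizes whole pre-tokenized words only; since this vocab is largely single Chinese characters, the pre-tokenization step has to produce per-character pieces for real input, or the lookup will fall back to `<unk>`.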