
May I use my own vocabulary? #89

Open

520jefferson opened this issue Jun 8, 2022 · 2 comments

Comments

@520jefferson

I want to train T5 from scratch using my own vocabulary.

I can load the model like this:

```python
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config.from_json_file(config_file)
model = T5ForConditionalGeneration(config)
```

The vocabulary looks like the example below (the first few keys appear to be special tokens whose angle-bracket names did not render here). It seems the tokenizer cannot load this vocab. How should I load it into a proper tokenizer?
```json
{
    "": 0,
    "": 1,
    "": 2,
    "": 3,
    "": 4,
    ",": 5,
    "的": 6,
    "?": 7,
    "了": 8,
    ...
    "<s_181>": 33786,
    "<s_182>": 33787,
    "<s_183>": 33788,
    "<s_184>": 33789,
    "<s_185>": 33790,
    "<s_186>": 33791,
    "<s_187>": 33792,
    "<s_188>": 33793,
    "<s_189>": 33794
}
```

@mwitiderrick
Contributor

mwitiderrick commented Jul 18, 2022

Hello @520jefferson, there are examples at https://github.com/google/sentencepiece showing how to create custom vocabularies for T5. Once the custom vocab has been created as an .spm model file, you can load it with Hugging Face transformers as a tokenizer.
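For reference, a minimal sketch of that workflow, assuming a one-sentence-per-line `corpus.txt` (the file name, the `my_t5` prefix, and the vocab size are placeholders, not values from this thread):

```python
# Sketch: train a unigram sentencepiece model and load it as a T5 tokenizer.
# Assumes `pip install sentencepiece transformers` and a corpus.txt on disk.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder training corpus, one sentence per line
    model_prefix="my_t5",      # writes my_t5.model and my_t5.vocab
    vocab_size=32000,          # placeholder size
    model_type="unigram",      # T5 tokenizers are unigram sentencepiece models
    pad_id=0, eos_id=1, unk_id=2, bos_id=-1,  # match T5's special-token layout
)

from transformers import T5Tokenizer

tokenizer = T5Tokenizer("my_t5.model")  # vocab_file is the first positional argument
print(tokenizer.tokenize("你 觉得 大人 辛苦 还是 学生 辛苦"))
```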

Please let us know exactly what you want to achieve, and we can provide further guidance.

@520jefferson
Author

@mwitiderrick

Thanks for the reply. The code and vocab are as follows.

1. The vocab.txt (vocab.json is manually constructed from vocab.txt) and merges.txt are uploaded to Google Drive:
vocab.txt:https://drive.google.com/file/d/10jC8L_-RDLRv5QkAato8nJWGU1UQQcz1/view?usp=sharing
vocab.json:https://drive.google.com/file/d/1e5Ll0bAHhikhnYV5XaW3NB8aTSWdCvnC/view?usp=sharing
merges.txt:https://drive.google.com/file/d/1ifXlQaYod_kobqgNe82tmHTtHpxBYnBq/view?usp=sharing
2. The sentences for training, validation, and test look like this (after BPE, with tokens split by spaces):
```
你 觉得 大人 辛苦 还是 学生 辛苦 都 很 辛苦
头条 文章 没 啥 违规 , 却 被 小@@ 浪@@ 浪 屏蔽 了 , 而且 删 了 先生 的 转发 评价 , 农历 新年 将 至 , 俺 不想 发火 , 行 , 俺 再 发 一遍 ! 怎么 删 了 , 还 没 看 呢
专辑 有 签名 么 ? ! … 没有 机会 去 签@@ 售@@ 会 啦 幸好 里面 的 容 和 小 卡片 有 签名
你 帮 我 买 东西 吗 你 给钱 我 , 当然 帮 你 买 耶
你 说 那个 早晨 喝 那个 水有 什么 好处 可以 提高 睡眠 质量 养成 良好 的 睡眠 时间 和 习惯 慢慢 养成 早睡早起 的 习惯 , 习@@ 惯@@ 成@@ 自然
求个 风景 超 美的 网游 最好 是 韩国 的 剑侠情缘 叁
现在 百度 帐号 是 不能 拿 邮箱 注册 了 么 ? 只能 拿 手机号 了 么 ? 如果 可以 应该 怎么 拿 邮箱 注册 ? 谢谢 ! 先 用 手机 注册 , 然后 绑定 一个 邮箱 , 再@@ 解 绑 手机 即可
咱们 出去 转 会儿 遛@@ 弯@@ 儿 去 呗 我 在 工@@ 体 的 漫 咖啡 , 要 不要 来 坐 会儿
我 知道 最近 做 什么 准备 演唱会 的 事 吧
```

3. I want to use the tokenizer to load the vocab, tokenize my sentences, and feed them to the T5 model.

I load the model like this (config: https://drive.google.com/file/d/1WOb-gqjkt1m6GBTFeq4wOWS3dW3Qt1oK/view?usp=sharing):

```python
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config.from_json_file(config_file)
model = T5ForConditionalGeneration(config)
```

I load the tokenizer like this:

```python
from tokenizers.models import WordLevel
from transformers import PreTrainedTokenizerFast

vocab = WordLevel.from_file("vocab.json", "")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=vocab)
fast_tokenizer.encode("你 觉得 大人 辛苦 还是 学生 辛苦 都 很 辛苦")
```

Then I get this error: `AttributeError: 'tokenizers.models.WordLevel' object has no attribute 'truncation'`
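A likely fix, sketched below (not from this thread): `PreTrainedTokenizerFast` expects a full `tokenizers.Tokenizer`, not a bare `WordLevel` model, which is why the `truncation` attribute is missing. The `<unk>`/`<pad>`/`</s>` token names here are assumptions about what vocab.json actually contains:

```python
# Sketch: wrap the WordLevel model in a Tokenizer before handing it to transformers.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from transformers import PreTrainedTokenizerFast

word_level = WordLevel.from_file("vocab.json", unk_token="<unk>")  # assumed unk token
tok = Tokenizer(word_level)
tok.pre_tokenizer = WhitespaceSplit()  # input is already space-separated BPE tokens

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    unk_token="<unk>",   # assumed
    pad_token="<pad>",   # assumed
    eos_token="</s>",    # assumed
)
print(fast_tokenizer.encode("你 觉得 大人 辛苦 还是 学生 辛苦 都 很 辛苦"))
```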

So I want to load the vocab into a tokenizer and use it like this:

```python
source = tokenizer.batch_encode_plus(
    [source_text],
    max_length=75,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
```

and return:

```python
{
    "source_ids": source_ids.to(dtype=torch.long),
    "source_mask": source_mask.to(dtype=torch.long),
    "target_ids": target_ids.to(dtype=torch.long),
    "target_ids_y": target_ids.to(dtype=torch.long),
}
```

Then I want to give the tokenizer output to the model and train it like a translation task. How should I do this?
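A minimal training-step sketch along those lines, assuming the `fast_tokenizer` fix above and one parallel source/target pair (the config path, learning rate, and max length are placeholders):

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config.from_json_file("config.json")  # placeholder path
model = T5ForConditionalGeneration(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Tokenize one source/target pair, padded and truncated to a fixed length.
src = fast_tokenizer(["你 觉得 大人 辛苦 还是 学生 辛苦"], max_length=75,
                     truncation=True, padding="max_length", return_tensors="pt")
tgt = fast_tokenizer(["都 很 辛苦"], max_length=75,
                     truncation=True, padding="max_length", return_tensors="pt")

# T5 ignores label positions set to -100, so mask out the padding.
labels = tgt.input_ids.clone()
labels[labels == fast_tokenizer.pad_token_id] = -100

# The model shifts the labels internally to build decoder inputs and returns the loss.
outputs = model(input_ids=src.input_ids,
                attention_mask=src.attention_mask,
                labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```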
