
May I use my own vocabulary? #89

Open

520jefferson opened this issue Jun 8, 2022 · 2 comments

Comments

@520jefferson

I want to train T5 from scratch using my own vocabulary.

I can load the model like this:

```python
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config.from_json_file(config_file)
model = T5ForConditionalGeneration(config)
```

The vocabulary looks like the example below (the first few keys appear to be special tokens whose angle-bracket names did not render here). It seems the tokenizer cannot load this vocab. How should I load it into a proper tokenizer?
```json
{
    "": 0,
    "": 1,
    "": 2,
    "": 3,
    "": 4,
    ",": 5,
    "的": 6,
    "?": 7,
    "了": 8,
    ...
    "<s_181>": 33786,
    "<s_182>": 33787,
    "<s_183>": 33788,
    "<s_184>": 33789,
    "<s_185>": 33790,
    "<s_186>": 33791,
    "<s_187>": 33792,
    "<s_188>": 33793,
    "<s_189>": 33794
}
```

@mwitiderrick
Contributor

mwitiderrick commented Jul 18, 2022

Hello @520jefferson, there are examples at https://github.com/google/sentencepiece showing how to create custom vocabularies for T5. Once the custom vocab has been created as an .spm model file, you can load it with Hugging Face transformers as a tokenizer.
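For reference, a minimal sketch of that workflow, assuming a one-sentence-per-line `corpus.txt` (the file name, the `my_t5` prefix, and the vocab size are placeholders, not values from this thread):

```python
# Sketch: train a unigram sentencepiece model and load it as a T5 tokenizer.
# Assumes `pip install sentencepiece transformers` and a corpus.txt on disk.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder training corpus, one sentence per line
    model_prefix="my_t5",      # writes my_t5.model and my_t5.vocab
    vocab_size=32000,          # placeholder size
    model_type="unigram",      # T5 tokenizers are unigram sentencepiece models
    pad_id=0, eos_id=1, unk_id=2, bos_id=-1,  # match T5's special-token layout
)

from transformers import T5Tokenizer

tokenizer = T5Tokenizer("my_t5.model")  # vocab_file is the first positional argument
print(tokenizer.tokenize("你 觉得 大人 辛苦 还是 学生 辛苦"))
```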

Please let us know exactly what you want to achieve, and we can provide further guidance.

@520jefferson
Author

@mwitiderrick

Thanks for the reply. The code and vocab are as follows.

1. The vocab.txt (vocab.json is manually constructed from vocab.txt) and merges.txt are uploaded to Google Drive:
vocab.txt:https://drive.google.com/file/d/10jC8L_-RDLRv5QkAato8nJWGU1UQQcz1/view?usp=sharing
vocab.json:https://drive.google.com/file/d/1e5Ll0bAHhikhnYV5XaW3NB8aTSWdCvnC/view?usp=sharing
merges.txt:https://drive.google.com/file/d/1ifXlQaYod_kobqgNe82tmHTtHpxBYnBq/view?usp=sharing
2. The sentences for training, validation, and test look like this (after BPE, with tokens split by spaces):
```
你 觉得 大人 辛苦 还是 学生 辛苦 都 很 辛苦
头条 文章 没 啥 违规 , 却 被 小@@ 浪@@ 浪 屏蔽 了 , 而且 删 了 先生 的 转发 评价 , 农历 新年 将 至 , 俺 不想 发火 , 行 , 俺 再 发 一遍 ! 怎么 删 了 , 还 没 看 呢
专辑 有 签名 么 ? ! … 没有 机会 去 签@@ 售@@ 会 啦 幸好 里面 的 容 和 小 卡片 有 签名
你 帮 我 买 东西 吗 你 给钱 我 , 当然 帮 你 买 耶
你 说 那个 早晨 喝 那个 水有 什么 好处 可以 提高 睡眠 质量 养成 良好 的 睡眠 时间 和 习惯 慢慢 养成 早睡早起 的 习惯 , 习@@ 惯@@ 成@@ 自然
求个 风景 超 美的 网游 最好 是 韩国 的 剑侠情缘 叁
现在 百度 帐号 是 不能 拿 邮箱 注册 了 么 ? 只能 拿 手机号 了 么 ? 如果 可以 应该 怎么 拿 邮箱 注册 ? 谢谢 ! 先 用 手机 注册 , 然后 绑定 一个 邮箱 , 再@@ 解 绑 手机 即可
咱们 出去 转 会儿 遛@@ 弯@@ 儿 去 呗 我 在 工@@ 体 的 漫 咖啡 , 要 不要 来 坐 会儿
我 知道 最近 做 什么 准备 演唱会 的 事 吧
```

3. I want to use the tokenizer to load the vocab, tokenize my sentences, and feed them to the T5 model.

I load the model like this (config: https://drive.google.com/file/d/1WOb-gqjkt1m6GBTFeq4wOWS3dW3Qt1oK/view?usp=sharing):

```python
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config.from_json_file(config_file)
model = T5ForConditionalGeneration(config)
```

I load the tokenizer like this:

```python
from tokenizers.models import WordLevel
from transformers import PreTrainedTokenizerFast

vocab = WordLevel.from_file("vocab.json", "")
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=vocab)
fast_tokenizer.encode("你 觉得 大人 辛苦 还是 学生 辛苦 都 很 辛苦")
```

Then I get this error: `AttributeError: 'tokenizers.models.WordLevel' object has no attribute 'truncation'`
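A likely fix, sketched below (not from this thread): `PreTrainedTokenizerFast` expects a full `tokenizers.Tokenizer`, not a bare `WordLevel` model, which is why the `truncation` attribute is missing. The `<unk>`/`<pad>`/`</s>` token names here are assumptions about what vocab.json actually contains:

```python
# Sketch: wrap the WordLevel model in a Tokenizer before handing it to transformers.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from transformers import PreTrainedTokenizerFast

word_level = WordLevel.from_file("vocab.json", unk_token="<unk>")  # assumed unk token
tok = Tokenizer(word_level)
tok.pre_tokenizer = WhitespaceSplit()  # input is already space-separated BPE tokens

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    unk_token="<unk>",   # assumed
    pad_token="<pad>",   # assumed
    eos_token="</s>",    # assumed
)
print(fast_tokenizer.encode("你 觉得 大人 辛苦 还是 学生 辛苦 都 很 辛苦"))
```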

So I want to load the vocab into a tokenizer and use it like this:

```python
source = tokenizer.batch_encode_plus(
    [source_text],
    max_length=75,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
```

and return:

```python
{
    "source_ids": source_ids.to(dtype=torch.long),
    "source_mask": source_mask.to(dtype=torch.long),
    "target_ids": target_ids.to(dtype=torch.long),
    "target_ids_y": target_ids.to(dtype=torch.long),
}
```

Then I want to give the tokenizer output to the model and train it like a translation task. How should I do this?
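A minimal training-step sketch along those lines, assuming the `fast_tokenizer` fix above and one parallel source/target pair (the config path, learning rate, and max length are placeholders):

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config.from_json_file("config.json")  # placeholder path
model = T5ForConditionalGeneration(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Tokenize one source/target pair, padded and truncated to a fixed length.
src = fast_tokenizer(["你 觉得 大人 辛苦 还是 学生 辛苦"], max_length=75,
                     truncation=True, padding="max_length", return_tensors="pt")
tgt = fast_tokenizer(["都 很 辛苦"], max_length=75,
                     truncation=True, padding="max_length", return_tensors="pt")

# T5 ignores label positions set to -100, so mask out the padding.
labels = tgt.input_ids.clone()
labels[labels == fast_tokenizer.pad_token_id] = -100

# The model shifts the labels internally to build decoder inputs and returns the loss.
outputs = model(input_ids=src.input_ids,
                attention_mask=src.attention_mask,
                labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```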
