Adding extra domain-specific vocabulary #51

Open
srhouyu opened this issue Jun 29, 2020 · 1 comment

srhouyu commented Jun 29, 2020

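I have some domain-specific terms that I would like to add to the vocabulary. google_zh_vocab.txt has 100 empty slots, but that is nowhere near enough. What should I do if I want to add thousands or even tens of thousands of domain-specific terms?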

I saw in this answer that adding new words to the vocabulary is possible:
google-research/bert#9

(b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialized we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.

However, I don't really understand how to do this in practice.

So, could UER-py consider adding support for an additional vocabulary?

@zhezhaoa
Collaborator

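Hello,
You can use dynamic_vocab_adapter.py in the scripts folder.
This script compares the old vocabulary with the new one and modifies the embedding layer and the layer before the softmax accordingly, producing a new pre-trained model.
The vectors for the new words are randomly initialized.
We can then continue with incremental pre-training or fine-tuning on top of the new pre-trained model.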
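
To illustrate the idea, here is a minimal NumPy sketch of the remapping step. It is not the code of dynamic_vocab_adapter.py: the function name `adapt_embeddings`, the toy vocabularies, and the clipped-normal initialization (a rough stand-in for a truncated normal with stddev 0.02) are all assumptions made for this example. Rows for tokens that exist in the old vocabulary are copied over, and rows for new tokens are randomly initialized.

```python
import numpy as np


def adapt_embeddings(old_vocab, new_vocab, old_emb, stddev=0.02, seed=0):
    """Hypothetical helper: build a vocab-sized matrix for new_vocab.

    Rows of tokens already present in old_vocab are copied from old_emb;
    rows of unseen tokens get a small random initialization.
    """
    rng = np.random.default_rng(seed)
    old_index = {token: i for i, token in enumerate(old_vocab)}
    hidden_size = old_emb.shape[1]

    # Assumption: clip a normal sample at 2 * stddev to approximate a
    # truncated normal initializer.
    new_emb = rng.normal(0.0, stddev, size=(len(new_vocab), hidden_size))
    new_emb = np.clip(new_emb, -2 * stddev, 2 * stddev).astype(old_emb.dtype)

    # Reuse the pre-trained vector wherever the token was already known.
    for i, token in enumerate(new_vocab):
        j = old_index.get(token)
        if j is not None:
            new_emb[i] = old_emb[j]
    return new_emb


# Toy usage: two made-up domain terms appended to a four-token vocabulary.
old_vocab = ["[PAD]", "[UNK]", "的", "词"]
new_vocab = old_vocab + ["心肌梗死", "冠状动脉"]
old_emb = np.arange(16, dtype=np.float32).reshape(4, 4)

new_emb = adapt_embeddings(old_vocab, new_vocab, old_emb)
print(new_emb.shape)
```

The same row remapping would presumably apply to any other vocabulary-sized tensor, such as the weights and bias of the layer before the softmax, since those are also indexed by token id.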
