
[PaddlePaddle Hackathon] Task 55 submission #1133

Merged
45 commits merged into PaddlePaddle:develop on Dec 10, 2021

Conversation

@nosaydomore (Contributor) commented Oct 10, 2021

Task: #1075

PR types

New features

PR changes

Models

Description

Task: #1075

  1. Added three new classes to the PaddleNLP Roberta model code: RobertaForMultipleChoice, RobertaForMaskedLM, and RobertaForCausalLM (code + docstrings), plus unit-test files for the project.
  2. Added pretrained weights for 7 models: roberta-large, roberta-base, deepset/roberta-base-squad2, uer/roberta-base-finetuned-chinanews-chinese, sshleifer/tiny-distilroberta-base, uer/roberta-base-finetuned-cluener2020-chinese, and uer/roberta-base-chinese-extractive-qa. The weight-conversion script (convert.py) is included in the project.
    An AiStudio project was set up to make weight-alignment verification easy: run the notebook at https://aistudio.baidu.com/aistudio/projectdetail/2453823 to verify.

Download link (AiStudio) for the 7 converted model weights: https://aistudio.baidu.com/aistudio/datasetdetail/111650/0

  3. Since roberta-base, roberta-large, deepset/roberta-base-squad2, etc. use a BPE tokenizer and the repository did not yet provide one for Roberta, a new RobertaBPETokenizer was added; the tokens it returns are aligned with huggingface's (see the usage sketch below).
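A minimal usage sketch of the additions above, assuming the new classes and RobertaBPETokenizer are importable from paddlenlp.transformers as proposed in this PR; the pretrained name 'roberta-base' is taken from the docstring example further down and may differ from the name actually registered in PaddleNLP (e.g. roberta-en-base):

# Hedged sketch, not code from this PR: class and pretrained names are assumed.
import paddle
from paddlenlp.transformers import RobertaBPETokenizer, RobertaForMaskedLM

tokenizer = RobertaBPETokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.eval()

# Tokens returned by RobertaBPETokenizer should match huggingface's tokenization.
encoded = tokenizer("Paddle is a deep learning framework.")
input_ids = paddle.to_tensor([encoded["input_ids"]])

with paddle.no_grad():
    prediction_scores = model(input_ids)  # [batch_size, seq_len, vocab_size]
print(prediction_scores.shape)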

CLAassistant commented Oct 10, 2021

CLA assistant check
All committers have signed the CLA.

@nosaydomore changed the title from Task 55 to [PaddlePaddle hackathon] Task 55 submission on Oct 10, 2021
@nosaydomore changed the title from [PaddlePaddle hackathon] Task 55 submission to [PaddlePaddle Hackathon] Task 55 submission on Oct 11, 2021
@yingyibiao self-assigned this Oct 13, 2021
@nosaydomore (Contributor Author)

For adding new weights, please refer to https://paddlenlp.readthedocs.io/zh/latest/community/contribute_models/contribute_awesome_pretrained_models.html

The weight file link info in file.json and the weights README have been updated.
@yingyibiao

@yingyibiao (Contributor)

For adding new weights, please refer to https://paddlenlp.readthedocs.io/zh/latest/community/contribute_models/contribute_awesome_pretrained_models.html

The weight file link info in file.json and the weights README have been updated. @yingyibiao

OK

@nosaydomore (Contributor Author)

Updated.


def forward(self, input_ids, token_type_ids=None, position_ids=None):
    if position_ids is None:
        # maybe need use shape op to unify static graph and dynamic graph
        ones = paddle.ones_like(input_ids, dtype="int64")
        seq_length = paddle.cumsum(ones, axis=-1)
        position_ids = seq_length - ones
        cls_token_id = input_ids[0][0]
Contributor

Is Roberta's position_id related to the cls_token_id?

Contributor Author

The cls_token_id here is used to support the different position_id schemes.
Task 55 includes English models (whose position_ids follow the original paper), while the Roberta in the original repository only handled the Chinese roberta position_id encoding (identical to BERT's).
To support both encodings, among the arguments passed into this function only the cls_token_id taken from input_ids (101 for BERT-style checkpoints, 0 for original-paper roberta) can be used to decide whether the position_ids should follow the BERT scheme or the original RoBERTa scheme. @joey12300
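A hedged sketch of the decision logic described above (not the exact PR code); it assumes pad_token_id is 1 and that the original-paper (fairseq-style) RoBERTa numbers positions starting from pad_token_id + 1, while the BERT-style encoding starts from 0:

import paddle

def make_position_ids(input_ids, pad_token_id=1):
    # BERT-style positions: 0, 1, 2, ...
    ones = paddle.ones_like(input_ids, dtype="int64")
    seq_length = paddle.cumsum(ones, axis=-1)
    position_ids = seq_length - ones
    # The first token is the only available signal: 101 for BERT-style
    # checkpoints, 0 for original-paper RoBERTa checkpoints.
    cls_token_id = int(input_ids[0][0])
    if cls_token_id == 0:
        # original RoBERTa: shift positions past the padding index
        position_ids = position_ids + pad_token_id + 1
    return position_ids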

Comment on lines 84 to 89
"roberta-base-ft-chinanews-chn":
"https://huggingface.co/uer/roberta-base-finetuned-chinanews-chinese/resolve/main/vocab.txt",
"roberta-base-ft-cluener2020-chn":
"https://huggingface.co/uer/roberta-base-finetuned-cluener2020-chinese/resolve/main/vocab.txt",
"roberta-base-chn-extractive-qa":
"https://huggingface.co/uer/roberta-base-chinese-extractive-qa/resolve/main/vocab.txt",
Contributor

These links need to be updated.

Comment on lines 391 to 408
roberta_en_base_vocab_link = "https://huggingface.co/roberta-base/resolve/main/vocab.json"
roberta_en_base_merges_link = "https://huggingface.co/roberta-base/resolve/main/merges.txt"
pretrained_resource_files_map = {
    "vocab_file": {
        "roberta-en-base": roberta_en_base_vocab_link,
        "roberta-en-large": roberta_en_base_vocab_link,
        "roberta-base-squad2": roberta_en_base_vocab_link,
        "tiny-distilroberta-base":
            "https://huggingface.co/sshleifer/tiny-distilroberta-base/resolve/main/vocab.json"
    },
    "merges_file": {
        "roberta-en-base": roberta_en_base_merges_link,
        "roberta-en-large": roberta_en_base_merges_link,
        "roberta-base-squad2": roberta_en_base_merges_link,
        "tiny-distilroberta-base":
            "https://huggingface.co/sshleifer/tiny-distilroberta-base/resolve/main/merges.txt"
    }
}
Contributor

These links need to be updated.

@yingyibiao (Contributor)


The weights have been uploaded to BOS.

pad_token_id=0):
pad_token_id=0,
layer_norm_eps=1e-12
):  # eps is 1e-5 for roberta-base/large and 1e-12 for wwm-ext; exposed in the config so alignment can be adjusted
Contributor

Please remove the Chinese comment.

layer._epsilon = 1e-12
elif isinstance(
        layer, nn.LayerNorm
):  # eps is 1e-5 for roberta-base/large and 1e-12 for wwm-ext; exposed in the config so alignment can be adjusted
Contributor

Please remove the Chinese comment.
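For reference, a hedged sketch of how the configurable epsilon discussed in this thread might be threaded from the constructor into weight initialization (the attribute name layer_norm_eps is taken from the diff above; the actual wiring in the PR may differ):

import paddle.nn as nn

class RobertaSketch(nn.Layer):
    # Illustrative only: store the configured epsilon so init_weights does not
    # hard-code 1e-12 (roberta-base/large use 1e-5, chinese wwm-ext uses 1e-12).
    def __init__(self, layer_norm_eps=1e-12):
        super().__init__()
        self.layer_norm_eps = layer_norm_eps

    def init_weights(self, layer):
        if isinstance(layer, nn.LayerNorm):
            layer._epsilon = self.layer_norm_eps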

Comment on lines +585 to +587
class RobertaForMultipleChoice(RobertaPretrainedModel):
    def __init__(self, roberta):
        super().__init__()
Contributor

Please add docstrings to the RobertaForMultipleChoice class.

Comment on lines 628 to 630
class RobertaForMaskedLM(RobertaPretrainedModel):
    def __init__(self, roberta):
        super().__init__()
Contributor

Please add docstrings to the RobertaForMaskedLM class.

Comment on lines 685 to 687
class RobertaForCausalLM(RobertaPretrainedModel):
    def __init__(self, roberta):
        super().__init__()
Contributor

Please add docstrings to the RobertaForCausalLM class.

Comment on lines 709 to 718
r"""
Example::
>>> from paddlenlp.transformers import RobertaTokenizer, RobertaForCausalLM, RobertaConfig
>>> import paddle
>>> tokenizer = RobertaBPETokenizer.from_pretrained('roberta-base')
>>> model = RobertaForCausalLM.from_pretrained('roberta-base', config=config)
>>> inputs = tokenizer("Hello, my dog is cute")['input_ids']
>>> inputs = paddle.to_tensor(inputs)
>>> outputs = model(inputs)
"""
Contributor

The example format needs to be made consistent with the other classes.

@nosaydomore (Contributor Author)

done

yingyibiao previously approved these changes Dec 10, 2021

@yingyibiao (Contributor) left a comment

LGTM

@yingyibiao merged commit 41ab265 into PaddlePaddle:develop on Dec 10, 2021