
[PaddlePaddle Hackathon] Task 55 submission #1133

Merged
45 commits merged into PaddlePaddle:develop on Dec 10, 2021

Conversation

@nosaydomore (Contributor) commented Oct 10, 2021

Task: #1075

PR types

New features

PR changes

Models

Description

Task: #1075

  1. Added three new classes to the PaddleNLP Roberta model code: RobertaForMultipleChoice, RobertaForMaskedLM, and RobertaForCausalLM (code + docstrings), plus unit-test files for the project.
  2. Added pretrained weights for 7 models: roberta-large, roberta-base, deepset/roberta-base-squad2, uer/roberta-base-finetuned-chinanews-chinese, sshleifer/tiny-distilroberta-base, uer/roberta-base-finetuned-cluener2020-chinese, and uer/roberta-base-chinese-extractive-qa. The weight-conversion script (convert.py) is included in the project.
    An AiStudio project was set up to make weight-alignment verification easy: run the notebook at https://aistudio.baidu.com/aistudio/projectdetail/2453823 to verify.

Download link (AiStudio) for the 7 converted model weights: https://aistudio.baidu.com/aistudio/datasetdetail/111650/0

  3. Since roberta-base, roberta-large, deepset/roberta-base-squad2, etc. use a BPE tokenizer and the repository did not yet provide one for Roberta, a new RobertaBPETokenizer was added; the tokens it returns are aligned with huggingface's (see the usage sketch below).
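A minimal usage sketch of the additions above, assuming the new classes and RobertaBPETokenizer are importable from paddlenlp.transformers as proposed in this PR; the pretrained name 'roberta-base' is taken from the docstring example further down and may differ from the name actually registered in PaddleNLP (e.g. roberta-en-base):

# Hedged sketch, not code from this PR: class and pretrained names are assumed.
import paddle
from paddlenlp.transformers import RobertaBPETokenizer, RobertaForMaskedLM

tokenizer = RobertaBPETokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.eval()

# Tokens returned by RobertaBPETokenizer should match huggingface's tokenization.
encoded = tokenizer("Paddle is a deep learning framework.")
input_ids = paddle.to_tensor([encoded["input_ids"]])

with paddle.no_grad():
    prediction_scores = model(input_ids)  # [batch_size, seq_len, vocab_size]
print(prediction_scores.shape)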

CLAassistant commented Oct 10, 2021

CLA assistant check
All committers have signed the CLA.

@nosaydomore changed the title from Task 55 to [PaddlePaddle hackathon] Task 55 submission on Oct 10, 2021
@nosaydomore changed the title from [PaddlePaddle hackathon] Task 55 submission to [PaddlePaddle Hackathon] Task 55 submission on Oct 11, 2021
@yingyibiao self-assigned this Oct 13, 2021
@nosaydomore (Contributor Author)

For adding new weights, please refer to https://paddlenlp.readthedocs.io/zh/latest/community/contribute_models/contribute_awesome_pretrained_models.html

The weight file link info in file.json and the weights README have been updated.
@yingyibiao

@yingyibiao (Contributor)

For adding new weights, please refer to https://paddlenlp.readthedocs.io/zh/latest/community/contribute_models/contribute_awesome_pretrained_models.html

The weight file link info in file.json and the weights README have been updated. @yingyibiao

OK

@nosaydomore (Contributor Author)

Updated.


def forward(self, input_ids, token_type_ids=None, position_ids=None):
    if position_ids is None:
        # maybe need use shape op to unify static graph and dynamic graph
        ones = paddle.ones_like(input_ids, dtype="int64")
        seq_length = paddle.cumsum(ones, axis=-1)
        position_ids = seq_length - ones
        cls_token_id = input_ids[0][0]
Contributor

Is Roberta's position_id related to the cls_token_id?

Contributor Author

The cls_token_id here is used to support the different position_id schemes.
Task 55 includes English models (whose position_ids follow the original paper), while the Roberta in the original repository only handled the Chinese roberta position_id encoding (identical to BERT's).
To support both encodings, among the arguments passed into this function only the cls_token_id taken from input_ids (101 for BERT-style checkpoints, 0 for original-paper roberta) can be used to decide whether the position_ids should follow the BERT scheme or the original RoBERTa scheme. @joey12300
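A hedged sketch of the decision logic described above (not the exact PR code); it assumes pad_token_id is 1 and that the original-paper (fairseq-style) RoBERTa numbers positions starting from pad_token_id + 1, while the BERT-style encoding starts from 0:

import paddle

def make_position_ids(input_ids, pad_token_id=1):
    # BERT-style positions: 0, 1, 2, ...
    ones = paddle.ones_like(input_ids, dtype="int64")
    seq_length = paddle.cumsum(ones, axis=-1)
    position_ids = seq_length - ones
    # The first token is the only available signal: 101 for BERT-style
    # checkpoints, 0 for original-paper RoBERTa checkpoints.
    cls_token_id = int(input_ids[0][0])
    if cls_token_id == 0:
        # original RoBERTa: shift positions past the padding index
        position_ids = position_ids + pad_token_id + 1
    return position_ids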

Comment on lines 84 to 89
"roberta-base-ft-chinanews-chn":
"https://huggingface.co/uer/roberta-base-finetuned-chinanews-chinese/resolve/main/vocab.txt",
"roberta-base-ft-cluener2020-chn":
"https://huggingface.co/uer/roberta-base-finetuned-cluener2020-chinese/resolve/main/vocab.txt",
"roberta-base-chn-extractive-qa":
"https://huggingface.co/uer/roberta-base-chinese-extractive-qa/resolve/main/vocab.txt",
Contributor

These links need to be updated.

Comment on lines 391 to 408
roberta_en_base_vocab_link = "https://huggingface.co/roberta-base/resolve/main/vocab.json"
roberta_en_base_merges_link = "https://huggingface.co/roberta-base/resolve/main/merges.txt"
pretrained_resource_files_map = {
    "vocab_file": {
        "roberta-en-base": roberta_en_base_vocab_link,
        "roberta-en-large": roberta_en_base_vocab_link,
        "roberta-base-squad2": roberta_en_base_vocab_link,
        "tiny-distilroberta-base":
            "https://huggingface.co/sshleifer/tiny-distilroberta-base/resolve/main/vocab.json"
    },
    "merges_file": {
        "roberta-en-base": roberta_en_base_merges_link,
        "roberta-en-large": roberta_en_base_merges_link,
        "roberta-base-squad2": roberta_en_base_merges_link,
        "tiny-distilroberta-base":
            "https://huggingface.co/sshleifer/tiny-distilroberta-base/resolve/main/merges.txt"
    }
}
Contributor

These links need to be updated.

@yingyibiao (Contributor)


The weights have been uploaded to BOS.

pad_token_id=0):
pad_token_id=0,
layer_norm_eps=1e-12
):  # eps is 1e-5 for roberta-base/large and 1e-12 for wwm-ext; exposed in the config so alignment can be adjusted
Contributor

Please remove the Chinese comment.

layer._epsilon = 1e-12
elif isinstance(
        layer, nn.LayerNorm
):  # eps is 1e-5 for roberta-base/large and 1e-12 for wwm-ext; exposed in the config so alignment can be adjusted
Contributor

Please remove the Chinese comment.
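For reference, a hedged sketch of how the configurable epsilon discussed in this thread might be threaded from the constructor into weight initialization (the attribute name layer_norm_eps is taken from the diff above; the actual wiring in the PR may differ):

import paddle.nn as nn

class RobertaSketch(nn.Layer):
    # Illustrative only: store the configured epsilon so init_weights does not
    # hard-code 1e-12 (roberta-base/large use 1e-5, chinese wwm-ext uses 1e-12).
    def __init__(self, layer_norm_eps=1e-12):
        super().__init__()
        self.layer_norm_eps = layer_norm_eps

    def init_weights(self, layer):
        if isinstance(layer, nn.LayerNorm):
            layer._epsilon = self.layer_norm_eps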

Comment on lines +585 to +587
class RobertaForMultipleChoice(RobertaPretrainedModel):
    def __init__(self, roberta):
        super().__init__()
Contributor

Please add docstrings to the RobertaForMultipleChoice class.

Comment on lines 628 to 630
class RobertaForMaskedLM(RobertaPretrainedModel):
    def __init__(self, roberta):
        super().__init__()
Contributor

Please add docstrings to the RobertaForMaskedLM class.

Comment on lines 685 to 687
class RobertaForCausalLM(RobertaPretrainedModel):
    def __init__(self, roberta):
        super().__init__()
Contributor

Please add docstrings to the RobertaForCausalLM class.

Comment on lines 709 to 718
r"""
Example::
>>> from paddlenlp.transformers import RobertaTokenizer, RobertaForCausalLM, RobertaConfig
>>> import paddle
>>> tokenizer = RobertaBPETokenizer.from_pretrained('roberta-base')
>>> model = RobertaForCausalLM.from_pretrained('roberta-base', config=config)
>>> inputs = tokenizer("Hello, my dog is cute")['input_ids']
>>> inputs = paddle.to_tensor(inputs)
>>> outputs = model(inputs)
"""
Contributor

The example format needs to be made consistent with the other classes.

@nosaydomore (Contributor Author)

done

yingyibiao previously approved these changes Dec 10, 2021

@yingyibiao (Contributor) left a comment

LGTM

@yingyibiao merged commit 41ab265 into PaddlePaddle:develop on Dec 10, 2021