Merge pull request #20 from BAAI-Open/fix_document_lg
fix bugs in readme
marscrazy authored May 28, 2022
2 parents 598af04 + 2a8a3a7 commit e8aa473
Showing 20 changed files with 389 additions and 299 deletions.
89 changes: 89 additions & 0 deletions CLA.md
@@ -0,0 +1,89 @@
# The Contributor License Agreement

The [Cloud Native Computing Foundation](https://www.cncf.io) (CNCF) defines
the legal status of the contributed code in two different types of _Contributor License Agreements_
(CLAs), [individual contributors](https://github.com/cncf/cla/blob/master/individual-cla.pdf) and [corporations](https://github.com/cncf/cla/blob/master/corporate-cla.pdf).

FlagAI can only accept original source code from CLA signatories.


It is important to read and understand this legal agreement.

## How do I sign?

After creating your first Pull Request, the linux-foundation-easycla bot will respond with information about your CLA status, along with a link to sign the CLA.

<img width="1065" alt="EasyCLA bot" src="https://user-images.githubusercontent.com/69111235/152226443-f6fe61ee-0e92-46c5-b6ea-c0deb718a585.png">

#### 1. Authorize EasyCLA to read some of your GitHub information

<img width="554" alt="GitHub EasyCLA Authorization" src="https://user-images.githubusercontent.com/69111235/152228712-7d22f9d0-9f3c-4226-9ee0-bacba4b47725.png">

Click on the "Please click here to be authorized" link to navigate to the GitHub Authorize Linux Foundation: EasyCLA page. Then click Authorize LF-Engineering to give the Linux Foundation read-only access to list the email addresses associated with your GitHub account.

#### 2. Select one of the two contributor types

<img width="1407" alt="EasyCLA" src="https://user-images.githubusercontent.com/69111235/152224818-1246453a-b086-4a57-9d14-c10d62ad438f.png">


After authorizing EasyCLA, you will be redirected to a page to identify which type of contributor you are.
Select the most appropriate option:
* Individual Contributor: You are contributing as yourself, and not as part of another organization.
* Corporate Contributor: You are contributing on behalf of your employer or other organization.

#### 3. Sign the CLA

Once you select the type of contributor, proceed to Sign the CLA and follow the instructions to complete the signing process through DocuSign.

**Ensure that your GitHub e-mail address matches the e-mail address used to sign the CLA.**

After you have filled out the information, click "Finish", and you will be redirected back to your Pull Request.

#### 4. Look for an email indicating successful signup.

> Hello,
>
> This is a notification email from EasyCLA regarding the project Cloud Native Computing Foundation (CNCF).
>
> The CLA has now been signed. You can download the signed CLA as a PDF here.
>
> If you need help or have questions about EasyCLA, you can read the documentation or reach out to us for support.
>
> Thanks,
> EasyCLA Support Team


#### 5. Validate your CLA

Once you are redirected back to your GitHub Pull Request, reply with a comment `/easycla` to update the CLA status of your PR.


## Changing your Affiliation

If you've changed employers and still contribute to FlagAI, your affiliation
needs to be updated. The Cloud Native Computing Foundation uses [gitdm](https://github.com/cncf/gitdm)
to track who is contributing and from where. Create a pull request on the [gitdm](https://github.com/cncf/gitdm)
repository with a change to the corresponding developer affiliation text file.
Your entry should look similar to this:

```
Jorge O. Castro*: jorge!heptio.com, jorge!ubuntu.com, jorge.castro!gmail.com
Heptio
Canonical until 2017-03-31
```

## Troubleshooting

If you encounter any problems signing the CLA and need further assistance, file a ticket via the [please submit a support request ticket](https://jira.linuxfoundation.org/plugins/servlet/theme/portal/4) link in the EasyCLA bot's response. Someone from the CNCF will respond to your ticket to help.

Should you have any issues using the [Linux Foundation Support Site], send a message to the
backup e-mail support address <login-issues@jira.linuxfoundation.org>.

## Setting up the CNCF CLA check

If you are a FlagAI GitHub organization or repo owner and would like to set up
the Linux Foundation CNCF CLA check for your repositories, [read the docs on setting up the CNCF CLA check](/github-management/setting-up-cla-check.md).


[Linux Foundation Support Site]: https://support.linuxfoundation.org/
17 changes: 8 additions & 9 deletions CONTRIBUTING.md
@@ -8,6 +8,9 @@ side, please stick to the following process:
3. If we decide your concern needs code changes, we would be happy to accept a pull request. Please consider the
commit guidelines below.

## Sign the CLA

Before you can contribute to FlagAI, you will need to sign the [Contributor License Agreement](CLA.md).

## Git Commit Guidelines

@@ -34,17 +37,13 @@ pip install -r requirements.txt
```

### tests

To run all basic tests execute:
```bash
python test.py
```
To check the test results in
```
tests/test_report
```

Install `pytest` for testing:
```bash
pip install pytest
```
To run all basic tests execute:
```bash
pytest
```

### code formatting
5 changes: 1 addition & 4 deletions README.md
@@ -118,7 +118,7 @@ for text in test_data:
* [Blank_Filling_QA with GLM ](/docs/TUTORIAL_11_GLM_BLANK_FILLING_QA.md)
* [Title Generation with GLM ](/docs/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md)
* [Poetry generation with GLM-large-ch](docs/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md)
* [Using huggingface's t5-3b & tricks ](docs/TUTORIAL_14_HUGGINGFACE_T5.md)
* [Using huggingface's t5-11b & tricks ](docs/TUTORIAL_14_HUGGINGFACE_T5.md)
* [Title Generation with RoBerta-WWM](/docs/TUTORIAL_15_BERT_EXAMPLE_TITLE_GENERATION.md)
* [Semantic Matching with RoBerta-WWM](/docs/TUTORIAL_16_BERT_EXAMPLE_SEMANTIC_MATCHING.md)
* [NER with RoBerta-WWM](/docs/TUTORIAL_17_BERT_EXAMPLE_NER.md)
@@ -135,15 +135,12 @@ language models, sequence labeling models, and text classification models. Let u

## Tutorials
We provide a set of quick tutorials to get you started with the library:

[//]: # (* [Tutorial 1: Background: Transformers]&#40;docs/TUTORIAL_1_BASICS.md&#41;)
* [Tutorial 1: How to construct and use Tokenizer](/docs/TUTORIAL_1_TOKENIZER.md)
* [Tutorial 2: Dataset Preprocessing Pipeline](/docs/TUTORIAL_2_DATASET.md)
* [Tutorial 3: Major Function of Model Module](/docs/TUTORIAL_3_MODEL.md)
* [Tutorial 4: Customize trainer for model and data-parallel training](/docs/TUTORIAL_4_TRAINER.md)
* [Tutorial 5: Simplify model and tokenizer Initialization by Using Autoloader](/docs/TUTORIAL_5_INSTRUCTIONS_FOR_AutoLoader.md)
* [Tutorial 6: Use off-the-shelf inference Algorithms with Predictor](/docs/TUTORIAL_6_INSTRUCTIONS_FOR_PREDICTOR.md)

* [Tutorial 7: Use FlagAI prompt-learning tool-kit to improve performance on SuperGLUE](/docs/TUTORIAL_7_PROMPT_LERANING.md)
* [Tutorial 8: Setup environment for training models with multi-machine](/docs/TUTORIAL_8_ENVIRONMENT_SETUP.md)
* [Tutorial 9: Text generation with encoder/decoder/encoder-decoder models](/docs/TUTORIAL_9_SEQ2SEQ_METHOD.md)
2 changes: 1 addition & 1 deletion README_zh.md
@@ -184,7 +184,7 @@ for text_pair in test_data:
* [Blank-filling QA with GLM-large-ch](/doc_zh/TUTORIAL_11_GLM_BLANK_FILLING_QA.md)
* [Poetry generation with GLM-large-ch](doc_zh/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md)
* [Title generation with GLM-large-ch](doc_zh/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md)
* [Support for huggingface's t5-3b model, plus acceleration tricks](doc_zh/TUTORIAL_14_HUGGINGFACE_T5.md)
* [Support for huggingface's t5-11b model, plus acceleration tricks](doc_zh/TUTORIAL_14_HUGGINGFACE_T5.md)
* [Title generation with RoBERTa-base-ch](doc_zh/TUTORIAL_15_BERT_EXAMPLE_TITLE_GENERATION.md)
* [Semantic similarity matching with RoBERTa-base-ch](doc_zh/TUTORIAL_16_BERT_EXAMPLE_SEMANTIC_MATCHING.md)
* [NER with RoBERTa-base-ch](/doc_zh/TUTORIAL_17_BERT_EXAMPLE_NER.md)
28 changes: 19 additions & 9 deletions doc_zh/GLM.md
@@ -5,8 +5,13 @@
Currently there are several different pre-training architectures: autoencoding models that implement only an encoder (e.g. BERT), autoregressive models that implement only a decoder (e.g. GPT), and encoder-decoder models that implement both an encoder and a decoder (e.g. T5).

The [**GLM model**](https://arxiv.org/abs/2103.10360) differs slightly from these. It adopts an autoregressive blank-filling approach and achieves good results on all three major classes of NLP tasks (natural language understanding, unconditional generation, and conditional generation).
<div align=center><img src="img/glm_example_1.png" width="600px"></div>

| Framework       | NLU | Cond. Gen. | Uncond. Gen. |
|-----------------|-----|------------|--------------|
| Autoregressive  | -   | -          | ✓            |
| Autoencoding    | ✓   | ×          | ×            |
| Encoder-Decoder | -   | ✓          | -            |
| GLM             | ✓   | ✓          | ✓            |

The main features of GLM include:

- Task 1: Some spans of the text are masked (following the autoencoding approach). These spans are randomly shuffled and predicted in an autoregressive manner. The masked spans cover 15% of the original text (see the sketch below).
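Below is a minimal, self-contained sketch of this blank-infilling scheme. It is an illustration only, not FlagAI's actual data pipeline; the `[MASK]` and `[sop]` token names and the span-sampling logic are simplifying assumptions:

```python
import random

def make_blank_infilling_example(tokens, mask_ratio=0.15, seed=0):
    """Illustrative GLM-style blank infilling: mask ~15% of the tokens as
    spans, then emit the spans in shuffled order for autoregressive
    prediction (simplified; not FlagAI's actual implementation)."""
    rng = random.Random(seed)
    n = len(tokens)
    budget = max(1, int(n * mask_ratio))  # how many tokens to mask in total
    spans, used = [], set()
    for _ in range(100 * n):  # bounded sampling attempts, enough for a sketch
        if budget <= 0:
            break
        length = rng.randint(1, min(3, budget))
        start = rng.randrange(0, n - length + 1)
        if any(i in used for i in range(start, start + length)):
            continue  # overlaps an already-masked span; resample
        used.update(range(start, start + length))
        spans.append((start, start + length))
        budget -= length
    spans.sort()

    # Part A: the corrupted text, each masked span collapsed to one [MASK].
    part_a, prev = [], 0
    for s, e in spans:
        part_a += tokens[prev:s] + ["[MASK]"]
        prev = e
    part_a += tokens[prev:]

    # Part B: the masked spans in shuffled order, each opened with [sop];
    # the model predicts Part B autoregressively, conditioned on Part A.
    rng.shuffle(spans)
    part_b = []
    for s, e in spans:
        part_b += ["[sop]"] + tokens[s:e]
    return part_a, part_b

src = "the quick brown fox jumps over the lazy dog".split()
print(make_blank_infilling_example(src))
```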
@@ -18,18 +23,23 @@ The main features of GLM include:

## Performance of GLM

1. With multi-task pre-training, GLM-Doc and GLM-Sent perform slightly worse than GLM-Large, but still outperform BERT-Large and UniLM-Large.

### [SuperGLUE](https://super.gluebenchmark.com)
Results of single-model, single-task fine-tuning on the `dev` set; more results [here](https://github.com/THUDM/GLM).

2. Among the multi-task models, GLM-Sent outperforms GLM-Doc by 1.1% on average. Increasing GLM-Doc's parameters to 410M (1.25×BERT-Large) yields better performance than GLM-Large, and a GLM with 515M parameters (1.5×BERT-Large) performs better still.
| Model | COPA | WSC | RTE | WiC | CB | MultiRC | BoolQ | ReCoRD |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| GLM-10b-ch | 98.0 | 95.2 | 93.1 | 75.7 | 98.7/98.2 | 88.1/63.3 | 88.7 | 94.4/94.0 |
| [RoBERTa-Large](https://github.com/pytorch/fairseq/tree/master/examples/roberta) | 94.0 | 91.3 | 86.6 | 75.6 | 98.2/- | 85.7/- | 86.9 |89.5/89.0|
| [DeBERTa-XXLarge-v2](https://github.com/microsoft/DeBERTa/tree/master/experiments/superglue) | 97.0 | - | 93.5 | - | - | 87.8/63.6 | 88.3 | 94.1/93.7 |

<div align=center><img src="img/glm_results2.png"></div>
### [CLUE](https://www.cluebenchmarks.com)
Results of single-model, single-task fine-tuning on the CLUE datasets (evaluation is still in progress; some tasks are listed here). To use `GLM-10b-ch`, click [here](https://model.baai.ac.cn/model-detail/100001).

1. GLM-XXLarge achieves an average score of 79.297, with significant improvements on several tasks. On the 3 selected general + 2 business evaluation tasks, the average improvement is 2.47pp.

2. On the CLUE 1.0 tasks, excluding CMRC, the average improvement is 1.56pp, with clear gains on the C3 and OCNLI datasets (+9.9pp, +2.84pp).

<div align=center><img src="img/glm_performance.png"></div>
| Model          | AFQMC | TNEWS1.0 | IFLYTEK | OCNLI_50K | CSL | CMRC2018 | CHID1.0 | C3 1.0 |
|:--------------:|:------:|:--------:|:-------:|:---------:|:------:|:--------:|:-------:|:------:|
| RoBERTa XLarge | 75.835 | 68.75 | 62.654 | 82.333 | 83.433 | 80.5 | 86.57 | 77.03 |
| GLM-10b-ch | 75.42 | 69.94 | 62.15 | 85 | 86.17 | 70 | 87.009 | 88.335 |

## GLM pre-trained models supported by FlagAI
See [Tutorial 5: Build models quickly with the AutoLoader tool](/doc_zh/TUTORIAL_5_INSTRUCTIONS_FOR_AutoLoader.md)
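For orientation, here is a minimal sketch of what loading a GLM checkpoint through AutoLoader could look like. The task name `seq2seq` and the `model_dir` path are placeholder assumptions; see Tutorial 5 for the exact identifiers:

```python
from flagai.auto_model.auto_loader import AutoLoader

# Sketch: AutoLoader resolves the model class and tokenizer from names.
# "seq2seq" and "./checkpoints" are placeholder choices, not fixed values.
auto_loader = AutoLoader("seq2seq",
                         model_name="GLM-large-ch",
                         model_dir="./checkpoints")
model = auto_loader.get_model()
tokenizer = auto_loader.get_tokenizer()
```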
2 changes: 1 addition & 1 deletion doc_zh/TUTORIAL_11_GLM_BLANK_FILLING_QA.md
@@ -34,7 +34,7 @@ GLM fine-tunes on downstream tasks, redefining them as blank-filling generation
This objective targets long text generation. For example (Chinese sample output, kept verbatim): ``[CLS]问题:啤酒伤胃吗?回答:[gMASK]<|startofpiece|>谢邀。 我是啤酒爱好者,但是我不喝酒。 我以前也说过,喝酒伤身,啤酒伤胃,伤肠道。 现在我也知道了啤酒伤人的很多细节,我就不瞎几把的说,大家看图片就知道了。 <n><n>其实啤酒伤身这个说法只是表面而已。 <n><n>啤酒中含有少量的碳酸和酒精,碳酸和酒精是成酸性物质,而乙醇是脂溶性的,酒精在胃里能够被分解,生成乙醇和二氧化碳,在体内是水和二氧化碳,两种物质会迅速发生中和反应,结果导致人体出现头痛、呕吐、胸痛、浑身发热等现象,这就是所谓喝大了,喝多了。 <n><n> 啤酒的含糖量在15%左右,喝多了也是伤身的,啤酒含糖量较高的主要成分是水分,而水分的体积比酒精大,所以酒精进入人体,与水相遇,就会产生大量气体,二氧化碳、水、一氧化碳等刺激人体,造成人体大量出汗,使体内温度升高,``


For example, GLM completes the question-answering task in the form of autoregressive blank filling:
For example, we use `GLM-large-ch` to complete the question-answering task in the form of autoregressive blank filling. To use `GLM-10b-ch`, click [here](https://model.baai.ac.cn/model-detail/100001):
```python
import torch
from flagai.model.glm_model import GLMModel
2 changes: 1 addition & 1 deletion doc_zh/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md
@@ -1,7 +1,7 @@
# GLM example: title generation

## Background
The title generation task takes a passage of text as input; the model outputs a corresponding title.
The title generation task takes a passage of text as input; the model outputs a corresponding title. Here `GLM-large-ch` is used as the example; to use `GLM-10b-ch`, click [here](https://model.baai.ac.cn/model-detail/100001).

![](./img/bert_title_generation_model.png)
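Before the full walkthrough, a hedged sketch of what inference could look like with FlagAI's Predictor. The task name, the `predict_generate_beamsearch` call, and its parameters are assumptions based on other FlagAI examples, not this tutorial's verbatim code:

```python
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor

# Sketch: load GLM-large-ch for title generation and run beam search.
auto_loader = AutoLoader("title-generation", model_name="GLM-large-ch")
model = auto_loader.get_model()
tokenizer = auto_loader.get_tokenizer()

predictor = Predictor(model, tokenizer)
# Placeholder passage (the model is Chinese, so a Chinese input is expected).
text = "本文介绍了一种新的标题生成方法……"
print(predictor.predict_generate_beamsearch(text, beam_size=3,
                                            out_max_length=32))
```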

1 change: 1 addition & 0 deletions doc_zh/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md
@@ -25,6 +25,7 @@
cd ./examples/glm_poetry_generation
python ./train.py
```
Here `GLM-large-ch` is used as the example; to use `GLM-10b-ch`, click [here](https://model.baai.ac.cn/model-detail/100001).
### 1. Prepare training data
1) Define a file-reading function that reads the data from files and returns the src and tgt lists:
```python
22 changes: 11 additions & 11 deletions doc_zh/TUTORIAL_14_HUGGINGFACE_T5.md
@@ -33,21 +33,21 @@ trainer = MyTrainer(
batch_size=4,
eval_interval=10,
log_interval=10,
experiment_name='t5-3b',
experiment_name='t5-11b',
pytorch_device='cuda:0',
load_dir=None,
lr=1e-4,
fp16=False)

# using huggingface transformers to get tokenizer and models
model_name = 't5-3b'
model_name = 't5-11b'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

print("loading model & tokenizer is done!")
src_dir = 'train_inputs.txt'
tgt_dir = 'train_targets.txt'
model_dir = "./t5-3b" # 模型位置
model_dir = "./t5-11b" # 模型位置
maxlen = 1024


@@ -139,7 +139,7 @@ trainer.train(model,
```

## Tricks to speed up training
We probably cannot run t5-3b on a 32GB V100, so we need some tricks to reduce GPU memory usage.
We probably cannot run t5-11b on a 32GB V100, so we need some tricks to reduce GPU memory usage.
### Step 1: fp16
Convert the model parameters to `fp16`:
```python
@@ -149,23 +149,23 @@ trainer = MyTrainer(
batch_size=1,
eval_interval=10,
log_interval=10,
experiment_name='t5-3b',
experiment_name='t5-11b',
pytorch_device='cuda:0',
load_dir=None,
lr=1e-4,
fp16=True) # change to `True`
```
### Step 2: gradient checkpointing (recomputation)
Intermediate results are not saved during the forward pass. We can then run t5-3b with `batch size` = 1.
Now we can train/finetune a t5-3b with `gradient_accumulation_steps`.
Intermediate results are not saved during the forward pass. We can then run t5-11b with `batch size` = 1.
Now we can train/finetune a t5-11b with `gradient_accumulation_steps`.
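As a side note, activation checkpointing can also be enabled directly on the Hugging Face model. A minimal sketch, assuming a `transformers` version that exposes `gradient_checkpointing_enable()` (older releases used a `gradient_checkpointing` config flag instead):

```python
from transformers import T5ForConditionalGeneration

# Sketch: recompute activations during backward instead of storing them,
# trading extra compute for lower GPU memory usage.
model = T5ForConditionalGeneration.from_pretrained("t5-11b")
model.gradient_checkpointing_enable()
```

With checkpointing enabled on the model, the trainer configuration below then adds gradient accumulation: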
```python
trainer = MyTrainer(
env_type='pytorch',
epochs=1,
batch_size=1,
eval_interval=10,
log_interval=10,
experiment_name='t5-3b',
experiment_name='t5-11b',
pytorch_device='cuda:0',
load_dir=None,
lr=1e-4,
@@ -181,7 +181,7 @@ trainer = Trainer(
batch_size=1,
eval_interval=10,
log_interval=10,
experiment_name='t5-3b',
experiment_name='t5-11b',
load_dir=None,
lr=1e-4,
fp16=True
@@ -205,7 +205,7 @@ trainer = Trainer(
batch_size=1,
eval_interval=10,
log_interval=10,
experiment_name='t5-3b',
experiment_name='t5-11b',
load_dir=None,
lr=1e-4,
fp16=True
@@ -230,7 +230,7 @@ trainer = Trainer(
batch_size=1,
eval_interval=10,
log_interval=10,
experiment_name='t5-3b',
experiment_name='t5-11b',
load_dir=None,
lr=1e-4,
fp16=True