Added CLIP module and redesigned tokenizer apis #81

Merged
merged 82 commits into from
Aug 29, 2022
Changes from all commits
82 commits
83fceae
saved workd
Anhforth Jun 29, 2022
f113f7c
saved workd
Anhforth Jun 29, 2022
708ce15
saved work on 6.29
Anhforth Jun 30, 2022
0d1b079
transformed tokenizer: progressing
Anhforth Jul 1, 2022
2763b5d
Opt 30b (#16)
920232796 Jul 1, 2022
3e52907
fix bert tokenizer issue (#18)
Anhforth Jul 1, 2022
deb2612
reconstruct the tokenizer structure
ZhaodongYan1 Jul 3, 2022
c2c6e9d
tested the new tokenizer
Anhforth Jul 4, 2022
fc2b5d8
removed some redundant codes and added sp model
Anhforth Jul 4, 2022
7da1757
updated the tokenizer
ZhaodongYan1 Jul 4, 2022
7c8c0b1
saved work
Anhforth Jul 5, 2022
3a0c8cb
Opt 66b (#19)
920232796 Jul 6, 2022
265d35a
saved work on 7.6
Anhforth Jul 6, 2022
4f8d715
updated release version
Anhforth Jul 6, 2022
efc1310
fix tokenizer issue
Anhforth Jul 6, 2022
59531e7
temp save
Anhforth Jul 6, 2022
3b6c16a
tokenizer test passed
Anhforth Jul 6, 2022
a7ff8f3
fixed some errors
Anhforth Jul 7, 2022
f4ff1a8
test of tokenizer transform
Anhforth Jul 7, 2022
811d9e9
fixed conflicts
Anhforth Jul 7, 2022
1406d89
fixed error
Anhforth Jul 7, 2022
b30eefa
add encode_plus
Anhforth Jul 8, 2022
9b81869
fix bug multi_gpu_training
920232796 Jul 8, 2022
7ad38a0
Merge pull request #21 from baai-open-internal/fix_multi_gpu_training
Anhforth Jul 8, 2022
72ffd6a
changed the version
Anhforth Jul 8, 2022
e6f89a6
fix_validation_bug (#24)
920232796 Jul 11, 2022
29ea850
updated the version
Anhforth Jul 11, 2022
4c68936
updated
Anhforth Jul 15, 2022
4834f23
modified encoder_plus
Anhforth Jul 15, 2022
8d44329
add vit and examples
920232796 Jul 15, 2022
81c438d
vit and examples
920232796 Jul 15, 2022
da24628
Update base_model.py
marscrazy Jul 15, 2022
aff728b
Update vit.py
marscrazy Jul 15, 2022
e5a0ddb
modify readme.md
920232796 Jul 15, 2022
fe56b8b
modify readme.md
920232796 Jul 15, 2022
fc6c32e
delete annotating code
920232796 Jul 15, 2022
cd45e5c
Vit xzh (#25)
920232796 Jul 15, 2022
5448084
updated
Anhforth Jul 17, 2022
eb555fc
updated
Anhforth Jul 17, 2022
9649aa4
performing tests on examples
Anhforth Jul 17, 2022
67c1288
finished example testing
Anhforth Jul 18, 2022
faee281
Merge branch 'develop' into vit_xzh
BAAI-OpenPlatform Jul 19, 2022
06f0b69
Merge pull request #28 from baai-open-internal/vit_xzh
BAAI-OpenPlatform Jul 19, 2022
deaa120
Merge pull request #27 from baai-open-internal/develop
marscrazy Jul 20, 2022
9558a47
env trainer
920232796 Jul 20, 2022
c35d4b6
Merge pull request #29 from baai-open-internal/env_args
marscrazy Jul 20, 2022
437caa4
vit-checkpoint-activations
920232796 Jul 21, 2022
dc6fc3d
vit-checkpoint-activations
920232796 Jul 21, 2022
c1cec9f
Merge pull request #33 from baai-open-internal/vit-checkpointing-acti…
marscrazy Jul 21, 2022
d74cf92
update
jongjyh Jul 25, 2022
044bc80
Merge pull request #34 from baai-open-internal/fix_eval_loss
marscrazy Jul 25, 2022
d85f8af
merged the master
Anhforth Jul 26, 2022
1b5ecc6
inference and train
wchh-2000 Jul 29, 2022
1fe6d3e
fix bug bert model
xuanricheng Aug 5, 2022
0c243d6
add autoloader and example training data
wchh-2000 Aug 15, 2022
2c28a7d
updated seq2seq
shunxing1234 Aug 16, 2022
e03247e
update
wchh-2000 Aug 16, 2022
4a4b003
Merge pull request #52 from baai-open-internal/add_clip
marscrazy Aug 17, 2022
ce5fd31
Merge branch 'master' into transform_tokenizer
Anhforth Aug 18, 2022
8353cd3
Update train.py
marscrazy Aug 18, 2022
5d5e135
Delete tst_superglue.py
marscrazy Aug 18, 2022
4c6ba56
updated according to comments
BAAI-OpenPlatform Aug 19, 2022
6076287
Merge pull request #50 from baai-open-internal/bert_model
BAAI-OpenPlatform Aug 19, 2022
c11e232
merged the clip tokenizer
BAAI-OpenPlatform Aug 22, 2022
6e135ef
merged clip tokenizer
BAAI-OpenPlatform Aug 23, 2022
fd06e4d
Update inference_clip.py
marscrazy Aug 25, 2022
b61b708
Update auto_loader.py
marscrazy Aug 25, 2022
25b659b
Update glm_10b_en_tokenizer.py
marscrazy Aug 25, 2022
8cffa38
Merge pull request #20 from baai-open-internal/transform_tokenizer
marscrazy Aug 25, 2022
9117f78
swinv1v2
920232796 Aug 25, 2022
f3186d9
Merge pull request #58 from baai-open-internal/swinv1v2_checkpoint_ac…
marscrazy Aug 25, 2022
4bd211d
updated the version
Anhforth Aug 25, 2022
6ef4190
updated the requirement packages list
Anhforth Aug 25, 2022
036e337
fixed some issues
BAAI-OpenPlatform Aug 26, 2022
edfd518
fixed some issues
BAAI-OpenPlatform Aug 26, 2022
497d709
tried to fix the data directory not found error
BAAI-OpenPlatform Aug 26, 2022
1ac43c0
fixed issues in running glm_seq2seq
BAAI-OpenPlatform Aug 26, 2022
351fba7
Update test_glm_seq2seq.py
marscrazy Aug 26, 2022
35b5d9a
Merge pull request #59 from baai-open-internal/fix_issues
marscrazy Aug 26, 2022
d71ee8d
merged upstream
Anhforth Aug 26, 2022
e3836aa
Update setup.py
marscrazy Aug 26, 2022
619398b
Merge branch 'develop' of github.com:FlagAI-Open/FlagAI into develop
Anhforth Aug 26, 2022
2 changes: 1 addition & 1 deletion doc_zh/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md
@@ -118,7 +118,7 @@ class GLMTitleGenerationCollateFN():
```python
train_src, train_tgt = read_file()
print('-----------train data length:', len(train_src))
my_collate_fn = GLMTitleGenerationCollateFN(pad_id=tokenizer.get_command('pad').Id)
my_collate_fn = GLMTitleGenerationCollateFN(pad_id=tokenizer.get_command_id('pad'))
train_dataset = GLMTitleGenerationDataset(train_src,
train_tgt)
```
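For reference, a minimal sketch of the redesigned tokenizer API that these documentation changes rely on. Both calls appear in the diffs in this PR; the exact set of available command tokens is an assumption here.

```python
from flagai.data.tokenizer import Tokenizer

# A single entry point replaces the per-model tokenizer classes
# (e.g. GLMLargeChTokenizer) used before this PR.
tokenizer = Tokenizer.from_pretrained("GLM-large-ch")

# Command tokens are looked up by id directly,
# instead of the old tokenizer.get_command('pad').Id pattern.
pad_id = tokenizer.get_command_id('pad')
eos_id = tokenizer.get_command_id('eos')
```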
2 changes: 1 addition & 1 deletion doc_zh/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md
@@ -131,7 +131,7 @@ class GLMPoetryDynamicCollateFN():
```python
train_src, train_tgt = read_file()
print('-----------train data length:', len(train_src))
my_collate_fn = GLMPoetryDynamicCollateFN(pad_id=tokenizer.get_command('pad').Id)
my_collate_fn = GLMPoetryDynamicCollateFN(pad_id=tokenizer.get_command_id('pad'))
train_dataset = GLMPoetryDataset(train_src,
train_tgt)
```
2 changes: 1 addition & 1 deletion docs/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md
@@ -119,7 +119,7 @@ class GLMTitleGenerationCollateFN():
```python
train_src, train_tgt = read_file()
print('-----------train data length:', len(train_src))
my_collate_fn = GLMTitleGenerationCollateFN(pad_id=tokenizer.get_command('pad').Id)
my_collate_fn = GLMTitleGenerationCollateFN(pad_id=tokenizer.get_command_id('pad'))
train_dataset = GLMTitleGenerationDataset(train_src,
train_tgt)
```
2 changes: 1 addition & 1 deletion docs/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md
@@ -122,7 +122,7 @@ class GLMPoetryDynamicCollateFN():
```python
train_src, train_tgt = read_file()
print('-----------train data length:', len(train_src))
my_collate_fn = GLMPoetryDynamicCollateFN(pad_id=tokenizer.get_command('pad').Id)
my_collate_fn = GLMPoetryDynamicCollateFN(pad_id=tokenizer.get_command_id('pad'))
train_dataset = GLMPoetryDataset(train_src,
train_tgt)
```
2 changes: 1 addition & 1 deletion examples/bert_title_generation_english/generate.py
@@ -14,7 +14,7 @@
maxlen = 512
auto_loader = AutoLoader(
"seq2seq",
model_name="bert-base-uncased",
model_name="BERT-base-en",
model_dir=model_dir,
)
model = auto_loader.get_model()
Binary file added examples/clip/CLIP.png
Binary file added examples/clip/data/img/0.jpg
Binary file added examples/clip/data/img/1.jpg
3 changes: 3 additions & 0 deletions examples/clip/data/pairs.csv
@@ -0,0 +1,3 @@
title filepath
a very typical bus station 0.jpg
the jetty : different types of plants to establish a variety of ecosystems . 1.jpg
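For orientation, a small sketch (not part of this PR) of how the sample `pairs.csv` could be inspected; it assumes the file is tab-separated with `title` and `filepath` columns, which is what the two rows above suggest.

```python
import pandas as pd

# Assumed tab-separated; CsvDataset in flagai consumes this file directly.
df = pd.read_csv("examples/clip/data/pairs.csv", sep="\t")
print(df.columns.tolist())                       # expected: ['title', 'filepath']
print(df.iloc[0]["title"], df.iloc[0]["filepath"])
```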
48 changes: 48 additions & 0 deletions examples/clip/deepspeed.json
@@ -0,0 +1,48 @@
{
"train_micro_batch_size_per_gpu": 64,
"gradient_accumulation_steps": 1,
"steps_per_print": 100,
"gradient_clipping": 1.0,
"zero_optimization": {
"stage": 2,
"contiguous_gradients": false,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e7,
"allgather_bucket_size": 5e7,
"cpu_offload": true
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 1e-5,
"warmup_num_steps": 2000
}
},
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "Adam",
"params": {
"lr": 1e-5,
"weight_decay": 0.1,
"betas": [
0.9,
0.98
],
"eps": 1e-6
}
},
"activation_checkpointing": {
"partition_activations": true,
"contiguous_memory_optimization": false
},
"wall_clock_breakdown": false
}
1 change: 1 addition & 0 deletions examples/clip/hostfile
@@ -0,0 +1 @@
127.0.0.1 slots=2
30 changes: 30 additions & 0 deletions examples/clip/inference_clip.py
@@ -0,0 +1,30 @@
import torch
from PIL import Image
from flagai.auto_model.auto_loader import AutoLoader
from flagai.data.dataset.mm.clip_dataset import clip_transform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = AutoLoader(task_name="txt_img_matching", #contrastive learning
model_name="clip-base-p32-224")

model = loader.get_model()
model.eval()
model.to(device)
tokenizer = loader.get_tokenizer()
transform = clip_transform(img_size=model.image_size)

def inference():
image = Image.open("./CLIP.png")
image = transform(image).unsqueeze(0).to(device)
text = tokenizer.tokenize_as_tensor(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
image_features = model.encode_image(image)
text_features = model.encode_text(text)
text_probs = (image_features @ text_features.T).softmax(dim=-1)

print(text_probs.cpu().numpy()[0].tolist())

if __name__=="__main__":
inference()
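As a follow-up sketch (not part of this PR): CLIP similarities are often computed on L2-normalized features so the softmax compares cosine similarities, a step the script above skips. The sketch assumes only the `encode_image`/`encode_text` methods already used above.

```python
import torch

def normalized_probs(model, image, text):
    # Same encoders as above, but with L2-normalized embeddings and the
    # conventional CLIP inference-time logit scale of 100.
    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(text)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (100.0 * img @ txt.T).softmax(dim=-1)
```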
36 changes: 36 additions & 0 deletions examples/clip/train_clip.py
@@ -0,0 +1,36 @@
import torch
from flagai.data.dataset.mm.clip_dataset import CsvDataset, clip_transform, collate_fn
from flagai.trainer import Trainer
from flagai.auto_model.auto_loader import AutoLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# cd examples/clip
data_path = "./data/pairs.csv"
img_dir = "./data/img"

trainer = Trainer(env_type="pytorch",
epochs=5,
pytorch_device=device,
batch_size=64,
lr=1e-4,
log_interval=10,
)

loader = AutoLoader(task_name="txt_img_matching",#contrastive learning
model_name="clip-base-p32-224",
)
model = loader.get_model()
tokenizer = loader.get_tokenizer()

transform = clip_transform(img_size=model.image_size)
train_dataset = CsvDataset(data_path,
img_dir,
transform,
tokenizer)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
trainer.train(model,
optimizer=optimizer,
train_dataset=train_dataset,
collate_fn=collate_fn)
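To make the objective behind this training script concrete, here is a minimal sketch of the standard CLIP contrastive loss. It assumes only the `encode_image`/`encode_text` methods shown in `inference_clip.py`; the loss actually computed inside FlagAI's model and Trainer may differ.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(model, images, texts, temperature=0.07):
    # L2-normalize both embedding sets, then build a (batch, batch)
    # similarity matrix; matching image/text pairs sit on the diagonal.
    img = F.normalize(model.encode_image(images), dim=-1)
    txt = F.normalize(model.encode_text(texts), dim=-1)
    logits = img @ txt.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over image->text and text->image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```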

48 changes: 48 additions & 0 deletions examples/clip/train_clip_deepspeed.py
@@ -0,0 +1,48 @@
import torch
from flagai.data.dataset.mm.clip_dataset import CsvDataset, clip_transform, collate_fn
from flagai.trainer import Trainer
from flagai.auto_model.auto_loader import AutoLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# cd examples/clip
data_path = "./data/pairs.csv"#"/mnt/datasets/multimodal/ConceptualCaptions/Train_GCC-training_output.csv"
img_dir = "./data/img"#"/mnt/datasets/multimodal/ConceptualCaptions"

trainer = Trainer(
env_type="deepspeed",
experiment_name="clip",
batch_size=64,
num_gpus=2,
fp16=True,
gradient_accumulation_steps=1,
lr=1e-4,
weight_decay=1e-5,
epochs=5,
log_interval=1,
load_dir=None,
pytorch_device=device,
save_dir="clip_deepspeed",
save_interval=1000,
num_checkpoints=1,
hostfile="./deepspeed/hostfile",
training_script=__file__,
deepspeed_config="./deepspeed.json"
)
loader = AutoLoader(task_name="txt_img_matching",#contrastive learning
model_name="clip-base-p32-224",
)
model = loader.get_model()
tokenizer = loader.get_tokenizer()

transform = clip_transform(img_size=model.image_size)
train_dataset = CsvDataset(data_path,
img_dir,
transform,
tokenizer)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
trainer.train(model,
optimizer=optimizer,
train_dataset=train_dataset,
collate_fn=collate_fn)

7 changes: 4 additions & 3 deletions examples/glm_blank_filling/glm_generate_samples.py
@@ -5,16 +5,17 @@
import torch

from flagai.model.glm_model import GLMModel
from flagai.data.tokenizer import GLMLargeChTokenizer
from flagai.data.tokenizer import Tokenizer
from flagai.model.predictor.predictor import Predictor
if __name__ == "__main__":
"""Main training program."""
print('Generate Samples')
# Random seeds for reproducability.
# Model,
model = GLMModel.from_pretrain(model_name='GLM-large-ch',
model_name = 'GLM-large-ch'
model = GLMModel.from_pretrain(model_name=model_name,
download_path="./state_dict/")
tokenizer = GLMLargeChTokenizer()
tokenizer = Tokenizer.from_pretrained(model_name)

model.cuda(torch.cuda.current_device())

2 changes: 1 addition & 1 deletion examples/glm_poetry_generation/train.py
@@ -130,7 +130,7 @@ def __call__(self, batch):
train_src, train_tgt = read_file()
print('-----------train data length:', len(train_src))
my_collate_fn = GLMPoetryDynamicCollateFN(
pad_id=tokenizer.get_command('pad').Id)
pad_id=tokenizer.get_command_id('pad'))
train_dataset = BertSeq2seqDataset(train_src, train_tgt)

trainer.train(model, train_dataset=train_dataset, collate_fn=my_collate_fn)
13 changes: 5 additions & 8 deletions examples/glm_pretrain/train.py
@@ -2,7 +2,7 @@
#
# Licensed under the Apache License, Version 2.0 (the "License")

from flagai.data.tokenizer import GLMLargeChTokenizer
from flagai.data.tokenizer import Tokenizer
from flagai.model.glm_model import GLMForSeq2Seq
from flagai.trainer import Trainer
from flagai.data.dataset import ConstructBlockStrategy
@@ -24,14 +24,11 @@
load_dir=None,
lr=1e-4,
save_interval=10)

model = GLMForSeq2Seq.from_pretrain(model_name='GLM-large-ch')

model_name = 'GLM-large-ch'
tokenizer = Tokenizer.from_pretrained(model_name)
ds_args = PretrainDatasetArguments()

tokenizer = GLMLargeChTokenizer()

ds_args = add_args(ds_args, tokenizer)
model = GLMForSeq2Seq.from_pretrain(model_name=model_name)

def create_dataset(tokenizer, should_split):
dataset = get_dataset_lazy("./examples/glm_pretrain/data",
@@ -59,7 +56,7 @@ def create_dataset(tokenizer, should_split):
collate_fn = None
if ds_args.block_lm:
collate_fn = ConstructBlockStrategy(
tokenizer, 512, eod_token=tokenizer.get_command('eos').Id)
tokenizer, 512, eod_token=tokenizer.get_command_id('eos'))
metric_methods = DEFAULT_METRICS['pretrain']
trainer.train(model,
collate_fn=collate_fn,
6 changes: 3 additions & 3 deletions examples/glm_seq2seq/train.py
@@ -3,7 +3,7 @@
# Licensed under the Apache License, Version 2.0 (the "License")
from flagai.trainer import Trainer
from flagai.model.glm_model import GLMForSeq2Seq
from flagai.data.tokenizer import GLMLargeEnWordPieceTokenizer, GLMLargeChTokenizer
from flagai.data.tokenizer import Tokenizer
from flagai.data.dataset import Seq2SeqDataset
from flagai.test_utils import Seq2SeqCollateArguments
from flagai.data.dataset.superglue.control import DEFAULT_METRICS, CH_TASKS
@@ -27,12 +27,12 @@
print("downloading...")

if task_name in CH_TASKS:
tokenizer = GLMLargeChTokenizer()
model_name = 'GLM-large-ch'
else:
tokenizer = GLMLargeEnWordPieceTokenizer()
model_name = 'GLM-large-en'

tokenizer = Tokenizer.from_pretrained(model_name)

train_dataset = Seq2SeqDataset(task_name=task_name,
data_dir='./datasets/',
dataset_type='train',
5 changes: 3 additions & 2 deletions examples/glm_superglue/train_10b_clue.py
@@ -4,7 +4,7 @@
import os
from flagai.trainer import Trainer
from flagai.model.glm_model import GLMForSingleTokenCloze
from flagai.data.tokenizer import GLMLargeChTokenizer
from flagai.data.tokenizer import Tokenizer
from flagai.metrics import accuracy_metric
from flagai.data.dataset import SuperGlueDataset
from flagai.test_utils import CollateArguments
@@ -21,11 +21,12 @@
save_dir="./glm_superglue_en",
save_interval=1)

model_name = "GLM-large-ch"
model = GLMForSingleTokenCloze.from_pretrain(download_path="/mnt/test_10b_models",
model_name="GLM-large-ch")


tokenizer = GLMLargeChTokenizer()
tokenizer = Tokenizer.from_pretrained("GLM-large-ch")
train_dataset = SuperGlueDataset(task_name=task_name,
data_dir='./datasets/',
dataset_type='train',
8 changes: 4 additions & 4 deletions examples/glm_superglue/train_10b_superglue.py
@@ -3,7 +3,7 @@
# Licensed under the Apache License, Version 2.0 (the "License")
from flagai.trainer import Trainer
from flagai.model.glm_model import GLMForSingleTokenCloze
from flagai.data.tokenizer import GLM10bENBPETokenizer, GLMLargeEnWordPieceTokenizer
from flagai.data.tokenizer import Tokenizer
from flagai.metrics import accuracy_metric
from flagai.data.dataset import SuperGlueDataset
from flagai.test_utils import CollateArguments
@@ -28,11 +28,11 @@
# deepspeed_config='./deepspeed.json',
# training_script=__file__)

model_name = "GLM-large-en"
model = GLMForSingleTokenCloze.from_pretrain(download_path="/mnt/test_10b_models",
model_name="GLM-large-en")
model_name=model_name)

tokenizer = GLMLargeEnWordPieceTokenizer()

tokenizer = Tokenizer.from_pretrained(model_name)
train_dataset = SuperGlueDataset(task_name=task_name,
data_dir='./datasets/',
dataset_type='train',
12 changes: 4 additions & 8 deletions examples/glm_superglue/train_prefix.py
@@ -2,13 +2,12 @@
#
# Licensed under the Apache License, Version 2.0 (the "License")
from flagai.trainer import Trainer
from flagai.model.glm_model import GLMForSingleTokenCloze, GLMForMultiTokenCloze, GLMForMultiTokenClozeFast, GLMForSequenceClassification
from flagai.data.tokenizer import GLMLargeEnWordPieceTokenizer, GLMLargeChTokenizer
from flagai.model.glm_model import GLMForSequenceClassification
from flagai.data.tokenizer import Tokenizer

from flagai.data.dataset import SuperGlueDataset
from flagai.test_utils import CollateArguments
from flagai.data.dataset.superglue.control import DEFAULT_METRICS, MULTI_TOKEN_TASKS, CH_TASKS
import unittest
from flagai.data.dataset import ConstructSuperglueStrategy


@@ -32,13 +31,10 @@

if task_name in CH_TASKS:
model_name = 'GLM-large-ch'
tokenizer = GLMLargeChTokenizer(add_block_symbols=True,
add_task_mask=False,
add_decoder_mask=False,
fix_command_token=True)
add_block_symbols=True,
else:
model_name = 'GLM-large-en'
tokenizer = GLMLargeEnWordPieceTokenizer()
tokenizer = Tokenizer.from_pretrained(model_name)

model = GLMForSequenceClassification.from_pretrain(model_name=model_name, spell_length=2,
class_num=3, tune_prefix_layers=1)