A neural network for text-to-image generation is composed of two sub-networks: a text encoder and a generator network. Training a text-to-image generator therefore proceeds in two steps (sketched below):
- The image encoder and text encoder are jointly pretrained on image-caption pairs, projecting images and text into a common embedding space.
- After the text encoder is pretrained, the generator network is adversarially trained to generate realistic images conditioned on the text features.
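For intuition, here is a minimal PyTorch sketch of the two-stage setup. The module names, dimensions, and the simple GRU/linear layers are illustrative placeholders, not the actual architectures used in this repo:

```python
import torch
import torch.nn as nn

# Stage 1 (sketch): a text encoder mapping captions into the joint
# image-text embedding space learned during pretraining.
class TextEncoder(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        _, h = self.rnn(self.embed(tokens))
        return h[-1]                           # (batch, embed_dim) sentence feature

# Stage 2 (sketch): a conditional generator turning noise plus the
# sentence feature into an image; trained adversarially against a
# discriminator (see the training-step sketch further below).
class Generator(nn.Module):
    def __init__(self, z_dim=100, embed_dim=256, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.fc = nn.Linear(z_dim + embed_dim, 3 * img_size * img_size)

    def forward(self, z, sent_emb):
        x = torch.tanh(self.fc(torch.cat([z, sent_emb], dim=1)))
        return x.view(-1, 3, self.img_size, self.img_size)
```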
Recent research proposed pretraining the text encoder with a DAMSM loss plus a contrastive loss and then training DM-GAN on top of it, reaching state-of-the-art results.
In this work, we replace the RNN-based text encoder and CNN-based image encoder with CLIP, a pretrained multimodal vision-language model based on the Transformer architecture.
CLIP is a multimodal encoder for images and natural language, pretrained with a contrastive loss at a very large batch size (32,768).
See the [paper](https://arxiv.org/abs/2103.00020) and the [official PyTorch implementation](https://github.com/openai/CLIP) of CLIP.
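Extracting text features with the official `clip` package looks roughly like this (a minimal sketch; the ViT-B/32 variant and the example caption are assumptions, not necessarily what this repo uses):

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # variant is an assumption

captions = ["this bird has a red crown and a white belly"]
tokens = clip.tokenize(captions).to(device)
with torch.no_grad():
    text_features = model.encode_text(tokens)             # (1, 512) for ViT-B/32
# CLIP features are typically L2-normalized before computing similarities.
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
```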
Download the preprocessed datasets from [AttnGAN](https://github.com/taoxugit/AttnGAN); alternatively, they are also available from [DM-GAN](https://github.com/MinfengZhu/DM-GAN).
- Fine-tune the pretrained CLIP encoder (the contrastive term is sketched after these commands)
  - With CUB-200-2011, using DAMSM + contrastive loss:
    ```
    $ python pretrain_DAMSM.py --cfg cfg/DAMSM/bird.yml --gpu 0
    ```
  - With COCO 2014, using DAMSM + contrastive loss:
    ```
    $ python pretrain_DAMSM.py --cfg cfg/DAMSM/coco.yml --gpu 0
    ```
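The contrastive term added on top of DAMSM is, in spirit, the symmetric image-text InfoNCE loss that CLIP itself is trained with. A minimal sketch (the function name and temperature value are assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch (sketch).

    img_emb, txt_emb: (batch, dim) L2-normalized embeddings; the i-th
    image and i-th caption are a matching pair, so the targets are the
    diagonal of the similarity matrix.
    """
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```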
- Train DM-GAN (a simplified training step is sketched after these commands)
  - With CUB-200-2011:
    ```
    $ python main.py --cfg cfg/clip_bird_DMGAN.yml --gpu 0
    ```
  - With COCO 2014:
    ```
    $ python main.py --cfg cfg/clip_coco_DMGAN.yml --gpu 0
    ```
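As a reminder of what the adversarial stage optimizes, here is a heavily simplified sketch of one conditional-GAN training step. DM-GAN's actual objective adds multi-stage generators, a memory module, DAMSM, and conditioning-augmentation terms; the names, the discriminator signature `D(img, sent_emb)`, and the plain BCE losses below are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real_imgs, sent_emb, opt_g, opt_d, z_dim=100):
    batch = real_imgs.size(0)
    device = real_imgs.device
    ones = torch.ones(batch, 1, device=device)
    zeros = torch.zeros(batch, 1, device=device)
    z = torch.randn(batch, z_dim, device=device)

    # Discriminator: real (image, text) pairs -> 1, generated pairs -> 0.
    fake_imgs = G(z, sent_emb).detach()
    d_loss = (F.binary_cross_entropy_with_logits(D(real_imgs, sent_emb), ones) +
              F.binary_cross_entropy_with_logits(D(fake_imgs, sent_emb), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: make generated pairs look real to the discriminator.
    fake_imgs = G(z, sent_emb)
    g_loss = F.binary_cross_entropy_with_logits(D(fake_imgs, sent_emb), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```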
- Generate fake images and compute R-precision (the metric is sketched after these commands)
  - CUB-200-2011:
    ```
    $ python main.py --cfg cfg/eval_clip_bird.yml
    ```
  - COCO 2014:
    ```
    $ python main.py --cfg cfg/eval_clip_coco.yml
    ```
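R-precision asks: given a generated image, does its embedding rank the ground-truth caption above a set of random distractor captions? A minimal sketch of one query, assuming the common protocol of 99 distractors and L2-normalized embeddings (candidate count and function name are assumptions):

```python
import torch

def r_precision_hit(img_emb, true_txt_emb, distractor_txt_emb):
    """One R-precision query (R = 1), sketch.

    img_emb: (1, dim); true_txt_emb: (1, dim); distractor_txt_emb: (99, dim).
    Returns 1 if the ground-truth caption is the most similar of the
    100 candidates, else 0; average the hits over many queries.
    """
    candidates = torch.cat([true_txt_emb, distractor_txt_emb], dim=0)  # (100, dim)
    sims = (img_emb @ candidates.t()).squeeze(0)                       # (100,)
    return int(sims.argmax().item() == 0)
```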
- Compute FID (Fréchet Inception Distance); the core formula is sketched after these commands
  - CUB-200-2011:
    ```
    $ python fid_score.py --data bird --dims 2048 --batch_size 32
    ```
  - COCO 2014:
    ```
    $ python fid_score.py --data coco --dims 2048 --batch_size 32
    ```
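FID fits a Gaussian to the 2048-dimensional Inception-v3 pool features of the real and generated image sets and measures the Fréchet distance between the two fits; `--dims 2048` selects that feature layer. The core computation, sketched with NumPy/SciPy (function name is illustrative):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).

    (mu, sigma) are the mean and covariance of Inception pool features
    for the real and generated image sets.
    """
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):           # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```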
- Compute Inception Score (see the sketch after these commands)
  - CUB-200-2011:
    ```
    $ python inception_score.py --data bird
    ```
  - COCO 2014:
    ```
    $ python inception_score.py --data coco
    ```
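The Inception Score is exp(E_x[KL(p(y|x) || p(y))]), computed from Inception-v3 class posteriors on the generated images and usually averaged over splits. A minimal NumPy sketch (the 10-split convention is the common default, assumed here):

```python
import numpy as np

def inception_score(probs, splits=10):
    """probs: (N, num_classes) softmax outputs of Inception-v3 on fakes."""
    scores = []
    for chunk in np.array_split(probs, splits):
        p_y = chunk.mean(axis=0, keepdims=True)  # marginal class distribution
        kl = (chunk * (np.log(chunk + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))
```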