Simple script to compute CLIP scores for a trained DALL-E model, using OpenAI's CLIP (https://github.com/openai/CLIP). The CLIP score measures the compatibility between an image and a caption. The raw value is a cosine similarity, so it lies between -1 and 1. In CLIP, the value is scaled by 100 by default, giving a number between -100 and 100, where 100 means maximum compatibility between the image and the text. As mentioned in https://arxiv.org/abs/2104.14806, the score is rarely negative, but we clamp it anyway so that it lies between 0 and 100. Typical values are around 20-30.
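The scaling and clamping described above can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the code used by this script: the function name clip_score_from_embeddings and its arguments are hypothetical, and in practice the embeddings would come from CLIP's image and text encoders.

```python
import math


def clip_score_from_embeddings(image_embedding, text_embedding, scale=100.0):
    """Cosine similarity between two embeddings, scaled and clamped to [0, scale].

    Hypothetical helper for illustration only; both inputs are assumed to be
    non-zero vectors of equal length.
    """
    dot = sum(a * b for a, b in zip(image_embedding, text_embedding))
    norm_image = math.sqrt(sum(a * a for a in image_embedding))
    norm_text = math.sqrt(sum(b * b for b in text_embedding))
    cosine = dot / (norm_image * norm_text)  # raw value in [-1, 1]
    # Scale by 100, then clamp negative values to 0 (and guard the upper
    # bound against floating-point rounding).
    return min(max(scale * cosine, 0.0), scale)


if __name__ == "__main__":
    print(clip_score_from_embeddings([1.0, 0.0], [1.0, 1.0]))   # partially aligned
    print(clip_score_from_embeddings([1.0, 0.0], [-1.0, 0.0]))  # opposite, clamped to 0
```

Clamping only affects the rare negative similarities, so typical scores in the 20-30 range are left unchanged.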
- Install CLIP from https://github.com/openai/CLIP
- Install the lucidrains DALL-E implementation from https://github.com/lucidrains/DALLE-pytorch
python setup.py install
Here is an example:
clip_score --dalle_path dalle.pt --image_text_folder CUB_200_2011 --taming --num_generate 1 --dump
where:
- dalle_path: the path of the model trained with DALL-E using https://github.com/lucidrains/DALLE-pytorch
- image_text_folder: the folder of the dataset, following the format of https://github.com/lucidrains/DALLE-pytorch/loader.py
- taming: specifies that taming transformers are used as the image encoder
- num_generate: number of images to generate per caption
- dump: save all the generated images, along with their respective metrics, in the folder outputs (by default)
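To make the option types explicit, here is a hypothetical argparse parser that mirrors the flags documented in this README. It is a sketch under assumptions, not the actual command-line definition of clip_score: the flag names and the default of 25 for --clip_thresh come from this README, while the help strings, the required flags, and the default of 1 for --num_generate are assumed.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags (not the tool's own code)."""
    parser = argparse.ArgumentParser(prog="clip_score")
    parser.add_argument("--dalle_path", type=str, required=True,
                        help="path of the trained DALL-E model")
    parser.add_argument("--image_text_folder", type=str, required=True,
                        help="dataset folder in the DALLE-pytorch loader format")
    parser.add_argument("--taming", action="store_true",
                        help="use taming transformers as the image encoder")
    # The default of 1 is an assumption; the README does not state it.
    parser.add_argument("--num_generate", type=int, default=1,
                        help="number of images to generate per caption")
    parser.add_argument("--dump", action="store_true",
                        help="save generated images and their metrics")
    parser.add_argument("--clip_thresh", type=float, default=25.0,
                        help="threshold used by CLIP_atleast")
    return parser


# Parse the example command from this README.
args = build_parser().parse_args([
    "--dalle_path", "dalle.pt",
    "--image_text_folder", "CUB_200_2011",
    "--taming",
    "--num_generate", "1",
    "--dump",
])
print(args)
```

Here --taming and --dump are modelled as boolean switches because the example command passes them without a value.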
Example output:
CLIP_score_real 30.1826171875
CLIP_score 26.7392578125
CLIP_score_top1 26.7392578125
CLIP_score_relative 0.8892822265625
CLIP_score_relative_top1 0.8892822265625
CLIP_atleast 0.7466491460800171
Note that all the metrics are also saved to clip_score.json by default.
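The saved metrics can be loaded back for comparison across runs. The sketch below assumes that clip_score.json is a flat JSON object mapping each metric name to a number, as in the example output; the actual file layout is not documented here, and load_metrics and format_metrics are hypothetical helper names.

```python
import json


def load_metrics(path="clip_score.json"):
    """Load the metrics file, assuming a flat {metric_name: value} JSON object."""
    with open(path) as f:
        return json.load(f)


def format_metrics(metrics):
    """Render the metrics one per line, sorted by name, with four decimals."""
    return "\n".join(f"{name} {value:.4f}" for name, value in sorted(metrics.items()))
```

For example, print(format_metrics(load_metrics())) would print the metrics from the most recent run, provided the file exists in the current directory.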
- CLIP_score_real: average CLIP score of the real images.
- CLIP_score: average CLIP score over all generated images.
- CLIP_score_top1: for each caption, retain the generated image with the best CLIP score, then average these scores as in CLIP_score.
- CLIP_score_relative: similar to https://arxiv.org/abs/2104.14806, we divide the CLIP score of the generated image by the CLIP score of the real image, then average. The value is generally between 0 and 1, although it can exceed 1, which means the CLIP score of the generated image is higher than that of the real image.
- CLIP_score_relative_top1: same as CLIP_score_relative, but using the top CLIP score as in CLIP_score_top1.
- CLIP_atleast: for each caption, it is 1 if the CLIP score reaches at least --clip_thresh (by default 25) and 0 otherwise; we then average over all captions. This gives a number between 0 and 1.
For all scores, higher is better.
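The metric definitions above can be sketched as a small aggregation over per-caption scores. This is an illustration under assumptions, not the script's implementation: aggregate_clip_metrics is a hypothetical name, the inputs are assumed to be already-clamped CLIP scores, and CLIP_atleast is interpreted as "at least one generated image for the caption reaches the threshold".

```python
from statistics import mean


def aggregate_clip_metrics(real_scores, generated_scores, clip_thresh=25.0):
    """Aggregate per-caption CLIP scores into the six metrics described above.

    real_scores: one CLIP score per caption (real image vs. caption).
    generated_scores: one list per caption with the scores of its generated images.
    Assumes every real score is strictly positive (it is used as a divisor).
    """
    # Best generated score for each caption.
    top1 = [max(scores) for scores in generated_scores]
    # All generated scores, flattened across captions.
    flat = [s for scores in generated_scores for s in scores]
    # Each generated score divided by the real score of the same caption.
    relative = [s / real
                for real, scores in zip(real_scores, generated_scores)
                for s in scores]
    relative_top1 = [best / real for best, real in zip(top1, real_scores)]
    # 1 if the best generated image reaches the threshold, else 0 (assumed reading).
    atleast = [1.0 if best >= clip_thresh else 0.0 for best in top1]
    return {
        "CLIP_score_real": mean(real_scores),
        "CLIP_score": mean(flat),
        "CLIP_score_top1": mean(top1),
        "CLIP_score_relative": mean(relative),
        "CLIP_score_relative_top1": mean(relative_top1),
        "CLIP_atleast": mean(atleast),
    }


if __name__ == "__main__":
    # Two captions, two generated images each (made-up scores for illustration).
    print(aggregate_clip_metrics([30.0, 20.0], [[24.0, 27.0], [10.0, 16.0]]))
```

With a single generated image per caption (--num_generate 1), the top1 variants coincide with the plain averages, which matches the example output above.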