clip-benchmark model init #6

Closed
escorciav opened this issue Nov 7, 2024 · 5 comments


escorciav commented Nov 7, 2024

Hi guys!

Thanks for making your research accessible to the public & congrats on your CVPRW-2024 paper 🎉

Is this the boilerplate required to plug SynthCLIP into clip-benchmark, as mentioned in #5 or #2?

cp Training/models.py <clip-benchmark-dir>/clip_benchmark/models/synthclip.py

Append this function to that module:

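# Imports this function needs; CLIP_VITB16 is already defined in the copied models.py
import open_clip
from torchvision import transforms


def load_synthclip(pretrained: str = "./checkpoints/synthclip-30m/checkpoint_best.pt",
                   device="cpu", **kwargs):
    # NOTE: as written, `pretrained` is never read, so no checkpoint is loaded here
    model = CLIP_VITB16()
    # Taken from
    # https://github.com/hammoudhasan/SynthCLIP/blob/02ef69764d8dc921650bcac4a98bd0f477790787/Training/main.py#L240
    normalize = transforms.Normalize(
        mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
    )
    transform = transforms.Compose(
        [
            transforms.Resize(224),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            # Grayscale images give a 1-channel tensor; repeat it to 3 channels
            # so Normalize works. (Not sure why I need it, but whatever XD. - Victor)
            lambda x: x.repeat(3, 1, 1) if x.shape[0] == 1 else x,  # force RGB
            normalize,
        ]
    )
    model = model.to(device)
    # **kwargs absorbs any extra keyword arguments the caller passes
    tokenizer = open_clip.get_tokenizer("ViT-B-16")
    return model, transform, tokenizer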

then register it as mentioned here
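
As a rough illustration of what registering could look like, the sketch below assumes clip_benchmark picks a loader from the `--model_type` string through a dictionary of functions. The names `LOADERS`, `load_clip` and `load_synthclip_stub` are hypothetical stand-ins for this example, not the library's actual code.

```python
from typing import Any, Callable, Dict, Tuple


def load_synthclip_stub(pretrained: str, device: str = "cpu", **kwargs: Any) -> Tuple[str, str, str]:
    """Hypothetical stand-in for load_synthclip; returns placeholder objects."""
    return (f"model<{pretrained}@{device}>", "transform", "tokenizer")


# Hypothetical registry mapping a model_type string to its loader function.
LOADERS: Dict[str, Callable[..., Tuple[Any, Any, Any]]] = {
    "synthclip": load_synthclip_stub,
}


def load_clip(model_type: str, **kwargs: Any) -> Tuple[Any, Any, Any]:
    """Look up the loader for model_type and forward all keyword arguments."""
    if model_type not in LOADERS:
        raise ValueError(f"unknown model_type {model_type!r}; known: {sorted(LOADERS)}")
    return LOADERS[model_type](**kwargs)


# model_name is not used by the stub; its **kwargs simply absorbs it.
model, transform, tokenizer = load_clip(
    "synthclip", model_name="ViT-B-16", pretrained="checkpoint_best.pt", device="cpu"
)
```

The `**kwargs` in the loader signature is what lets a dispatcher pass arguments the loader does not use without raising a TypeError.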

Thanks in advance!


escorciav commented Nov 7, 2024

Worked for me (I believe). If needed, check out my fork of clip-bench 😉. You're welcome!

clip_benchmark eval --model "ViT-B-16" --model_type synthclip --pretrained $pretrained --dataset=$dataset --output=$output --dataset_root $dataset_root
# Debugging
# python -m ipdb clip-benchmark/clip_benchmark/cli.py eval --model_type synthclip --pretrained $pretrained --dataset=$dataset --output=$output --dataset_root $dataset_root --num_workers 0

@escorciav

In case anyone is interested

@escorciav

@hammoudhasan or @HaniItani, could you please check whether the following is correct?

import torch
from PIL import Image
from clip_benchmark.models.synthclip import CLIP_VITB16, load_synthclip

checkpoint_path = "./logs/synthclip-30m/checkpoint_best.pt"
# PyTorch has no "gpu" device string; CUDA devices are named "cuda"
device = "cuda" if torch.cuda.is_available() else "cpu"
use_clip_benchmark = True

if not use_clip_benchmark:
    print('Load synthclip as per example...')
    # Wrap in DataParallel so parameter names match the checkpoint keys
    # (assumes the checkpoint was saved from a DataParallel model)
    model = torch.nn.DataParallel(CLIP_VITB16())
    checkpoint = torch.load(checkpoint_path, map_location=device)
    load_status = model.load_state_dict(checkpoint["state_dict"])
    model = model.module.to(device)
    print(load_status)
    # This branch still needs a transform and a tokenizer; reuse the loader's
    _, transform, tokenizer = load_synthclip(pretrained=checkpoint_path, device=device)
else:
    print('Load synthclip as per clip_benchmark...')
    # Keyword names follow the load_synthclip(pretrained, device) signature
    model, transform, tokenizer = load_synthclip(
        pretrained=checkpoint_path,
        device=device
    )
model.eval()  # switch dropout/normalization layers to inference behavior

print('Load & preprocess image...')
img_path = "./open_clip/docs/CLIP.png"
image = Image.open(img_path)
image = image.convert('RGB')
image = transform(image).unsqueeze(0).to(device)  # inputs must be on the model's device
print('Tokenize text...')
text = tokenizer(["a diagram", "a dog", "a cat"]).to(device)
print('Fwd-pass model...')
amp_kwargs = dict(device_type="cuda", dtype=torch.float16) if "cuda" in device else dict(device_type="cpu")

with torch.no_grad(), torch.amp.autocast(**amp_kwargs):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logit_scale = model.logit_scale.exp()

    # L2-normalize so the dot product below is a cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # the open_clip example below prints: [[1., 0., 0.]]

###############################################################################
# open_clip
###############################################################################

# model, _, preprocess = open_clip.create_model_and_transforms(model_arch, pretrained=model_path,
#                                                              load_weights_only=False)
# tokenizer = open_clip.get_tokenizer(model_arch)

# img_path = "./docs/CLIP.png"
# image = preprocess(Image.open(img_path)).unsqueeze(0)
# text = tokenizer(["a diagram", "a dog", "a cat"])

# with torch.no_grad(), torch.cuda.amp.autocast():
#     image_features = model.encode_image(image)
#     text_features = model.encode_text(text)
#     image_features /= image_features.norm(dim=-1, keepdim=True)
#     text_features /= text_features.norm(dim=-1, keepdim=True)

#     text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# print("Label probs:", text_probs)  # prints: [[1., 0., 0.]]
Output (earlier log lines omitted):

Fwd-pass model...
Label probs: tensor([[0.3227, 0.2939, 0.3834]])
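
The non-benchmark branch in the script above wraps the model in `DataParallel` before loading, which suggests the checkpoint keys carry a `module.` prefix. An alternative is to strip that prefix from the keys and load into the bare model. The sketch below is a minimal version of that idea, assuming that key layout; `strip_module_prefix` and the toy key names are made up for illustration and are not part of SynthCLIP or clip_benchmark.

```python
from collections import OrderedDict
from typing import Any, Mapping


def strip_module_prefix(state_dict: Mapping[str, Any], prefix: str = "module.") -> "OrderedDict[str, Any]":
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    cleaned: "OrderedDict[str, Any]" = OrderedDict()
    for key, value in state_dict.items():
        # Only strip when the key really starts with the prefix; leave others untouched.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Toy stand-in for checkpoint["state_dict"]; real values would be tensors.
toy_state = {"module.visual.proj": 1, "module.logit_scale": 2, "already_clean": 3}
cleaned_state = strip_module_prefix(toy_state)
print(list(cleaned_state))
```

With a real checkpoint, the cleaned dictionary would be passed to `model.load_state_dict(...)` on the unwrapped model, and printing the returned status (as the script above does) shows whether any keys were missing or unexpected.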

@escorciav

BTW, using the model's forward pass directly:

output = model(image, text)
image_features, text_features = output["image_embed"], output["text_embed"]
logit_scale = output["logit_scale"]

The result is
Label probs: tensor([[0.2790, 0.3688, 0.3522]])
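
For anyone comparing the numbers in this thread: the label probabilities are a softmax over scaled cosine similarities between one image embedding and each text embedding. The sketch below reproduces that arithmetic in plain Python on made-up 2-D vectors (not real SynthCLIP embeddings) to show how the scale and the spread of similarities control how peaked the result is. The helper names are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def label_probs(image_feat: Sequence[float],
                text_feats: Sequence[Sequence[float]],
                logit_scale: float) -> List[float]:
    """Softmax over logit_scale * cosine similarity, one entry per text."""
    img = l2_normalize(image_feat)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for txt in text_feats
    ]
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up embeddings: the first text is aligned with the image, the last opposes it.
toy_image = [1.0, 0.0]
toy_texts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

peaked = label_probs(toy_image, toy_texts, logit_scale=100.0)
flat = label_probs(toy_image, toy_texts, logit_scale=1.0)
print("scale=100:", peaked)
print("scale=1:  ", flat)
```

When the similarities are nearly equal, the softmax stays close to uniform whatever the scale, so near-uniform probabilities like the ones above are at least consistent with embeddings that do not separate the three captions.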


escorciav commented Dec 6, 2024

Latest version polished by the grrreat @HaniItani is here 🙌

Label probs: tensor([[0.0048, 0.0878, 0.9075]], device='cuda:0') 🎉
