CLIP Training Code #83
Very helpful. Thank you.
Hi, thank you for this training code.
@vkmavani sure. The DataFrame needs an `image` column and a `caption` column, where the image entry is the URL/path to the image and the caption is the string of the caption. Here's the dataset class definition for image-text similarity:

```python
from PIL import Image
from torch.utils.data import Dataset, DataLoader


class image_caption_dataset(Dataset):
    def __init__(self, df):
        self.images = df["image"].tolist()
        self.caption = df["caption"].tolist()

    def __len__(self):
        return len(self.caption)

    def __getitem__(self, idx):
        image = preprocess(Image.open(self.images[idx]))  # preprocess comes from clip.load
        caption = self.caption[idx]
        return image, caption


dataset = image_caption_dataset(df)
train_dataloader = DataLoader(dataset, batch_size=BATCH_SIZE)  # define your own DataLoader
```

With this dataset definition, you can omit the separate preprocessing step inside the training loop. If you are interested in doing image-image similarity, just modify the dataset to return a pair of images.
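For the image-image case mentioned above, a minimal sketch (the column names `image_1`/`image_2` are assumptions, not from the original comment):

```python
from PIL import Image
from torch.utils.data import Dataset


class image_pair_dataset(Dataset):
    def __init__(self, df):
        # Column names are assumptions: two image paths per row.
        self.images_1 = df["image_1"].tolist()
        self.images_2 = df["image_2"].tolist()

    def __len__(self):
        return len(self.images_1)

    def __getitem__(self, idx):
        # preprocess comes from clip.load, same as in the image-text dataset
        img_1 = preprocess(Image.open(self.images_1[idx]))
        img_2 = preprocess(Image.open(self.images_2[idx]))
        return img_1, img_2
```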
What does clip.model.convert_weights mean? And can you provide a complete training code if possible?
@lonngxiang For more information, read #57. clip.model.convert_weights basically converts the CLIP model weights into float16, which helps accelerate training and reduce memory usage. I can't give a fully working example since I'm using a private dataset, but I believe the training code and dataset code I provided are sufficient.
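For illustration, a small sketch of how convert_weights is typically applied after loading the model; the model name and the device handling here are assumptions:

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)  # jit=False so weights are trainable

if device == "cuda":
    clip.model.convert_weights(model)  # cast applicable weights to fp16: faster training, less GPU memory
else:
    model.float()                      # on CPU, keep everything in fp32
```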
Thank you for your kind reply.
There is an error when running this training code:
Thank you very much. It really helps a lot.
@lonngxiang Oh, you are correct. Pardon me, I have edited my code above. The dataset should return something that can be put into a PyTorch tensor.
One more thing: when you use preprocess inside the image_caption_dataset class, is the preprocess inside torch.stack still needed?
Still have an error in images = torch.stack([preprocess(Image.fromarray(img)) for img in list_image], dim=0): AttributeError: 'Tensor' object has no attribute '__array_interface__'
Yeah, if you are already using preprocess inside the class, the resulting batch can be fed directly to CLIP. So that line can be changed into this:
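The exact replacement line isn't shown above; a sketch of what it might look like, assuming preprocess already runs inside __getitem__ as in the dataset class above (model, device, and train_dataloader come from that earlier code):

```python
import clip

# With preprocess applied in __getitem__, the DataLoader's default collate
# already stacks the image tensors, so the manual torch.stack([...]) call goes away.
for images, captions in train_dataloader:
    texts = clip.tokenize(captions)            # captions are still raw strings here
    logits_per_image, logits_per_text = model(images.to(device), texts.to(device))
```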
Then I have another error:
Hmmmm, that error is new to me. Did the error occur when calculating the loss?
Yes, the error occurred in this line. Adding model(images.float(), texts.float()) still gives an error:
Are you using CPU by any chance? Mixed-precision training usually doesn't work on CPU.
Yes, I am running it on CPU.
@lonngxiang I have updated the code again. Basically, remove all code related to mixed-precision training when using CPU instead of GPU |
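A minimal CPU-only sketch of what removing the mixed-precision pieces might look like; the learning rate and the reuse of the dataloader above are assumptions:

```python
import clip
import torch
import torch.nn as nn

# No clip.model.convert_weights and no fp16/fp32 conversion around optimizer.step(),
# since mixed precision needs a GPU. clip.load already returns fp32 weights on CPU.
device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # lr is illustrative
loss_img, loss_txt = nn.CrossEntropyLoss(), nn.CrossEntropyLoss()

for images, captions in train_dataloader:    # dataloader from the dataset class above
    optimizer.zero_grad()
    texts = clip.tokenize(captions)
    logits_per_image, logits_per_text = model(images, texts)
    ground_truth = torch.arange(len(images), dtype=torch.long)
    total_loss = (loss_img(logits_per_image, ground_truth)
                  + loss_txt(logits_per_text, ground_truth)) / 2
    total_loss.backward()
    optimizer.step()                         # plain step: nothing to convert
```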
OK, so kind of you; thank you for your patience.
|
How should I set BATCH_SIZE to get the ground_truth labels?
@lonngxiang Hmmmm, I don't have the faintest idea why the loss is 0. BATCH_SIZE is just an integer that you set. Since the images and texts come in pairs, the first image corresponds to the first text, so the ground truth for the first image is 0; the second image corresponds to the second text, so its ground truth is 1, and so on. You can read more about cross-entropy loss at https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html, especially about the target, and in the CLIP paper, page 5, upper-left part.
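Concretely, a small sketch of what those targets look like for an illustrative batch of four pairs; the random logits are stand-ins, not model output from the original code:

```python
import torch
import torch.nn as nn

# With a batch of N image-text pairs, the i-th image matches the i-th text,
# so the target class for row i of the logit matrix is simply i.
BATCH_SIZE = 4                                   # illustrative value
ground_truth = torch.arange(BATCH_SIZE)          # tensor([0, 1, 2, 3])

loss_fn = nn.CrossEntropyLoss()
logits_per_image = torch.randn(BATCH_SIZE, BATCH_SIZE)   # stand-in for model output
logits_per_text = logits_per_image.t()

total_loss = (loss_fn(logits_per_image, ground_truth)
              + loss_fn(logits_per_text, ground_truth)) / 2
```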
Thanks for your reply. So if you have five pairs, your BATCH_SIZE is five, is that right?
Your BATCH_SIZE determines the number of pairs in each batch. For example, if you have 1000 pairs and set BATCH_SIZE = 20, each epoch will consist of 50 batches of 20 pairs.
Yes, but when I set BATCH_SIZE = 1, the total_loss is always 0. Is this right? What's wrong with it?
Yes, that's the problem. BATCH_SIZE must be greater than 1: with a single pair, the logit matrix is 1×1, the softmax over one logit is always 1, so the cross-entropy loss is always 0 and there is nothing to contrast against.
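A tiny sketch showing why a batch of one always yields zero loss (the logit value is arbitrary):

```python
import torch
import torch.nn.functional as F

# With BATCH_SIZE = 1 the similarity matrix is 1x1: softmax over a single
# logit is 1, so the cross-entropy loss is exactly 0 regardless of the value.
logits = torch.tensor([[3.7]])
target = torch.tensor([0])
print(F.cross_entropy(logits, target))  # tensor(0.)
```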
Thank you, you have helped me a lot and I have learned a lot.
|
Hey! I have a similar question. Did you solve your problem? I use CLIP's image encoder and then add an MLP head, making all parameters learnable, but I get a NaN loss at batch 2. When I only train the head, it is fine. I traced the NaN: the CrossEntropyLoss returned a NaN value.
@vinson2233 Thanks for sharing the fine-tuning code, but when I loaded the saved model to test after fine-tuning, the scores for each category were average. How did this happen? Looking forward to your guidance, thank you.
@TaylorLi123 |
|
@vinson2233 |
I have checked the architecture by printing out the
Where can I find a dataset with texts? |
Were you able to fine-tune for a classification task? If so, can you provide some reference? I have my dataset as image, caption, and the class it belongs to.
Really nice reply. I am trying to borrow the image encoder part of CLIP and fine-tune only the encoder, because I plan to use it as the feature-extraction part of my own model. In detail, I am trying to add some parameters to it and train only those, but I don't know how to split the visual encoder off from CLIP and modify it. Can you provide some guidance? Thanks.
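A hedged sketch of one way to use CLIP's image tower as a frozen feature extractor with a new trainable module on top; the Linear head, its output size, and the image path are hypothetical:

```python
import clip
import torch
from PIL import Image

# CLIP exposes its image tower as model.visual, and model.encode_image() runs it.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

for p in model.parameters():
    p.requires_grad = False                  # freeze all pretrained CLIP weights

# Hypothetical trainable head on top of the 512-dim ViT-B/32 image embedding.
head = torch.nn.Linear(512, 10).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path
features = model.encode_image(image)         # frozen feature extractor
logits = head(features.float())              # only `head` receives gradients
```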
Hi, can you please explain the first point you mentioned? Actually, I want to fine-tune CLIP for multi-label image classification, where one image may belong to multiple classes.
Hi, could you clarify why we are using torch.arange? Suppose the data is randomly shuffled after every epoch: we will have image pairs at different positions every time, so essentially we are not learning anything apart from position (which also changes randomly every time). Instead, this approach makes a lot more sense: https://github.com/moein-shariatnia/OpenAI-CLIP (with the projection head), since we are learning a common representation between image and text embeddings.
@ItzHaad I can give you 2 answers:
|
Hi! @vinson2233
@Heathcliff-saku It's expected because of how the dataset object is created. If you don't want the huge overhead up front, another way is to do the image preprocessing and CLIP tokenization after the data is produced by the DataLoader, but this will create redundant work every epoch. If anyone can give recommendations, feel free to do so, since I don't use CLIP anymore.
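A sketch of that lazy alternative, assuming the dataset returns raw image paths and caption strings (model, device, preprocess, and train_dataloader are reused from the earlier sketches):

```python
import clip
import torch
from PIL import Image

# The dataset stays "lazy": it returns (path, caption), and preprocessing /
# tokenization happen after the DataLoader yields a batch. Less upfront
# overhead, but the work is redone every epoch.
for paths, captions in train_dataloader:
    images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    texts = clip.tokenize(captions).to(device)
    logits_per_image, logits_per_text = model(images, texts)
    # ... loss and optimizer step as usual
```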
How many epochs does CLIP need to be fine-tuned for, and what should the batch size be?
Cell In[13], line 64, in image_title_dataset.__getitem__(self, idx): TypeError: list indices must be integers or slices, not list. Is there any problem in my dataloader? I am using from torch.utils.data import DataLoader.
Hey, can you share your code with me?
@vinson2233 Hi, I would like to train CLIP with my own custom dataset. Can you please advise me on how many images I need to prepare per class? Thank you.
I implemented the above code to load the saved model (.pt), but I encountered this error:
I have implemented the training code above and have been trying to train the model on the Flickr dataset, but my loss keeps increasing until it eventually plateaus. I don't know what the problem is; can anyone provide some insight?
You can just fix your code like below:
|
Hello, I set the batch size to 64, and my total loss stays at 4.02734375 throughout training. I don't know what the problem is; can anyone provide some insight?
@vinson2233 My second question is that my texts may include image captions, keywords, categories, etc. Should I handle them differently? Many thanks! |
Not really an issue; I just want to share my training code, since some people still have difficulties writing the training code. Just modify the code to suit your usage.
Feel free to ask or point out any mistakes in my code.
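A minimal sketch of a fine-tuning loop along the lines discussed in this thread; the hyperparameters, the convert_models_to_fp32 helper, and the reuse of the image_caption_dataset class above are assumptions, not the original code:

```python
import clip
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

EPOCH = 30          # assumption
BATCH_SIZE = 32     # assumption; must be greater than 1

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)


def convert_models_to_fp32(model):
    # Helper (not part of the clip package): cast weights and grads to fp32
    # so the Adam update happens in full precision.
    for p in model.parameters():
        p.data = p.data.float()
        if p.grad is not None:
            p.grad.data = p.grad.data.float()


if device == "cpu":
    model.float()                        # CPU: train entirely in fp32
else:
    clip.model.convert_weights(model)    # GPU: fp16 weights for speed and memory

loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6,
                             betas=(0.9, 0.98), eps=1e-6, weight_decay=0.2)  # illustrative values

dataset = image_caption_dataset(df)      # dataset class from the comment above
train_dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCH):
    for images, captions in train_dataloader:
        optimizer.zero_grad()

        images = images.to(device)
        texts = clip.tokenize(captions).to(device)

        logits_per_image, logits_per_text = model(images, texts)
        ground_truth = torch.arange(len(images), dtype=torch.long, device=device)

        total_loss = (loss_img(logits_per_image, ground_truth)
                      + loss_txt(logits_per_text, ground_truth)) / 2
        total_loss.backward()

        if device == "cpu":
            optimizer.step()
        else:
            convert_models_to_fp32(model)      # fp32 for the optimizer update
            optimizer.step()
            clip.model.convert_weights(model)  # back to fp16 for the next forward pass
```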
Code to save the model :
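A minimal sketch of saving a checkpoint; the file name and the saved fields are assumptions:

```python
import torch

torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": total_loss,
}, "model_checkpoint.pt")  # placeholder path
```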
Code to load the saved model :
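A minimal sketch of restoring the fine-tuned weights (paths are placeholders): rebuild the architecture with clip.load, then load the checkpoint's state dict:

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

checkpoint = torch.load("model_checkpoint.pt", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
```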
Alternative training code :
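One possible alternative, sketched here as an assumption rather than the original code: let torch.cuda.amp handle mixed precision instead of converting weights manually (reusing the model, optimizer, losses, and dataloader from the sketch above):

```python
import clip
import torch

scaler = torch.cuda.amp.GradScaler()
model.float()                                   # keep master weights in fp32 for amp

for images, captions in train_dataloader:
    optimizer.zero_grad()
    images = images.to(device)
    texts = clip.tokenize(captions).to(device)

    with torch.cuda.amp.autocast():             # forward pass runs in mixed precision
        logits_per_image, logits_per_text = model(images, texts)
        ground_truth = torch.arange(len(images), device=device)
        loss = (loss_img(logits_per_image, ground_truth)
                + loss_txt(logits_per_text, ground_truth)) / 2

    scaler.scale(loss).backward()               # scaled gradients to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```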