CLIP Training Code #83
Very helpful. Thank you.
Hi, thank you for this training code.
@vkmavani sure. The DataFrame needs an `image` column and a `caption` column, where the image entry is the URL/path to the image and the caption is the string of the caption. Here's the dataset class definition for image-text similarity:

```python
from PIL import Image
from torch.utils.data import Dataset, DataLoader


class image_caption_dataset(Dataset):
    def __init__(self, df):
        self.images = df["image"].tolist()
        self.caption = df["caption"].tolist()

    def __len__(self):
        return len(self.caption)

    def __getitem__(self, idx):
        image = preprocess(Image.open(self.images[idx]))  # preprocess comes from clip.load
        caption = self.caption[idx]
        return image, caption


dataset = image_caption_dataset(df)
train_dataloader = DataLoader(dataset, batch_size=BATCH_SIZE)  # define your own DataLoader
```

With this dataset definition, you can omit the separate preprocessing step inside the training loop. If you are interested in doing image-image similarity, just modify the dataset to return a pair of images.
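For the image-image case mentioned above, a minimal sketch (the column names `image_1`/`image_2` are assumptions, not from the original comment):

```python
from PIL import Image
from torch.utils.data import Dataset


class image_pair_dataset(Dataset):
    def __init__(self, df):
        # Column names are assumptions: two image paths per row.
        self.images_1 = df["image_1"].tolist()
        self.images_2 = df["image_2"].tolist()

    def __len__(self):
        return len(self.images_1)

    def __getitem__(self, idx):
        # preprocess comes from clip.load, same as in the image-text dataset
        img_1 = preprocess(Image.open(self.images_1[idx]))
        img_2 = preprocess(Image.open(self.images_2[idx]))
        return img_1, img_2
```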
What does clip.model.convert_weights mean? And can you provide a complete training code if possible?
@lonngxiang For more information, read #57. clip.model.convert_weights basically converts the CLIP model weights into float16, which helps accelerate training and reduce memory usage. I can't give a fully working example since I'm using a private dataset, but I believe the training code and dataset code I provided are sufficient.
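For illustration, a small sketch of how convert_weights is typically applied after loading the model; the model name and the device handling here are assumptions:

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)  # jit=False so weights are trainable

if device == "cuda":
    clip.model.convert_weights(model)  # cast applicable weights to fp16: faster training, less GPU memory
else:
    model.float()                      # on CPU, keep everything in fp32
```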
Thank you for your kind reply.
There is an error when running this training code:
Thank you very much. It really helps a lot.
@lonngxiang Oh, you are correct. Pardon me, I have edited my code above. The dataset should return something that can be put into a PyTorch tensor.
One more thing: when you use preprocess inside the image_caption_dataset class, is the preprocess inside torch.stack still needed?
Still have an error in images = torch.stack([preprocess(Image.fromarray(img)) for img in list_image], dim=0): AttributeError: 'Tensor' object has no attribute '__array_interface__'
Yeah, if you are already using preprocess inside the class, the resulting batch can be fed directly to CLIP. So that line can be changed into this:
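The exact replacement line isn't shown above; a sketch of what it might look like, assuming preprocess already runs inside __getitem__ as in the dataset class above (model, device, and train_dataloader come from that earlier code):

```python
import clip

# With preprocess applied in __getitem__, the DataLoader's default collate
# already stacks the image tensors, so the manual torch.stack([...]) call goes away.
for images, captions in train_dataloader:
    texts = clip.tokenize(captions)            # captions are still raw strings here
    logits_per_image, logits_per_text = model(images.to(device), texts.to(device))
```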
Then I have another error:
Hmmmm, that error is new to me. Did the error occur when calculating the loss?
Yes, the error occurred in this line. Adding model(images.float(), texts.float()) still gives an error:
Are you using CPU by any chance? Mixed-precision training usually doesn't work on CPU.
Yes, I am running it on CPU.
@lonngxiang I have updated the code again. Basically, remove all code related to mixed-precision training when using CPU instead of GPU |
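A minimal CPU-only sketch of what removing the mixed-precision pieces might look like; the learning rate and the reuse of the dataloader above are assumptions:

```python
import clip
import torch
import torch.nn as nn

# No clip.model.convert_weights and no fp16/fp32 conversion around optimizer.step(),
# since mixed precision needs a GPU. clip.load already returns fp32 weights on CPU.
device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # lr is illustrative
loss_img, loss_txt = nn.CrossEntropyLoss(), nn.CrossEntropyLoss()

for images, captions in train_dataloader:    # dataloader from the dataset class above
    optimizer.zero_grad()
    texts = clip.tokenize(captions)
    logits_per_image, logits_per_text = model(images, texts)
    ground_truth = torch.arange(len(images), dtype=torch.long)
    total_loss = (loss_img(logits_per_image, ground_truth)
                  + loss_txt(logits_per_text, ground_truth)) / 2
    total_loss.backward()
    optimizer.step()                         # plain step: nothing to convert
```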
OK, so kind of you; thank you for your patience.
|
How should I set BATCH_SIZE to get the ground_truth labels?
@lonngxiang Hmmmm, I don't have the faintest idea why the loss is 0. BATCH_SIZE is just an integer that you set. Since the images and texts come in pairs, the first image corresponds to the first text, so the ground truth for the first image is 0; the second image corresponds to the second text, so its ground truth is 1, and so on. You can read more about cross-entropy loss at https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html, especially about the target, and in the CLIP paper, page 5, upper-left part.
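Concretely, a small sketch of what those targets look like for an illustrative batch of four pairs; the random logits are stand-ins, not model output from the original code:

```python
import torch
import torch.nn as nn

# With a batch of N image-text pairs, the i-th image matches the i-th text,
# so the target class for row i of the logit matrix is simply i.
BATCH_SIZE = 4                                   # illustrative value
ground_truth = torch.arange(BATCH_SIZE)          # tensor([0, 1, 2, 3])

loss_fn = nn.CrossEntropyLoss()
logits_per_image = torch.randn(BATCH_SIZE, BATCH_SIZE)   # stand-in for model output
logits_per_text = logits_per_image.t()

total_loss = (loss_fn(logits_per_image, ground_truth)
              + loss_fn(logits_per_text, ground_truth)) / 2
```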
Thanks for your reply. So if you have five pairs, your BATCH_SIZE is five, is that right?
Your BATCH_SIZE determines the number of pairs in each batch. For example, if you have 1000 pairs and set BATCH_SIZE = 20, each epoch will consist of 50 batches of 20 pairs.
Yes, but when I set BATCH_SIZE = 1, the total_loss is always 0. Is this right? What's wrong with it?
Yes, that's the problem. BATCH_SIZE must be greater than 1: with a single pair, the logit matrix is 1×1, the softmax over one logit is always 1, so the cross-entropy loss is always 0 and there is nothing to contrast against.
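A tiny sketch showing why a batch of one always yields zero loss (the logit value is arbitrary):

```python
import torch
import torch.nn.functional as F

# With BATCH_SIZE = 1 the similarity matrix is 1x1: softmax over a single
# logit is 1, so the cross-entropy loss is exactly 0 regardless of the value.
logits = torch.tensor([[3.7]])
target = torch.tensor([0])
print(F.cross_entropy(logits, target))  # tensor(0.)
```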
Thank you, you have helped me a lot and I have learned a lot.
|
Hey! I have a similar question. Did you solve your problem? I use CLIP's image encoder and then add an MLP head, making all parameters learnable, but I get a NaN loss at batch 2. When I only train the head, it is fine. I traced the NaN: the CrossEntropyLoss returned a NaN value.
@vinson2233 Thanks for sharing the fine-tuning code, but when I loaded the saved model to test after fine-tuning, the scores for each category were average. How did this happen? Looking forward to your guidance, thank you.
@TaylorLi123 |
|
@vinson2233 |
I have checked the architecture by printing out the
Where can I find a dataset with texts? |
Were you able to fine-tune for a classification task? If so, can you provide some reference? I have my dataset as image, caption, and the class it belongs to.
Really nice reply. I am trying to borrow the image encoder part of CLIP and fine-tune only the encoder, because I plan to use it as the feature-extraction part of my own model. In detail, I am trying to add some parameters to it and train only those, but I don't know how to split the visual encoder off from CLIP and modify it. Can you provide some guidance? Thanks.
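A hedged sketch of one way to use CLIP's image tower as a frozen feature extractor with a new trainable module on top; the Linear head, its output size, and the image path are hypothetical:

```python
import clip
import torch
from PIL import Image

# CLIP exposes its image tower as model.visual, and model.encode_image() runs it.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

for p in model.parameters():
    p.requires_grad = False                  # freeze all pretrained CLIP weights

# Hypothetical trainable head on top of the 512-dim ViT-B/32 image embedding.
head = torch.nn.Linear(512, 10).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path
features = model.encode_image(image)         # frozen feature extractor
logits = head(features.float())              # only `head` receives gradients
```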
Hi, can you please explain the first point you mentioned? Actually, I want to fine-tune CLIP for multi-label image classification, where one image may belong to multiple classes.
Hi, could you clarify why we are using torch.arange? Suppose the data is randomly shuffled after every epoch: we will have image pairs at different positions every time, so essentially we are not learning anything apart from position (which also changes randomly every time). Instead, this approach makes a lot more sense: https://github.com/moein-shariatnia/OpenAI-CLIP (with the projection head), since we are learning a common representation between image and text embeddings.
@ItzHaad I can give you 2 answers:
|
Hi! @vinson2233
@Heathcliff-saku It's expected because of how the dataset object is created. If you don't want the huge overhead up front, another way is to do the image preprocessing and CLIP tokenization after the data is produced by the DataLoader, but this will create redundant work every epoch. If anyone can give recommendations, feel free to do so, since I don't use CLIP anymore.
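A sketch of that lazy alternative, assuming the dataset returns raw image paths and caption strings (model, device, preprocess, and train_dataloader are reused from the earlier sketches):

```python
import clip
import torch
from PIL import Image

# The dataset stays "lazy": it returns (path, caption), and preprocessing /
# tokenization happen after the DataLoader yields a batch. Less upfront
# overhead, but the work is redone every epoch.
for paths, captions in train_dataloader:
    images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    texts = clip.tokenize(captions).to(device)
    logits_per_image, logits_per_text = model(images, texts)
    # ... loss and optimizer step as usual
```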
How many epochs does CLIP need to be fine-tuned for, and what should the batch size be?
Cell In[13], line 64, in image_title_dataset.__getitem__(self, idx): TypeError: list indices must be integers or slices, not list. Is there any problem in my dataloader? I am using from torch.utils.data import DataLoader.
Hey, can you share your code with me?
@vinson2233 Hi, I would like to train CLIP with my own custom dataset. Can you please advise me on how many images I need to prepare per class? Thank you.
I implemented the above code to load the saved model (.pt), but I encountered this error:
I have implemented the training code above and have been trying to train the model on the Flickr dataset, but my loss keeps increasing until it eventually plateaus. I don't know what the problem is; can anyone provide some insight?
You can just fix your code like below:
|
Hello, I set the batch size to 64, and my total loss stays at 4.02734375 throughout training. I don't know what the problem is; can anyone provide some insight?
@vinson2233 My second question is that my texts may include image captions, keywords, categories, etc. Should I handle them differently? Many thanks! |
Not really an issue; I just want to share my training code, since some people still have difficulties writing the training code. Just modify the code to suit your usage.
Feel free to ask or point out any mistakes in my code.
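A minimal sketch of a fine-tuning loop along the lines discussed in this thread; the hyperparameters, the convert_models_to_fp32 helper, and the reuse of the image_caption_dataset class above are assumptions, not the original code:

```python
import clip
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

EPOCH = 30          # assumption
BATCH_SIZE = 32     # assumption; must be greater than 1

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)


def convert_models_to_fp32(model):
    # Helper (not part of the clip package): cast weights and grads to fp32
    # so the Adam update happens in full precision.
    for p in model.parameters():
        p.data = p.data.float()
        if p.grad is not None:
            p.grad.data = p.grad.data.float()


if device == "cpu":
    model.float()                        # CPU: train entirely in fp32
else:
    clip.model.convert_weights(model)    # GPU: fp16 weights for speed and memory

loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6,
                             betas=(0.9, 0.98), eps=1e-6, weight_decay=0.2)  # illustrative values

dataset = image_caption_dataset(df)      # dataset class from the comment above
train_dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCH):
    for images, captions in train_dataloader:
        optimizer.zero_grad()

        images = images.to(device)
        texts = clip.tokenize(captions).to(device)

        logits_per_image, logits_per_text = model(images, texts)
        ground_truth = torch.arange(len(images), dtype=torch.long, device=device)

        total_loss = (loss_img(logits_per_image, ground_truth)
                      + loss_txt(logits_per_text, ground_truth)) / 2
        total_loss.backward()

        if device == "cpu":
            optimizer.step()
        else:
            convert_models_to_fp32(model)      # fp32 for the optimizer update
            optimizer.step()
            clip.model.convert_weights(model)  # back to fp16 for the next forward pass
```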
Code to save the model :
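A minimal sketch of saving a checkpoint; the file name and the saved fields are assumptions:

```python
import torch

torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": total_loss,
}, "model_checkpoint.pt")  # placeholder path
```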
Code to load the saved model :
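A minimal sketch of restoring the fine-tuned weights (paths are placeholders): rebuild the architecture with clip.load, then load the checkpoint's state dict:

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

checkpoint = torch.load("model_checkpoint.pt", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
```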
Alternative training code :
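One possible alternative, sketched here as an assumption rather than the original code: let torch.cuda.amp handle mixed precision instead of converting weights manually (reusing the model, optimizer, losses, and dataloader from the sketch above):

```python
import clip
import torch

scaler = torch.cuda.amp.GradScaler()
model.float()                                   # keep master weights in fp32 for amp

for images, captions in train_dataloader:
    optimizer.zero_grad()
    images = images.to(device)
    texts = clip.tokenize(captions).to(device)

    with torch.cuda.amp.autocast():             # forward pass runs in mixed precision
        logits_per_image, logits_per_text = model(images, texts)
        ground_truth = torch.arange(len(images), device=device)
        loss = (loss_img(logits_per_image, ground_truth)
                + loss_txt(logits_per_text, ground_truth)) / 2

    scaler.scale(loss).backward()               # scaled gradients to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```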