Add BLIP #20716

younesbelkada · 2022-12-09T23:16:24Z

What does this PR do?

BLIP is a model from Salesforce that can perform visual question answering, image captioning, and image-text retrieval. It has also been used for several fine-tuned Stable Diffusion variants, such as Pokemon Stable Diffusion and Naruto Stable Diffusion, to generate text descriptions from images in order to create text-image paired datasets.

Original repo: https://github.com/salesforce/BLIP

add integration tests
Push weights
document everything

Users will be able to use BLIP for three main use cases:

1- Conditional Generation (Image captioning):

from PIL import Image
import requests
from transformers import BlipForConditionalGeneration, BlipProcessor

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')   

model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
text = "a picture of" # the prefix is optional

inputs = processor(image, text, return_tensors="pt")
output = model.generate(**inputs)

print(processor.decode(output[0], skip_special_tokens=True))
>>> a picture of a woman and a dog sitting in a beach

1- bis Conditional Generation (Image captioning with no prefix!):

from PIL import Image
import requests
from transformers import BlipForConditionalGeneration, BlipProcessor

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')   

model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

inputs = processor(image, return_tensors="pt")
output = model.generate(**inputs)

print(processor.decode(output[0], skip_special_tokens=True))
>>> an image of a woman and a dog sitting in a beach
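
The description mentions using BLIP captions to build text-image paired datasets. A minimal sketch of that idea follows; it is an illustration, not part of this PR. The helper names (`make_blip_captioner`, `build_paired_records`, `to_jsonl`) and the JSON Lines record layout are assumptions made for this example. The model calls mirror the no-prefix captioning snippet above.

```python
import json
from typing import Callable, Dict, Iterable, List


def make_blip_captioner(checkpoint: str = "Salesforce/blip-image-captioning-base") -> Callable[[str], str]:
    """Return a function mapping an image path to a BLIP caption (hypothetical helper)."""
    # Imported lazily so the dataset helpers below work without these packages installed.
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)

    def caption(path: str) -> str:
        image = Image.open(path).convert("RGB")
        inputs = processor(image, return_tensors="pt")
        output = model.generate(**inputs)
        return processor.decode(output[0], skip_special_tokens=True)

    return caption


def build_paired_records(image_paths: Iterable[str], caption_fn: Callable[[str], str]) -> List[Dict[str, str]]:
    """Pair every image path with the caption produced by caption_fn."""
    return [{"image": str(path), "text": caption_fn(path)} for path in image_paths]


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one text-image pair per line."""
    return "\n".join(json.dumps(record) for record in records)
```

Any callable that maps a path to a string can be passed as `caption_fn`, so the pairing logic can be tried with a stub before downloading weights, for example `build_paired_records(paths, make_blip_captioner())` once the checkpoint is available.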

2- Visual question answering

from PIL import Image
import requests
from transformers import BlipForQuestionAnswering, BlipProcessor

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')   

model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")

question = ["How many dogs are in this image?"]
inputs = processor(image, question, return_tensors="pt")

output = model.generate(**inputs)
print(processor.decode(output[0], skip_special_tokens=True))
>>> 1

3- Image text retrieval (score matching)

import torch
from PIL import Image
import requests
from transformers import BlipForImageTextRetrieval, BlipProcessor

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')   

model = BlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco")
processor = BlipProcessor.from_pretrained("Salesforce/blip-itm-base-coco")

question = ["A picture of a woman with a dog sitting in a beach"]
inputs = processor(image, question, return_tensors="pt")

out_itm = model(**inputs, use_itm_head=True)
out = model(**inputs, use_itm_head=False)

print(out[0])  # cosine similarity score
>>> 0.21
print(torch.nn.functional.softmax(out_itm[0], dim=1)[:, 1])
>>> 0.46
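
To make the two retrieval scores above concrete, here is a small dependency-free sketch of what each number represents. It assumes, as the `[:, 1]` indexing in the snippet suggests, that the ITM head emits two logits per pair ordered as (no match, match). The function names and toy inputs are hypothetical and are not model outputs.

```python
import math
from typing import Sequence


def itm_match_probability(logits: Sequence[float]) -> float:
    """Softmax over two ITM logits; returns the probability at index 1 (assumed 'match')."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    return exps[1] / sum(exps)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The ITM probability is bounded in [0, 1], while cosine similarity lies in [-1, 1], so the two values printed above are not directly comparable.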

cc @NielsRogge

Fixes salesforce/LAVIS#64

younesbelkada · 2022-12-15T22:13:53Z

The failing CI test seems to be related to #20790

* add new model like * add v1 * v1 * v1 * vision encoder logits match * v2 * fix * add docstring * CI tests pass * fix tests * make fixup * add to `toctree` * fix processors * fix processors * fix doc * fill title * add content doc * remove from tokenization auto * fix config * change order * add `# Copied from` * few fixes - add correct license on modeling text - remove dummy argument * Apply suggestions from code review Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * replace name * refactor a bit * more refactor * remove unused arg * make fixup + remove some `# Adapted from ...` * Apply suggestions from code review Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * more `# Copied from` * Apply suggestions from code review Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * now `generate` supports no prefix * remove `FeatureExtractor` * fix path * correct dependency * fix tests * few fixes * add integration tests * add correct conversion script * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * add `blip` to tokenization auto * fix docstrings * fix test + add image * remove processor from uncorrect place * Apply suggestions from code review Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * clean up a bit * Apply suggestions from code review Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * clean pixel mask * clean pixel mask * fix `F` * Update src/transformers/models/blip/modeling_blip.py Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * fix output * Apply suggestions from code review Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * fix pad token id * remove `token_type_ids` * make fixup * Apply suggestions from 
code review Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * make fixup * Apply suggestions from code review Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * add comments * Update src/transformers/models/blip/modeling_blip.py Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * remove `token_type_ids` * make fixup * better name * replace with `image_attention_mask` * refactor * make fixup * better docstring * replace `answer_xx` * remove ununsed args * add `labels` * add `labels` * fix processing tests * make fixup * make fixup * put correct repo * remove `pad` * remove `crop` and `center_crop` * Update src/transformers/models/blip/image_processing_blip.py Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * fix * remove `size_divisor` * fix weights `init` * remove unneeded functions * add suggestions * minor changes - change slow test output for PT 1.13 - docstring order * replace `feature_extractor` by `image_processor` * fix doctests * fix weight init order + add fp16 slow test * add `blip` to doctest * add correct repo name and fix test * Update src/transformers/models/blip/processing_blip.py Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * fix tests * use `convert_to_rgb` from `image_transforms` * make fixup * fix large loading issue Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

ivelin · 2023-01-31T21:09:42Z

Awesome contribution! Thank you @younesbelkada.

Just noticed Salesforce released BLIP2. Not sure how much work it would be to port to huggingface.
https://github.com/salesforce/LAVIS/tree/main/projects/blip2

younesbelkada added 8 commits December 8, 2022 14:28

add new model like

edda537

add v1

afbc3bc

v1

d21be07

v1

365fe14

vision encoder logits match

4538c24

v2

d6ff6e0

fix

955d7b5

add docstring

05f1849

younesbelkada changed the title ~~Add BLIP~~ [WIP] Add BLIP Dec 11, 2022

younesbelkada added 4 commits December 11, 2022 12:08

CI tests pass

12ffea9

fix tests

5215465

make fixup

da07c0f

add to toctree

859158a

younesbelkada added 11 commits December 11, 2022 12:49

fix processors

ee56ba0

fix processors

fb45b64

fix doc

d256a0a

fill title

980b723

add content doc

e53ad6c

remove from tokenization auto

31e4339

fix config

da7f972

change order

2f3b6dd

add # Copied from

5a1fd7a

few fixes

6387aec

- add correct license on modeling text - remove dummy argument

Merge remote-tracking branch 'upstream/main' into add-blip

59d5131