
🚨 Add Blip2ForImageTextRetrieval #29261

Merged

Conversation

jpizarrom
Contributor

@jpizarrom jpizarrom commented Feb 23, 2024

What does this PR do?

Add Blip2ForImageTextRetrieval, Blip2TextModelWithProjection, and Blip2VisionModelWithProjection models to enable computing image-text matching scores and extracting text, image, and multimodal features.

Fixes part of #25300 and #25245.

This is a continuation of #25612; I tried to apply most of the feedback received in that PR.
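
A minimal usage sketch of the new retrieval model (the checkpoint name, forward argument, and output fields below are assumptions based on this PR, not a final reference):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForImageTextRetrieval

# Checkpoint name assumed for illustration; the final checkpoints are meant to live under the Salesforce org.
checkpoint = "Salesforce/blip2-itm-vit-g"
processor = AutoProcessor.from_pretrained(checkpoint)
model = Blip2ForImageTextRetrieval.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, text="two cats lying on a couch", return_tensors="pt")

with torch.no_grad():
    # Image-text matching (ITM) head: logits over (no match, match) for the pair.
    itm_logits = model(**inputs, use_image_text_matching_head=True).logits_per_image
    itm_probs = itm_logits.softmax(dim=-1)

    # Contrastive (ITC) path: similarity scores between projected image and text features.
    itc_scores = model(**inputs, use_image_text_matching_head=False).logits_per_image
```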

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker @amyeroberts

@jpizarrom jpizarrom marked this pull request as draft February 23, 2024 19:41
@jpizarrom jpizarrom marked this pull request as ready for review February 23, 2024 20:14
@jpizarrom jpizarrom changed the title WIP Add Blip2ForImageTextRetrieval Add Blip2ForImageTextRetrieval Feb 23, 2024
@ArthurZucker
Collaborator

cc @NielsRogge and @younesbelkada if one of you wants to review once @jpizarrom makes the CIs go green!

@jpizarrom jpizarrom changed the title Add Blip2ForImageTextRetrieval 🚨 Add Blip2ForImageTextRetrieval Mar 2, 2024
@jpizarrom
Contributor Author

cc @NielsRogge and @younesbelkada if one of you wants to review once @jpizarrom makes the CIs go green!

Hi, what could I do to make the CIs go green? Shall I just merge upstream/main into my branch, or rebase onto it?

@amyeroberts
Collaborator

@jpizarrom It's preferable for you to rebase onto main. To see how to make the CIs green, you'll need to click on Details and look at the output error logs from CircleCI. I'd suggest doing this after rebasing so you can see which errors are coming from this branch.

@jpizarrom jpizarrom force-pushed the add_blip2_image_text_retrieval_model branch from 0e82065 to 9aa9a15 Compare March 22, 2024 15:16
Collaborator

@amyeroberts amyeroberts left a comment

Thanks for adding this! Overall looks great, just a few small comments.

Once they're addressed, we can move the checkpoints to be under the Salesforce org.

Comment on lines 372 to 389
@classmethod
def from_vision_qformer_configs(
    cls,
    vision_config: Blip2VisionConfig,
    qformer_config: Blip2QFormerConfig,
    **kwargs,
):
    r"""
    Instantiate a [`Blip2Config`] (or a derived class) from BLIP-2 vision and Q-Former model configurations.

    Returns:
        [`Blip2Config`]: An instance of a configuration object
    """

    return cls(
        vision_config=vision_config.to_dict(),
        qformer_config=qformer_config.to_dict(),
        **kwargs,
    )
Collaborator

I don't think it's necessary to add a separate method here. We can just make text_config optional in from_vision_qformer_text_config
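
For illustration, a hedged sketch of that suggestion written as a standalone helper (the real change would live on the existing classmethod; Blip2Config already tolerates a missing text config):

```python
from typing import Optional

from transformers import Blip2Config, Blip2QFormerConfig, Blip2VisionConfig


def build_blip2_config(
    vision_config: Blip2VisionConfig,
    qformer_config: Blip2QFormerConfig,
    text_config: Optional[dict] = None,
    **kwargs,
) -> Blip2Config:
    """Sketch: a single builder where the text config may simply be omitted."""
    return Blip2Config(
        vision_config=vision_config.to_dict(),
        qformer_config=qformer_config.to_dict(),
        text_config=text_config,
        **kwargs,
    )


config = build_blip2_config(Blip2VisionConfig(), Blip2QFormerConfig())
print(type(config.text_config).__name__)  # Blip2Config falls back to its default text config
```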

Contributor Author

from_vision_qformer_configs was removed

src/transformers/models/blip_2/modeling_blip_2.py (outdated, resolved)
Comment on lines 2365 to 2387
        if self.device != torch.device("cpu"):
            with torch.cuda.amp.autocast(dtype=torch.float16):
                vision_outputs = self.vision_model(
                    pixel_values=pixel_values,
                    output_attentions=output_attentions,
                    output_hidden_states=output_hidden_states,
                    return_dict=return_dict,
                )
        else:
            vision_outputs = self.vision_model(
                pixel_values=pixel_values,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
Collaborator

Autocasting and typing should be handled outside of the model definition

Suggested change
-        if self.device != torch.device("cpu"):
-            with torch.cuda.amp.autocast(dtype=torch.float16):
-                vision_outputs = self.vision_model(
-                    pixel_values=pixel_values,
-                    output_attentions=output_attentions,
-                    output_hidden_states=output_hidden_states,
-                    return_dict=return_dict,
-                )
-        else:
-            vision_outputs = self.vision_model(
-                pixel_values=pixel_values,
-                output_attentions=output_attentions,
-                output_hidden_states=output_hidden_states,
-                return_dict=return_dict,
-            )
+        vision_outputs = self.vision_model(
+            pixel_values=pixel_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )

Contributor Author

This was done because, in the original model, the autocast was applied only to the vision layers; I don't yet know how to do this in a different way.

https://github.com/salesforce/LAVIS/blob/ac8fc98c93c02e2dfb727e24a361c4c309c8dbbc/lavis/models/blip2_models/blip2_qformer.py#L423-L424


Contributor Author

It was removed, as discussed in #29261 (comment).

tests/models/blip_2/test_modeling_blip_2.py (outdated, resolved)
tests/models/blip_2/test_modeling_blip_2.py (outdated, resolved)
src/transformers/models/blip_2/modeling_blip_2.py (outdated, resolved)
Comment on lines 1199 to 1198
        if config.use_qformer_text_input:
            self.embeddings = Blip2TextEmbeddings(config)
Collaborator

Instead of using this config argument to conditionally create and call this layer, I'd suggest calling self.embeddings only if input_ids is not None.

Suggested change
-        if config.use_qformer_text_input:
-            self.embeddings = Blip2TextEmbeddings(config)
+        self.embeddings = Blip2TextEmbeddings(config)
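
For illustration, a standalone toy sketch of the suggested pattern (not the model's actual code): the embedding layer always exists, and the text path is only taken when input_ids are provided.

```python
from typing import Optional

import torch
from torch import nn


class ToyQFormer(nn.Module):
    """Toy stand-in: the embedding layer is always created but only used for text inputs."""

    def __init__(self, vocab_size: int = 100, hidden_size: int = 8):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden_size)

    def forward(self, query_embeds: torch.Tensor, input_ids: Optional[torch.Tensor] = None):
        if input_ids is not None:
            # Text path: prepend the learned queries to the token embeddings.
            return torch.cat([query_embeds, self.embeddings(input_ids)], dim=1)
        # Query-only path: no text embeddings involved.
        return query_embeds


model = ToyQFormer()
queries = torch.randn(2, 4, 8)
print(model(queries).shape)                                  # torch.Size([2, 4, 8])
print(model(queries, torch.randint(0, 100, (2, 3))).shape)   # torch.Size([2, 7, 8])
```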

Contributor Author

When this layer is always created, I get these kinds of errors and don't know how to fix them.
Some Blip2 models do not use these BERT-based embeddings; they use OPT or Flan-T5 to create the query_embeds. Maybe I could try to refactor the code to move Blip2TextEmbeddings outside of Blip2QFormerModel and always pass query_embeds. What do you think?

FAILED tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_training_gradient_checkpointing - AssertionError: False is not true : qformer.embeddings.word_embeddings.weight in Blip2ForConditionalGeneration has no gradient!
FAILED tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelTest::test_training_gradient_checkpointing - AssertionError: False is not true : qformer.embeddings.word_embeddings.weight in Blip2ForConditionalGeneration has no gradient!

Contributor Author

I did a refactor: the embeddings were removed from Blip2QFormerModel and placed into Blip2ForImageTextRetrieval and Blip2TextModelWithProjection, but to do so I needed to add a query_length param to Blip2QFormerModel.forward.

        # past_key_values_length
        past_key_values_length = (
            past_key_values[0][0].shape[2] - self.config.query_length if past_key_values is not None else 0
        )

        query_length = query_embeds.shape[1] if query_embeds is not None else 0

embedding_output = self.layernorm(query_embeds)
if self.config.use_qformer_text_input:
Collaborator

Suggested change
-        if self.config.use_qformer_text_input:
+        if input_ids is not None:

Contributor Author

This is outdated because the embeddings were removed from Blip2QFormerModel.

Comment on lines 1373 to 1376
        # TODO: maybe have a cleaner way to cast the input (from `Blip2Processor` side?)
        expected_dtype = self.dtype
        if encoder_hidden_states is not None and encoder_hidden_states.dtype != expected_dtype:
            encoder_hidden_states = encoder_hidden_states.to(expected_dtype)
Collaborator

Is this even necessary?

Contributor

Indeed, this should not be necessary, given that the modeling code is in torch.float32 by default.

Contributor Author

It was removed.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@jpizarrom jpizarrom changed the title 🚨 Add Blip2ForImageTextRetrieval WIP 🚨 Add Blip2ForImageTextRetrieval May 1, 2024
@jpizarrom jpizarrom force-pushed the add_blip2_image_text_retrieval_model branch from 05327aa to da0cc83 Compare May 1, 2024 06:57
        )

        if self.device != torch.device("cpu"):
            with torch.cuda.amp.autocast(dtype=torch.float16):
Contributor

As far as I can tell we don't add torch.cuda.amp.autocast code to modeling files, they are just in float32 by default. This was discussed on the original BLIP-2 model addition PR from what I remember. It's up to users to call something like torch.cuda.amp.autocast themselves if they wish to load the model in a different precision than the default one (cc @younesbelkada).

Hence, in the conversion script I cast both the original weights and my BLIP-2 implementation to float32 in order to verify the conversion.
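
For reference, a minimal sketch of that user-side pattern (a toy float32 module stands in for a BLIP-2 checkpoint; requires a CUDA device):

```python
import torch
from torch import nn

# Stand-in for any model loaded in the default float32 precision.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2)).cuda()
inputs = torch.randn(4, 16, device="cuda")

# The modeling code stays in float32; the caller opts into mixed precision explicitly.
with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.float16):
    outputs = model(inputs)

print(outputs.dtype)  # torch.float16 inside the autocast region
```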


Contributor

Yes that's right

Contributor Author

@jpizarrom jpizarrom May 1, 2024

It was removed; a PR was opened on your fork to also remove the autocast from the ITM model: NielsRogge/LAVIS#1

@@ -84,6 +84,99 @@ def to_tuple(self) -> Tuple[Any]:
)


@dataclass
class Blip2ImageTextMatchingModelOutput(ModelOutput):
Contributor

Not sure if feasible, but it'd be nice to match the output class of CLIP, which is also an image-text matching model. It consists of the following keys:

  • loss
  • logits_per_image (this I assume is the itm_score)
  • logits_per_text (this I assume is the itm_score transposed)
  • and some other keys which are CLIP-specific.

Making sure that Blip2ForImageTextRetrieval matches this would allow it to be added to the zero-shot image classification pipeline, which relies on the logits_per_image output key (a sketch follows below).
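
A hedged sketch of what such a CLIP-style output could look like, with the field set inferred from the comparison above rather than taken from the final implementation:

```python
from dataclasses import dataclass
from typing import Optional

import torch

from transformers.utils import ModelOutput


@dataclass
class Blip2ImageTextMatchingModelOutput(ModelOutput):
    """Sketch: mirrors CLIP's output so pipelines can read logits_per_image / logits_per_text."""

    loss: Optional[torch.FloatTensor] = None
    logits_per_image: torch.FloatTensor = None
    logits_per_text: torch.FloatTensor = None
    text_embeds: torch.FloatTensor = None
    image_embeds: torch.FloatTensor = None
```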

Contributor

Otherwise we will have a hard time adding BLIP-2 support to the zero-shot image classification pipeline.

Contributor Author

Hi @NielsRogge, I updated the output to match the CLIP output, but this PR is not being updated with my latest commits.

Contributor

@NielsRogge NielsRogge left a comment

Thanks for your work! I would, however, request some changes in order to make BLIP-2 compatible with the zero-shot image classification pipeline.

        input_ids: Optional[torch.FloatTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        query_embeds: Optional[torch.FloatTensor] = None,
        past_key_values_length: int = 0,
Contributor

Suggested change
-        past_key_values_length: int = 0,

past_key_values are not used, I assume.

Contributor Author

@jpizarrom jpizarrom May 18, 2024

It was removed, thanks.

@@ -704,6 +716,16 @@ class Blip2ModelTest(ModelTesterMixin, PipelineTesterMixin, GenerationTesterMixi
    test_attention_outputs = False
    test_torchscript = False

    # TODO: Fix the failed tests
Contributor

To be addressed?

Contributor Author

Not in this PR; I don't believe it is related to the changes in this PR. Blip2ForConditionalGeneration fails there, but I will verify that.

I wanted to make the test pass and leave a comment about this, as I saw similar comments on other models.

Contributor Author

This is the error I was getting; it also occurs on the main branch:

FAILED tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelTest::test_pipeline_visual_question_answering_fp16 - RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

Collaborator

@jpizarrom Could you open an issue to track this to make sure this isn't lost?

Contributor

Refer to this comment on why the test fails. You're probably running this on CPU.
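
A small reproduction sketch of that failure mode (behaviour depends on the PyTorch build; CPU kernels for fp16 LayerNorm were missing at the time):

```python
import torch

ln = torch.nn.LayerNorm(8).half()
x = torch.randn(2, 8, dtype=torch.float16)

try:
    ln(x)  # on CPU this can raise: "LayerNormKernelImpl" not implemented for 'Half'
except RuntimeError as err:
    print(err)

if torch.cuda.is_available():
    print(ln.cuda()(x.cuda()).dtype)  # torch.float16 on GPU
```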

@@ -304,7 +401,13 @@ class Blip2PreTrainedModel(PreTrainedModel):
    config_class = Blip2Config
    base_model_prefix = "blip"
    supports_gradient_checkpointing = True
-    _no_split_modules = ["Blip2Attention", "T5Block", "OPTDecoderLayer"]
+    _no_split_modules = [
Contributor Author

@NielsRogge I updated _no_split_modules; I hope this will fix the slow multi-GPU tests that were failing.
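
For context, a hedged sketch of where _no_split_modules comes into play (checkpoint name assumed): with device_map="auto", accelerate shards the model across the available GPUs while keeping every module class listed in _no_split_modules on a single device.

```python
from transformers import Blip2ForImageTextRetrieval

# Sketch: shard across available GPUs; modules listed in _no_split_modules are never split.
model = Blip2ForImageTextRetrieval.from_pretrained(
    "Salesforce/blip2-itm-vit-g",  # checkpoint name assumed for illustration
    device_map="auto",
)
print(model.hf_device_map)
```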

@jpizarrom
Contributor Author

Hi @amyeroberts, the slow tests are passing now. Please let me know if I need to make any more changes.

@atomwalk12

Please excuse me as I am not OP. However, I have an inquiry about this feature as I would really like to be able to use it. Generally, how much time does it take once a feature has been merged into the main branch to be made available in the next release?

@amyeroberts
Collaborator

@jpizarrom Was an issue created to track the failing test (cf. this comment: #29261 (comment))?

@jpizarrom
Contributor Author

@jpizarrom Was an issue created to track the failing test (cf. this comment: #29261 (comment))?

Not yet, I can do it, but I don't have my computer with me until the first week of September.

@amyeroberts amyeroberts merged commit 7591ca5 into huggingface:main Aug 27, 2024
25 checks passed
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
* add Blip2ForImageTextRetrieval

* use one line and remove unnecessary space in tests

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* use  value from the config, rather than hardcoded

* change order of params in Blip2QFormerModel.forward

* update docstring

* fix style

* update test_inference_opt

* move embeddings out of Blip2QFormerModel

* remove from_vision_qformer_configs

* remove autocast float16 in Blip2QFormerModel

* rename fields into vision_projection, text_projection, use_image_text_matching_head

* use CLIPOutput for  Blip2ImageTextMatchingModelOutput

* remove past_key_values_length from Blip2TextEmbeddings

* fix small typo in the CLIPOutput docstring

* add Blip2ForImageTextRetrieval to Zero Shot Image Classification mapping

* update docstring and add require_torch_fp16

* rollback test_inference_opt

* use use_image_text_matching_head=True in convert

* skip test_model_get_set_embeddings

* fix create_rename_keys error on new itm fields

* revert to do  scale after dot product between "query" and "key"

* fix ValueError on convert script for blip2-opt-2.7b

* update org of paths to Salesforce

* add is_pipeline_test_to_skip for VisualQuestionAnsweringPipelineTests

* [run_slow] blip_2

* removed Blip2ForImageTextRetrieval from IGNORE_NON_AUTO_CONFIGURED

* fix docstring of Blip2ImageTextMatchingModelOutput

* [run_slow] blip_2

* fix multi-gpu tests

* [run_slow] blip_2

* [run_slow] blip_2

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
dataKim1201 pushed a commit to dataKim1201/transformers that referenced this pull request Oct 7, 2024
@KilianKW

KilianKW commented Oct 9, 2024

@amyeroberts @jpizarrom Thanks a lot for adding this feature! I've noticed that the relevant model weights hosted here are still missing some license information. Do you need a dedicated ticket for that or is this post enough?

@jpizarrom
Contributor Author

@amyeroberts @jpizarrom Thanks a lot for adding this feature! I've noticed that the relevant model weights hosted here are still missing some license information. Do you need a dedicated ticket for that or is this post enough?

Hi, I am not sure; other BLIP-2 models like Salesforce/blip2-opt-2.7b-coco show MIT, but the LAVIS repo carries a BSD 3-Clause License.

@jpizarrom jpizarrom deleted the add_blip2_image_text_retrieval_model branch October 13, 2024 09:05