Adds CLIP to models exportable with ONNX #18515

Merged (21 commits, Aug 10, 2022). Showing changes from 10 commits.
1 change: 1 addition & 0 deletions docs/source/en/serialization.mdx
@@ -55,6 +55,7 @@ Ready-made configurations include the following architectures:
- BlenderbotSmall
- BLOOM
- CamemBERT
- CLIP
- CodeGen
- ConvBERT
- ConvNeXT
1 change: 0 additions & 1 deletion src/transformers/models/big_bird/modeling_flax_big_bird.py
@@ -1862,7 +1862,6 @@ def __call__(
output_hidden_states: bool = False,
return_dict: bool = True,
):

# Model
outputs = self.bert(
input_ids,
@@ -365,7 +365,6 @@ def bigbird_block_sparse_attention(
plan_num_rand_blocks,
output_attentions,
):

# BigBirdPegasus block-sparse attention as suggested in paper

# ITC:
@@ -2398,7 +2397,6 @@ def forward(
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, Seq2SeqModelOutput]:

# different to other models, BigBirdPegasus automatically creates decoder_input_ids from
# input_ids if no decoder_input_ids are provided
if decoder_input_ids is None and decoder_inputs_embeds is None:
16 changes: 14 additions & 2 deletions src/transformers/models/clip/__init__.py
@@ -29,7 +29,13 @@


_import_structure = {
"configuration_clip": ["CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP", "CLIPConfig", "CLIPTextConfig", "CLIPVisionConfig"],
"configuration_clip": [
"CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
"CLIPConfig",
"CLIPTextConfig",
"CLIPVisionConfig",
"CLIPOnnxConfig",
],
"tokenization_clip": ["CLIPTokenizer"],
}

@@ -95,7 +101,13 @@


if TYPE_CHECKING:
from .configuration_clip import CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP, CLIPConfig, CLIPTextConfig, CLIPVisionConfig
from .configuration_clip import (
CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
CLIPConfig,
CLIPTextConfig,
CLIPVisionConfig,
CLIPOnnxConfig,
)
from .tokenization_clip import CLIPTokenizer

try:
49 changes: 48 additions & 1 deletion src/transformers/models/clip/configuration_clip.py
@@ -16,8 +16,14 @@

import copy
import os
from typing import Union
from collections import OrderedDict

from typing import Any, Mapping, Union, Optional

from transformers import TensorType
from transformers.processing_utils import ProcessorMixin
Contributor:

TensorType and ProcessorMixin are only needed for type hinting. The way we manage imports in Transformers will return an error with this implementation; that is why some tests failed :)

To solve this, you need to put these two imports in a TYPE_CHECKING conditional statement; a sketch of the pattern is shown below.

Also, it's better to use relative imports, because absolute imports can lead to weird errors.

I made the exact same mistakes in the PR of LayoutLMv3, haha.
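(Sketch of the suggested pattern only; the example linked in the comment is not reproduced here, and the exact import paths below are inferred from this diff rather than copied from the merged file.)

```python
# Top of src/transformers/models/clip/configuration_clip.py (illustrative):
from typing import TYPE_CHECKING

from ...configuration_utils import PretrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging

if TYPE_CHECKING:
    # Only needed for type hints, so the lazy import machinery in Transformers
    # never has to resolve these at module load time.
    # (Relative imports resolve because this lives inside the transformers package.)
    from ...processing_utils import ProcessorMixin
    from ...utils import TensorType
```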

Contributor Author:

> I took a deeper look @unography! There are a few things to correct regarding imports and type hints to make tests pass. I also have comments about the changes to the forward method of the model.
>
> Besides, which version of black have you used to format the code? Did you use the command make style to do it? Because many files unrelated to this PR have been reformatted, and this should not have happened.

My black version is 22.6.0. I think I ran make fix-copies, but tests were still failing, so I ran black . from inside the CLIP folder. What is the correct way to fix this?

Contributor:

Try black 22.3, and use the command make style from the root of the repo to format the code.


from ...onnx import OnnxConfig
from ...configuration_utils import PretrainedConfig
from ...utils import logging

@@ -317,3 +323,44 @@ def to_dict(self):
output["vision_config"] = self.vision_config.to_dict()
output["model_type"] = self.__class__.model_type
return output


class CLIPOnnxConfig(OnnxConfig):
@property
def inputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("input_ids", {0: "batch", 1: "sequence"}),
("pixel_values", {0: "batch"}),
("attention_mask", {0: "batch", 1: "sequence"}),
]
)

@property
def outputs(self) -> Mapping[str, Mapping[int, str]]:
return OrderedDict(
[
("logits_per_image", {0: "batch"}),
("logits_per_text", {0: "batch"}),
("text_embeds", {0: "batch"}),
("image_embeds", {0: "batch"}),
]
)

@property
def atol_for_validation(self) -> float:
return 1e-4

def generate_dummy_inputs(
self,
processor: ProcessorMixin,
Contributor: You should use a forward reference here.

framework: Optional[TensorType] = None,
Contributor: You should use a forward reference here.

Contributor Author: By forward reference, do you mean making it like processor: "ProcessorMixin"?

Contributor: Yes, exactly.
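(Illustration of the suggestion, not necessarily the exact final form merged: with forward references the annotations become string literals, so ProcessorMixin and TensorType never need to be imported at runtime.)

```python
from typing import TYPE_CHECKING, Any, Mapping, Optional

if TYPE_CHECKING:  # resolved only by type checkers, never at runtime
    from transformers.processing_utils import ProcessorMixin
    from transformers.utils import TensorType


# Fragment of CLIPOnnxConfig.generate_dummy_inputs with string (forward-reference) annotations:
def generate_dummy_inputs(
    self,
    processor: "ProcessorMixin",
    framework: Optional["TensorType"] = None,
) -> Mapping[str, Any]:
    ...
```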

) -> Mapping[str, Any]:

text_input_dict = super().generate_dummy_inputs(processor.tokenizer, framework=framework)
image_input_dict = super().generate_dummy_inputs(processor.feature_extractor, framework=framework)
return {**text_input_dict, **image_input_dict}

@property
def default_onnx_opset(self) -> int:
return 14
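As a rough usage sketch (not part of the diff): this config plugs into the transformers.onnx export helpers. The export and validate_model_outputs calls and their argument names below are recalled from the transformers.onnx package of that era and may differ slightly between versions, so treat this as illustrative.

```python
from pathlib import Path

from transformers import CLIPModel, CLIPProcessor
from transformers.models.clip import CLIPOnnxConfig
from transformers.onnx import export, validate_model_outputs

ckpt = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

onnx_config = CLIPOnnxConfig(model.config)
onnx_path = Path("clip.onnx")

# export() builds dummy inputs via generate_dummy_inputs() above and traces the model.
onnx_inputs, onnx_outputs = export(
    preprocessor=processor,
    model=model,
    config=onnx_config,
    opset=onnx_config.default_onnx_opset,  # 14, per the property above
    output=onnx_path,
)

# Compares PyTorch and ONNX Runtime outputs within atol_for_validation (1e-4 here).
validate_model_outputs(onnx_config, processor, model, onnx_path, onnx_outputs, onnx_config.atol_for_validation)
```

Once CLIP is also registered with the features manager, the CLI route (python -m transformers.onnx --model=openai/clip-vit-base-patch32 onnx/) should produce the same artifact.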
5 changes: 3 additions & 2 deletions src/transformers/models/clip/modeling_clip.py
@@ -630,6 +630,7 @@ def forward(
if input_ids is None:
raise ValueError("You have to specify either input_ids")

input_ids = input_ids.to(torch.int) # for onnx compatibility, since onnx doesn't support int64
Contributor: The cast is only needed for the argmax, so it would be better to do it in the same call.
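(For illustration, the suggestion amounts to dropping the standalone cast above and casting only where the argmax is taken later in the same forward pass; the pooling line below is a sketch recalled from modeling_clip.py and may not match this revision exactly.)

```python
# pooled_output: take the features at the end-of-text token, i.e. the highest token id
# per sequence. Casting inside the call keeps the graph ONNX-exportable (ArgMax over
# int64 inputs is not supported by some ONNX backends) without touching input_ids elsewhere.
pooled_output = last_hidden_state[
    torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device),
    input_ids.to(torch.int).argmax(dim=-1),
]
```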

input_shape = input_ids.size()
input_ids = input_ids.view(-1, input_shape[-1])

@@ -1044,8 +1045,8 @@ def forward(
text_embeds = self.text_projection(text_embeds)

# normalized features
image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)
image_embeds = image_embeds / image_embeds.norm(p=2, dim=1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(p=2, dim=1, keepdim=True)
Contributor: I prefer not modifying the source code like this. Why did you need to do that?

Contributor Author: I can revert this; it was taken from the original repo: openai/CLIP@1937532

Contributor: Ah okay, I see. Yes, I would prefer to keep it as it is.
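(For context, a small check: the embeddings here are 2-D tensors of shape (batch, projection_dim), so dim=1 and dim=-1 refer to the same axis and the upstream change is purely cosmetic.)

```python
import torch

x = torch.randn(4, 512)  # (batch, projection_dim): 2-D, so dim=1 is the last axis
a = x / x.norm(p=2, dim=-1, keepdim=True)
b = x / x.norm(p=2, dim=1, keepdim=True)
assert torch.allclose(a, b)  # identical for 2-D embeddings
```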


# cosine similarity as logits
logit_scale = self.logit_scale.exp()
@@ -1059,7 +1059,6 @@ def forward(

loss = None
if labels is not None:

if labels.max() >= self.config.vocab_size:
raise ValueError(f"Label values must be <= vocab_size: {self.config.vocab_size}")

1 change: 0 additions & 1 deletion src/transformers/models/data2vec/modeling_data2vec_text.py
@@ -503,7 +503,6 @@ def forward(
past_key_value = past_key_values[i] if past_key_values is not None else None

if self.gradient_checkpointing and self.training:

if use_cache:
logger.warning(
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
@@ -151,7 +151,6 @@ def __init__(self, config: Data2VecVisionConfig) -> None:
self.dropout = nn.Dropout(config.hidden_dropout_prob)

def forward(self, pixel_values: torch.Tensor, bool_masked_pos: Optional[torch.BoolTensor] = None) -> torch.Tensor:

embeddings = self.patch_embeddings(pixel_values)
batch_size, seq_len, _ = embeddings.size()

2 changes: 0 additions & 2 deletions src/transformers/models/deberta_v2/modeling_tf_deberta_v2.py
@@ -94,7 +94,6 @@ def __init__(self, axis=-1, **kwargs):
self.axis = axis

def call(self, inputs: tf.Tensor, mask: tf.Tensor):

rmask = tf.logical_not(tf.cast(mask, tf.bool))
output = tf.where(rmask, float("-inf"), inputs)
output = stable_softmax(output, self.axis)
@@ -1021,7 +1020,6 @@ def call(
return_dict: Optional[bool] = None,
training: bool = False,
) -> Union[TFBaseModelOutput, Tuple[tf.Tensor]]:

if input_ids is not None and inputs_embeds is not None:
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
elif input_ids is not None:
2 changes: 0 additions & 2 deletions src/transformers/models/deit/modeling_deit.py
@@ -205,7 +205,6 @@ def __init__(self, config: DeiTConfig) -> None:
self.dropout = nn.Dropout(config.hidden_dropout_prob)

def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:

hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)

@@ -263,7 +262,6 @@ def __init__(self, config: DeiTConfig) -> None:
self.intermediate_act_fn = config.hidden_act

def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:

hidden_states = self.dense(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states)

2 changes: 0 additions & 2 deletions src/transformers/models/dpt/modeling_dpt.py
@@ -225,7 +225,6 @@ def __init__(self, config: DPTConfig) -> None:
self.dropout = nn.Dropout(config.hidden_dropout_prob)

def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:

hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)

@@ -284,7 +283,6 @@ def __init__(self, config: DPTConfig) -> None:
self.intermediate_act_fn = config.hidden_act

def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:

hidden_states = self.dense(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states)

1 change: 0 additions & 1 deletion src/transformers/models/electra/modeling_electra.py
@@ -564,7 +564,6 @@ def forward(
past_key_value = past_key_values[i] if past_key_values is not None else None

if self.gradient_checkpointing and self.training:

if use_cache:
logger.warning(
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
1 change: 1 addition & 0 deletions src/transformers/models/groupvit/modeling_groupvit.py
@@ -1102,6 +1102,7 @@ def forward(
if input_ids is None:
raise ValueError("You have to specify either input_ids")

input_ids = input_ids.to(torch.int) # for onnx compatibility, since onnx doesn't support int64
input_shape = input_ids.size()
input_ids = input_ids.view(-1, input_shape[-1])

1 change: 0 additions & 1 deletion src/transformers/models/hubert/modeling_hubert.py
@@ -1174,7 +1174,6 @@ def forward(

loss = None
if labels is not None:

if labels.max() >= self.config.vocab_size:
raise ValueError(f"Label values must be <= vocab_size: {self.config.vocab_size}")

9 changes: 0 additions & 9 deletions src/transformers/models/hubert/modeling_tf_hubert.py
@@ -314,7 +314,6 @@ def __init__(
self._check_axis()

def build(self, input_shape):

self._check_if_input_shape_is_none(input_shape)
self._set_number_of_groups_for_instance_norm(input_shape)
self._check_size_of_dimensions(input_shape)
@@ -326,7 +325,6 @@ def build(self, input_shape):
super().build(input_shape)

def call(self, inputs):

input_shape = tf.keras.backend.int_shape(inputs)
tensor_input_shape = tf.shape(inputs)

@@ -363,7 +361,6 @@ def compute_output_shape(self, input_shape):
return input_shape

def _reshape_into_groups(self, inputs, input_shape, tensor_input_shape):

group_shape = [tensor_input_shape[i] for i in range(len(input_shape))]
is_instance_norm = (input_shape[self.axis] // self.groups) == 1
if not is_instance_norm:
@@ -376,7 +373,6 @@ def _reshape_into_groups(self, inputs, input_shape, tensor_input_shape):
return inputs, group_shape

def _apply_normalization(self, reshaped_inputs, input_shape):

group_shape = tf.keras.backend.int_shape(reshaped_inputs)
group_reduction_axes = list(range(1, len(group_shape)))
is_instance_norm = (input_shape[self.axis] // self.groups) == 1
@@ -428,7 +424,6 @@ def _set_number_of_groups_for_instance_norm(self, input_shape):
self.groups = dim

def _check_size_of_dimensions(self, input_shape):

dim = input_shape[self.axis]
if dim < self.groups:
raise ValueError(
@@ -449,19 +444,16 @@ def _check_size_of_dimensions(self, input_shape):
)

def _check_axis(self):

if self.axis == 0:
raise ValueError(
"You are trying to normalize your batch axis. Do you want to use tf.layer.batch_normalization instead"
)

def _create_input_spec(self, input_shape):

dim = input_shape[self.axis]
self.input_spec = tf.keras.layers.InputSpec(ndim=len(input_shape), axes={self.axis: dim})

def _add_gamma_weight(self, input_shape):

dim = input_shape[self.axis]
shape = (dim,)

Expand All @@ -477,7 +469,6 @@ def _add_gamma_weight(self, input_shape):
self.gamma = None

def _add_beta_weight(self, input_shape):

dim = input_shape[self.axis]
shape = (dim,)

1 change: 0 additions & 1 deletion src/transformers/models/layoutlm/modeling_layoutlm.py
@@ -471,7 +471,6 @@ def forward(
past_key_value = past_key_values[i] if past_key_values is not None else None

if self.gradient_checkpointing and self.training:

if use_cache:
logger.warning(
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
@@ -534,7 +534,6 @@ def _batch_encode_plus(
return_length: bool = False,
verbose: bool = True,
) -> BatchEncoding:

if not isinstance(batch_text_or_text_pairs, list):
raise TypeError(f"batch_text_or_text_pairs has to be a list (got {type(batch_text_or_text_pairs)})")

1 change: 0 additions & 1 deletion src/transformers/models/longt5/modeling_longt5.py
@@ -232,7 +232,6 @@ def __init__(self, hidden_size, eps=1e-6):
self.variance_epsilon = eps

def forward(self, hidden_states):

# LongT5 uses a layer_norm which only scales and doesn't shift, which is also known as Root Mean
# Square Layer Normalization https://arxiv.org/abs/1910.07467 thus varience is calculated
# w/o mean and there is no bias. Additionally we want to make sure that the accumulation for
2 changes: 0 additions & 2 deletions src/transformers/models/mobilebert/modeling_mobilebert.py
@@ -1342,7 +1342,6 @@ def forward(
)
# Copied from transformers.models.bert.modeling_bert.BertForQuestionAnswering with Bert->MobileBert all-casing
class MobileBertForQuestionAnswering(MobileBertPreTrainedModel):

_keys_to_ignore_on_load_unexpected = [r"pooler"]

def __init__(self, config):
@@ -1548,7 +1547,6 @@ def forward(
)
# Copied from transformers.models.bert.modeling_bert.BertForTokenClassification with Bert->MobileBert all-casing
class MobileBertForTokenClassification(MobileBertPreTrainedModel):

_keys_to_ignore_on_load_unexpected = [r"pooler"]

def __init__(self, config):
1 change: 0 additions & 1 deletion src/transformers/models/nezha/modeling_nezha.py
@@ -579,7 +579,6 @@ def forward(
past_key_value = past_key_values[i] if past_key_values is not None else None

if self.gradient_checkpointing and self.training:

if use_cache:
logger.warning(
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
2 changes: 0 additions & 2 deletions src/transformers/models/plbart/modeling_plbart.py
@@ -1041,7 +1041,6 @@ def forward(
past_key_value = past_key_values[idx] if past_key_values is not None else None

if self.gradient_checkpointing and self.training:

if use_cache:
logger.warning(
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
@@ -1066,7 +1065,6 @@ def custom_forward(*inputs):
None,
)
else:

layer_outputs = decoder_layer(
hidden_states,
attention_mask=attention_mask,
1 change: 0 additions & 1 deletion src/transformers/models/realm/modeling_realm.py
@@ -579,7 +579,6 @@ def forward(
past_key_value = past_key_values[i] if past_key_values is not None else None

if self.gradient_checkpointing and self.training:

if use_cache:
logger.warning(
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
1 change: 0 additions & 1 deletion src/transformers/models/roberta/modeling_roberta.py
@@ -503,7 +503,6 @@ def forward(
past_key_value = past_key_values[i] if past_key_values is not None else None

if self.gradient_checkpointing and self.training:

if use_cache:
logger.warning(
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
1 change: 0 additions & 1 deletion src/transformers/models/sew/modeling_sew.py
@@ -1054,7 +1054,6 @@ def forward(

loss = None
if labels is not None:

if labels.max() >= self.config.vocab_size:
raise ValueError(f"Label values must be <= vocab_size: {self.config.vocab_size}")
