[PixtralLarge] Update Pixtral conversion script to support large format! #34801

Merged
46 commits merged into main from pixtral-large-script on Jan 8, 2025

Conversation

ArthurZucker (Collaborator)

What does this PR do?

Updates the conversion script

Rocketknight1 force-pushed the pixtral-large-script branch 2 times, most recently from aba20bf to 24d9ee5 on December 23, 2024 12:04
Rocketknight1 marked this pull request as ready for review on December 23, 2024 16:11
Rocketknight1 (Member) commented Dec 23, 2024

This should be just about ready! Quick summary of the changes:

  • Made sure eps values and activations were handled correctly during conversion
  • Made sure the tokenizer gets special tokens assigned correctly during conversion
  • Made the biases in the multimodal projector a config flag (they are enabled in Pixtral-12B but not in Pixtral-Large)
  • BatchMixFeature.to() was buggy when the input was a nested list (changes pulled in from Fix case of nested tensors in BatchMixFeature #35063)
  • PixtralProcessor made some strange assumptions when lists were passed (changes pulled in from Fix the structure of images in PixtralProcessor #35107)
  • Added some float32 upcasts in Pixtral attention to match the behaviour of the vLLM reference code (vLLM uses xformers, which has a custom kernel that does those computations internally in float32)
  • The Pixtral chat template needed a lot of rewrites:
    • System message handling, including the datetime
    • Undocumented behaviour in mistral-common: when a message has exactly one text chunk plus one or more images, the text is moved to the end, after the image tokens, even if that isn't the order of the chunks the user passed in. When there are multiple text chunks, the original order is kept. If we don't get this exactly right, model generations are garbage (see the sketch after this list).
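
To make that ordering rule concrete, here is a minimal Python sketch of it. It is written for this summary rather than taken from the actual Jinja template; the chunk dicts and the reorder_chunks helper are hypothetical:

def reorder_chunks(chunks):
    # Hypothetical helper mirroring the mistral-common rule described above.
    # `chunks` is a list of dicts like {"type": "text", "text": ...} or {"type": "image", ...}.
    text_chunks = [c for c in chunks if c["type"] == "text"]
    image_chunks = [c for c in chunks if c["type"] == "image"]
    if len(text_chunks) == 1:
        # Exactly one text chunk: render it after all of the image tokens.
        return image_chunks + text_chunks
    # Multiple text chunks: keep the user's original ordering.
    return chunks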

TODO:

  • Should we use xformers instead of manual float32 attention? It would be more accurate + faster, but would add a dependency to the model.
  • Make sure the conversion script still works for older Pixtral-12B.
  • Make sure @zucchini-nlp is okay with the changes in the Processor.

Comment on lines +70 to +83
def _recursive_to(obj, device, *args, **kwargs):
    # Lists can be nested, so keep digging until we hit tensors
    if isinstance(obj, list):
        return [_recursive_to(o, device, *args, **kwargs) for o in obj]
    # We cast only floating point tensors to avoid issues with tokenizers casting `LongTensor` to `FloatTensor`
    elif isinstance(obj, torch.Tensor) and torch.is_floating_point(obj):
        # cast and send to device
        return obj.to(*args, **kwargs)
    elif isinstance(obj, torch.Tensor) and device is not None:
        # only send to device, don't cast
        return obj.to(device=device)
    else:
        return obj

Member

Note to reviewer: The previous BatchFeature.to() actually flattened the structure of nested inputs, which created several bugs! This fix preserves nested structure
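
For illustration, a quick way to see the behaviour (not part of the PR; the tensor shapes are made up, and it assumes the _recursive_to above has been pasted into scope):

import torch

# A nested pixel_values-style input: one inner list of image tensors per sample.
nested = [[torch.rand(3, 16, 16)], [torch.rand(3, 16, 16), torch.rand(3, 32, 32)]]
moved = _recursive_to(nested, None, dtype=torch.float16)

# The list-of-lists structure is preserved (the old BatchFeature.to() flattened it),
# and only floating point tensors get cast.
assert isinstance(moved[1], list) and len(moved[1]) == 2
assert moved[0][0].dtype == torch.float16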

Comment on lines +208 to +222
    if isinstance(text, str) or isinstance(text, list) and len(text) == 1:
        # If there's a single sample, the image must belong to it
        images = [[images]]
    else:
        raise ValueError(
            "You have supplied multiple text samples, but `images` is not a nested list. When processing multiple samples, `images` should be a list of lists of images, one list per sample."
        )
elif isinstance(images, list) and is_image_or_image_url(images[0]):
    if isinstance(text, str) or isinstance(text, list) and len(text) == 1:
        # If there's a single sample, all images must belong to it
        images = [images]
    else:
        raise ValueError(
            "You have supplied multiple text samples, but `images` is not a nested list. When processing multiple samples, `images` should be a list of lists of images, one list per sample."
        )
Member

Note to reviewer: Previously there were a lot of edge cases when users passed a single list of images. In some cases, the processor interpreted this as one image per sample rather than a list of images for one sample. This code avoids these error-prone inferences.
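
As a rough illustration of the convention this enforces (the checkpoint id, prompt strings, and images below are placeholders, not taken from the PR):

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("mistral-community/pixtral-12b")  # placeholder checkpoint
img_a, img_b, img_c = (Image.new("RGB", (64, 64)) for _ in range(3))

# One text sample: a flat list of images means "all of these belong to that sample".
single = processor(text="[INST][IMG][IMG]Describe both images.[/INST]", images=[img_a, img_b])

# Several text samples: `images` must be a list of lists, one inner list per sample;
# anything else now raises the ValueError above instead of guessing.
batch = processor(
    text=["[INST][IMG]First.[/INST]", "[INST][IMG][IMG]Second.[/INST]"],
    images=[[img_a], [img_b, img_c]],
    padding=True,
)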

Comment on lines -494 to +506
-patch_embeds_list = [self.patch_conv(img.unsqueeze(0).to(self.dtype)) for img in pixel_values]
+if len(pixel_values) > 1:
+    raise ValueError("Batching/padding not supported yet!")
+patch_embeds_list = [self.patch_conv(img.to(self.dtype)) for sample in pixel_values for img in sample]

 # flatten to a single sequence
-patch_embeds = torch.cat([p.flatten(2).permute(0, 2, 1) for p in patch_embeds_list], dim=1)
+patch_embeds = torch.cat([p.flatten(1).T for p in patch_embeds_list], dim=0).unsqueeze(0)
 patch_embeds = self.ln_pre(patch_embeds)

 # positional embeddings
 position_ids = position_ids_in_meshgrid(
     patch_embeds_list, max_width=self.config.image_size // self.config.patch_size
 ).to(self.device)

 position_embedding = self.patch_positional_embedding(patch_embeds, position_ids)

Member

Note to reviewer: These changes are here to handle images being passed in as a list of lists now. Previously, images were passed in as a flat list even though the processor output a list of lists. The only reason this didn't cause an error was because the bug in BatchFeature.to() silently fixed the list structure and made it match the modeling code 😓
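
A rough shape walkthrough of the new flattening path, with made-up sizes and a bare Conv2d standing in for self.patch_conv (assumes a recent PyTorch that accepts unbatched 3D input to Conv2d):

import torch

hidden, patch = 8, 16
patch_conv = torch.nn.Conv2d(3, hidden, kernel_size=patch, stride=patch)

# pixel_values is now a list of lists: one inner list of image tensors per sample.
pixel_values = [[torch.rand(3, 64, 64), torch.rand(3, 32, 48)]]

patch_embeds_list = [patch_conv(img) for sample in pixel_values for img in sample]
# per-image shapes: (hidden, 4, 4) and (hidden, 2, 3)

patch_embeds = torch.cat([p.flatten(1).T for p in patch_embeds_list], dim=0).unsqueeze(0)
print(patch_embeds.shape)  # torch.Size([1, 22, 8]): 16 + 6 patches joined into one sequence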

Rocketknight1 (Member)

This should be ready for final review @ArthurZucker! I did ablation testing and reverted some of the dtype changes in modeling_pixtral.py - the results seem good without them and performance/memory improves.

ArthurZucker (Collaborator, Author) left a comment

Let's roll! A todo is to add another test for the new model 😉 Good to go otherwise

Comment on lines +67 to +79
def _recursive_to(obj, device, *args, **kwargs):
    # Lists can be nested, so keep digging until we hit tensors
    if isinstance(obj, list):
        return [_recursive_to(o, device, *args, **kwargs) for o in obj]
    # We cast only floating point tensors to avoid issues with tokenizers casting `LongTensor` to `FloatTensor`
    elif isinstance(obj, torch.Tensor) and torch.is_floating_point(obj):
        # cast and send to device
        return obj.to(*args, **kwargs)
    elif isinstance(obj, torch.Tensor) and device is not None:
        # only send to device, don't cast
        return obj.to(device=device)
    else:
        return obj
Collaborator Author

Should probably be fixed on the parent class

ArthurZucker merged commit 3f483be into main Jan 8, 2025
18 checks passed
ArthurZucker deleted the pixtral-large-script branch January 8, 2025 16:39