[Model][Bugfix] Fix batching with multi-image in PixtralHF #9518
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Hi @DarkLight1337, I've improved the change now by always producing a list of 3D tensors from …
The number of dimensions in …
Here is the structure from one of the batches in the description. You can see that there is an overall list with one entry for each request, but there may be nested lists with tensors of different rank if the images do not have the same shape.
I use this function to print the structure of `pixel_values`:

```python
import torch

def print_image_structure(images, depth=0):
    indent = " " * depth
    if isinstance(images, torch.Tensor):
        print(f"{indent}Tensor shape: {images.shape}")
    elif isinstance(images, list):
        print(f"{indent}List length: {len(images)}")
        for i, item in enumerate(images):
            print(f"{indent}Item {i}:")
            print_image_structure(item, depth + 1)
    else:
        print(f"{indent}Unexpected type: {type(images)}")
```
I see, thanks for the explanation! So the overall rank is the same (5-D input), but even after …
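As a concrete illustration of the shape-mismatch issue discussed above (a minimal sketch, not code from this PR; numpy arrays stand in for torch tensors):

```python
import numpy as np

img_a = np.zeros((3, 688, 1024))  # image as (C, H, W)
img_b = np.zeros((3, 704, 1024))  # same batch, but a different height

try:
    # Stacking into one batched tensor requires identical shapes.
    batched = np.stack([img_a, img_b])
except ValueError:
    # Mismatched shapes cannot be stacked, so we must fall back to
    # keeping a list of separate per-image 3-D tensors.
    batched = [img_a, img_b]

print(type(batched))  # list: fell back to per-image tensors
```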
Before this PR, PixtralHF would fail during forward passes where `pixel_values` would be a list of lists of tensors. This can happen when multiple requests that each have multiple images are batched together. Since each image can have a different shape, we end up with many separate tensors. Due to how the HF processor works, when we call `_parse_and_validate_image_input` we can receive `pixel_values` as inputs that are either a plain Tensor, a list of Tensors, or some mismatched list of Tensors/lists. This PR adds a normalization pass to `pixel_values` such that all of the image tensors from all requests are unrolled into a list of unbatched 3D tensors. This is the simplest way to avoid the complexity of some sub-requests batching and some not. It could carry a performance penalty in cases where we could have used batched tensors, but in practice it should be uncommon for images to have exactly the same size.

I will provide some before and after examples of the kinds of `pixel_values` we receive and how we now normalize them to a simple `List[torch.Tensor]`.

Example 1: 4 images of size 1024x1024 (this is the KV cache memory profiling pass when `limit_mm_per_prompt={"image": 4}`)

Example 2: Batch of 3 requests: 1. with 1 image of 688x1024; 2. with 3 images of 688x1024 and 1 image of 704x1024; and 3. with 2 images of 688x1024
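The unrolling described above can be sketched roughly as follows (a minimal illustration, not the actual vLLM code; numpy arrays stand in for torch tensors, and `flatten_to_3d` is a hypothetical helper name):

```python
import numpy as np

def flatten_to_3d(pixel_values):
    """Unroll an arbitrarily nested mix of lists and batched arrays
    into a flat list of unbatched 3-D arrays, one per image."""
    if isinstance(pixel_values, list):
        out = []
        for item in pixel_values:
            out.extend(flatten_to_3d(item))
        return out
    if pixel_values.ndim == 3:  # already a single (C, H, W) image
        return [pixel_values]
    # Batched array of rank > 3: split along the leading dim and recurse.
    return flatten_to_3d(list(pixel_values))

# Example 1 above: one 5-D batch of shape (batch, images, C, H, W)
out = flatten_to_3d(np.zeros((1, 4, 3, 1024, 1024)))
print(len(out))  # 4 unbatched images
```

Because every leaf ends up as a rank-3 tensor, downstream code only ever has to handle a flat `List[torch.Tensor]`, regardless of how the requests were batched.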
Minimal reproducible edge case
I also used this custom case for testing offline batching with multiple images of both the same and different sizes:
Output:
Validation
With this PR, I am able to reproduce the MMMU benchmark for Pixtral: