[RFC]: Adding support for Geospatial models #11065
Comments
Hey @christian-pinto! Thank you for the interest and for putting the effort into this RFC! First of all, I would like to clarify that when it comes to model support, IMHO it is less about whether we're generating text versus other modalities, and more about whether the underlying backbone of the architecture is an autoregressive decoder-only transformer with causal attention. This was the main motivation when we first added multimodal inputs & models a while back, and it is what every multimodal model on vLLM belongs to today (with the only exception being multimodal Llama 3.2, which was quite challenging to support). Therefore, I do not expect vLLM to provide significant performance improvements for models that don't fall into this category.

Now moving on to infra support for generating non-text modalities. This was something I also seriously thought about when I ported Chameleon to vLLM (an autoregressive VLM that's able to generate not only text tokens but also image tokens to be decoded into images). I do see the value of consolidating use cases onto a single platform, so I'm not particularly against opening up support for image generation. However, there are a few things I'm concerned about that we should think through before we proceed:
Hope this makes sense and let me know what you think!
Hi @ywang96, many thanks for your comments. I do understand that most of the optimizations in vLLM are not going to provide any immediate benefit for models that are not autoregressive with causal attention. This RFC is in fact coming from the need to support multiple types of models with the same software stack. My hope is that by adding support for "different" types of models, at some point we might start seeing optimizations land outside the autoregressive decoder-only umbrella as well. To reply to your points one by one:
I would start with offline inference, to get the main pieces in place, and then extend to online as well. As for the user interface, at the moment I have in mind something like the below (targeting the model mentioned in this RFC):

```python
model = LLM(model="./test_model", skip_tokenizer_init=True)  # Perhaps a different entry point could be defined since these are not LLMs anymore

input_image = read_image("/path/to/the/image")
image_patches: List[torch.Tensor] = process_image(input_image)

model_outputs = model.transform(image_patches)

output_patches: List[List[torch.Tensor]] = []
for output in model_outputs:
    output_patches.append(output.outputs.tensors)  # this would be an extension to the PoolingOutput class (different from the one I propose on top)

process_output(output_patches, "/path/to/output/image")
```

Worth noting in the above snippet are:
```python
class PoolingOutput:
    """The output data of one pooling output of a request.

    Args:
        embedding: The embedding vector, which is a list of floats. The
            length of the vector depends on the model as listed in the embedding guide.
    """
    embedding: Optional[List[float]]
    tensors: Optional[List[torch.Tensor]]  # new: generic tensor output proposed in this RFC
```

Basically, for a first implementation I would heavily re-use what is already there for pooling models and avoid introducing heavy changes to the current structure; that would also take care of "disabling" the autoregression (see embedding models). In a second phase I would propose creating a proper abstraction for these models, where there is a dedicated class for image output and users can register an output image processor that converts the output of the model back into an image. Basically the same user interface, but developers will have to decorate their model with something like an output-processor registration (see Phase 2, Step 2 of the proposal below).
Or perhaps define an ... so that, in this case, the user would interact with the model differently, as the processing of input and output would be performed inside vLLM using the input and output processors registered for the model.
Absolutely, totally understandable and I agree.
I would be able to maintain this part for the foreseeable future.
As a first step, I have created #11129 to enable different types of pooling outputs. PTAL!
Thanks @DarkLight1337, that is exactly along the lines of what I had in mind for modifying the output of pooling models. I will submit a PR with the new model in a couple of weeks, right after the Christmas vacation break.
Motivation.
Modern models no longer target only the generation of text: many also generate images from text or image input. This RFC aims to open the door to supporting models that generate not only text but also images, either as a single output modality or as part of multi-modal output. One example of great interest to us is a set of models developed in co-operation with NASA for earth observation (https://huggingface.co/ibm-nasa-geospatial/Prithvi-100M): they work on satellite images and can be fine-tuned for several tasks, including flood forecasting, crop classification, etc.
This example model works on fixed-size input images and generates an image of the same size as output. Specifically, input images in the GeoTIFF format are split into patches of 224×224 pixels, and each patch is passed through the model, which generates a tensor of the same size as its input. This is similar in a way to an autoregressive process, with the difference that at every iteration the data passed to the model is different and there is no relationship between subsequent patches. All the output patches are then re-assembled into a GeoTIFF image.
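To make that flow concrete, below is a minimal sketch of the split/reassemble logic. This is not the actual Prithvi pre/post-processing; the padding strategy and the single-tensor image layout are assumptions for illustration, and only the 224×224 tile size comes from the description above.

```python
# Minimal sketch of the patch/reassemble flow described above.
# Assumes the image is already a tensor; the real pipeline reads/writes GeoTIFF.
from typing import List, Tuple

import torch

PATCH = 224


def split_into_patches(image: torch.Tensor) -> Tuple[List[torch.Tensor], Tuple[int, int]]:
    """Pad the image to a multiple of PATCH and cut it into PATCH x PATCH tiles."""
    h, w = image.shape[-2:]
    pad_h, pad_w = (-h) % PATCH, (-w) % PATCH
    padded = torch.nn.functional.pad(image, (0, pad_w, 0, pad_h))
    patches = [
        padded[..., i:i + PATCH, j:j + PATCH]
        for i in range(0, padded.shape[-2], PATCH)
        for j in range(0, padded.shape[-1], PATCH)
    ]
    return patches, (h, w)


def reassemble(patches: List[torch.Tensor], original_hw: Tuple[int, int]) -> torch.Tensor:
    """Stitch same-sized output tiles back together and crop to the original size."""
    h, w = original_hw
    rows = (h + PATCH - 1) // PATCH
    cols = (w + PATCH - 1) // PATCH
    grid = [torch.cat(patches[r * cols:(r + 1) * cols], dim=-1) for r in range(rows)]
    return torch.cat(grid, dim=-2)[..., :h, :w]
```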
The goal of this RFC is to enable non-text output and to demonstrate it with the above-mentioned model.
Why support models in vLLM that do not generate text? Because consolidating onto a single serving platform simplifies the software stack for those dealing with multiple types of models. Also, over time, models not targeting text might benefit from optimizations introduced by the vLLM community, similarly to what has been happening for Transformer-based causal models.
Proposed Change.
I propose a two-phase approach. In the first phase, integrate the model as a pooling model and pre/post-process input and output data outside of vLLM. In the second phase, perform a proper integration of the model, also taking care of processing the input image and generating the output one.
Phase 1: Basic enablement of Geospatial model in vLLM
Pre/post-processing of the input image is done outside of vLLM. The input image is broken down into patches (generic tensors), and all patches are fed into vLLM. Output tensors are collected, and post-processing re-creates the output image.
For this phase we could piggyback on the support available for pooling models (thanks @DarkLight1337 for the suggestion), where the hidden states of the model are returned as output.
Changes for phase 1:
Step 1
Extend the output type for pooling models, which currently only targets embeddings, to also support a generic output type. This output would then be post-processed outside of vLLM.
Step 2
Right now, the only two methods that pooling models can execute are `encode` and `score`. Would it make sense to define a third one, like `transform`? This would just be for the sake of not using `encode`. Also, would it make sense to create a new entrypoint class in addition to LLM, something like VisionModel or similar? This is again for the sake of completeness, since this is not a language model.
Step 3
Exploit the batching capabilities of vLLM and present all the image patches to the vLLM entrypoint as a list of generic tensors, similar to what is done now when presenting multiple prompts at a time (see the sketch below).
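For illustration, here is a rough sketch of the interface shape Steps 2 and 3 point at. `VisionModel`, `transform` and `TensorOutput` are placeholder names from this proposal, not existing vLLM APIs, and the stub below only demonstrates the calling convention.

```python
# Sketch of the entrypoint shape proposed in Steps 2-3; all names are placeholders.
from dataclasses import dataclass
from typing import Callable, List

import torch


@dataclass
class TensorOutput:
    """Generic (non-embedding) pooling-style output for one input patch."""
    tensors: List[torch.Tensor]


class VisionModel:
    """Stand-in for a possible non-LLM entrypoint alongside `LLM`."""

    def __init__(self, forward_fn: Callable[[torch.Tensor], torch.Tensor]):
        self._forward_fn = forward_fn

    def transform(self, patches: List[torch.Tensor]) -> List[TensorOutput]:
        # All patches are presented in one call so the engine can batch them,
        # analogous to passing multiple prompts to generate()/encode().
        return [TensorOutput(tensors=[self._forward_fn(p)]) for p in patches]


# Usage example with an identity "model" over four 224x224 patches.
model = VisionModel(forward_fn=lambda p: p)
outputs = model.transform([torch.zeros(6, 224, 224) for _ in range(4)])
print(len(outputs), outputs[0].tensors[0].shape)
```

The point is simply that all patches are handed over together, so the existing batching machinery can be reused instead of adding any new scheduling logic.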
Phase 2: Optimized integration of the model
Embed pre/post processing of images into vLLM and handle the recursive pattern for processing bigger images in vLLM.
(This phase might need to be updated/changed depending on the outcome of Phase 1)
Step 1
Integrate processing of the input with the already available multimodal input support. Among the things to be considered here is that an input image could be presented encoded as a string instead of being stored in a file (see the sketch below).
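As a small illustration of that requirement, a helper (purely illustrative, not a proposed API) that accepts either form:

```python
# Illustrative helper: accept an image either as a file path or as a
# base64-encoded string, as mentioned in Step 1. Uses only the stdlib.
import base64
import os


def load_image_bytes(image: str) -> bytes:
    """Return raw image bytes from a path or a base64-encoded payload."""
    if os.path.exists(image):
        with open(image, "rb") as f:
            return f.read()
    # Otherwise treat the string as base64 data (e.g. coming from an API request).
    return base64.b64decode(image)
```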
Step 2
Introduce the possibility of "installing" an output processor that generates images in the required format, in the same spirit as what is done for input processors via `@INPUT_REGISTRY.register_input_processor()`. The idea would be to create an output registry and enable models to register an output processor, so that all the output generated for a sequence can be converted into the proper image format for the specific model.
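One possible shape for such a registry, mirroring the decorator style of `@INPUT_REGISTRY.register_input_processor()`; `OUTPUT_REGISTRY` and its methods are sketched proposals, not existing vLLM code.

```python
# Hypothetical OUTPUT_REGISTRY sketch: mirrors the input-processor decorator
# pattern; none of these names exist in vLLM today.
from typing import Callable, Dict, List, Type

import torch

# An output processor turns the raw tensors produced for one request into
# the bytes of an image in the model's target format (e.g. GeoTIFF).
OutputProcessor = Callable[[List[torch.Tensor]], bytes]


class OutputRegistry:
    def __init__(self) -> None:
        self._processors: Dict[Type, OutputProcessor] = {}

    def register_output_processor(self, processor: OutputProcessor):
        """Class decorator associating a model with its output processor."""
        def wrapper(model_cls: Type) -> Type:
            self._processors[model_cls] = processor
            return model_cls
        return wrapper

    def get_output_processor(self, model_cls: Type) -> OutputProcessor:
        return self._processors[model_cls]


OUTPUT_REGISTRY = OutputRegistry()
```

A model would then opt in with `@OUTPUT_REGISTRY.register_output_processor(my_geotiff_processor)` on its class definition, where `my_geotiff_processor` stands for whatever converter the model author provides.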
Step 3
Create a new output class that allows the output to be presented in the form of an image. We could call it ImageOutput and ImageRequestOutput. Users would be able to either post-process the model output and return a string containing the path to the generated file, or return the raw image output for post-processing outside of vLLM (see the sketch below).
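A rough sketch of what these classes could carry, using the `image_data`/`image_path` fields mentioned in Step 4 below; all names and fields are tentative.

```python
# Tentative sketch of the proposed output classes; field names follow the
# text of this RFC (image_data / image_path) and may well change.
from dataclasses import dataclass, field
from typing import List, Optional

import torch


@dataclass
class ImageOutput:
    """Raw model output for one image, prior to any post-processing."""
    tensors: List[torch.Tensor] = field(default_factory=list)


@dataclass
class ImageRequestOutput:
    request_id: str
    # Raw output; optional when post-processing happened inside vLLM.
    image_data: Optional[ImageOutput] = None
    # Path to the image written by the registered output processor, if any.
    image_path: Optional[str] = None
```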
Step 4
Handle recursive processing of image patches within vLLM. Each image fed to vLLM is pre-processed and split into patches. All patches are run through the model, and all the output patches are handled by the output processor. Could we re-use some of the logic used for handling autoregressive queries? In this case we would already know how many times model inference should be executed (the number of image patches), and there is no need to append the output of one iteration to the input of the next; we just feed the next patch, and so on.
The output of the request in this case will still be of type `ImageRequestOutput`, with the `image_data` field optional and `image_path` populated with the path to the image generated during post-processing.
Feedback Period.
2 weeks
CC List.
@njhill @ywang96 @DarkLight1337 @robertgshaw2-neuralmagic
Any Other Things.
No response