[RFC]: Adding support for Geospatial models #11065
Comments
Hey @christian-pinto! Thank you for the interest and for putting the effort into this RFC! First of all, I would like to clarify that when it comes to model support, IMHO it is less about whether we're generating text versus other modalities, and more about whether the underlying backbone of the architecture is an autoregressive decoder-only transformer with causal attention. This was the main motivation when we first added multimodal inputs & models a while back, and it is what every multimodal model on vLLM belongs to today (with the only exception being multimodal Llama 3.2, which was quite challenging to support). Therefore, I do not expect vLLM to provide significant performance improvements for models that don't fall into this category.

Now moving on to infra support for generating non-text modalities. This was something I also seriously thought about when I ported Chameleon to vLLM (an autoregressive VLM that's able to generate not only text tokens but also image tokens to be decoded into images). I do see the value of consolidating use cases onto a single platform, so I'm not particularly against opening up support for image generation. However, there are a few things I'm concerned about that we should think through before we proceed:
Hope this makes sense and let me know what you think!
Hi @ywang96, many thanks for your comments. I do understand that most of the optimizations in vLLM are not going to provide any immediate benefit for models that are not autoregressive with causal attention. This RFC is in fact coming from the need to support multiple types of models with the same software stack. My hope is that by adding support for "different" types of models, at some point we might start seeing optimizations land outside the autoregressive decoder-only umbrella as well. To reply to your points one by one:
I would start with offline inference, to get the main pieces in place, and then extend to online as well. As for the user interface, at the moment I have in mind something like the below (targeting the model mentioned in this RFC):

```python
model = LLM(model="./test_model", skip_tokenizer_init=True)  # Perhaps a different entry point could be defined since these are not LLMs anymore

input_image = read_image("/path/to/the/image")
image_patches: List[torch.Tensor] = process_image(input_image)

model_outputs = model.transform(image_patches)

output_patches: List[List[torch.Tensor]] = []
for output in model_outputs:
    output_patches.append(output.outputs.tensors)  # this would be an extension to the PoolingOutput class (different from the one I propose on top)

process_output(output_patches, "/path/to/output/image")
```

Worth noting in the above snippet are:
```python
class PoolingOutput:
    """The output data of one pooling output of a request.

    Args:
        embedding: The embedding vector, which is a list of floats. The
            length of the vector depends on the model as listed in the embedding guide.
    """
    embedding: Optional[List[float]]
    tensors: Optional[List[torch.Tensor]]  # new: generic tensor output proposed in this RFC
```

Basically, for a first implementation I would heavily re-use what is already there for pooling models and avoid introducing heavy changes to the current structure; that would also take care of "disabling" the autoregression (see embedding models). In a second phase I would propose creating a proper abstraction for these models, where there is a dedicated class for image output and users can register an output image processor that converts the output of the model back into an image. Basically the same user interface, but developers will have to decorate their model with something like an output-processor registration (see Phase 2, Step 2 of the proposal below).
Or perhaps define an ... so that, in this case, the user would interact with the model differently, as the processing of input and output would be performed inside vLLM using the input and output processors registered for the model.
Absolutely, totally understandable and I agree.
I would be able to maintain this part for the foreseeable future.
As a first step, I have created #11129 to enable different types of pooling outputs. PTAL!
Thanks @DarkLight1337, that is exactly along the lines of what I had in mind for modifying the output of pooling models. I will submit a PR with the new model in a couple of weeks, right after the Christmas vacation break.
Motivation.
Modern models no longer target only the generation of text: many also generate images from text or image input. This RFC aims to open the door to supporting models that generate not only text but also images, either as a single output modality or as part of multi-modal output. One example of great interest to us is a set of models developed in co-operation with NASA for earth observation (https://huggingface.co/ibm-nasa-geospatial/Prithvi-100M): they work on satellite images and can be fine-tuned for several tasks, including flood forecasting, crop classification, etc.
This example model works on fixed-size input images and generates an image of the same size as output. Specifically, input images in the GeoTIFF format are split into patches of 224×224 pixels, and each patch is passed through the model, which generates a tensor of the same size as its input. This is similar in a way to an autoregressive process, with the difference that at every iteration the data passed to the model is different and there is no relationship between subsequent patches. All the output patches are then re-assembled into a GeoTIFF image.
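To make that flow concrete, below is a minimal sketch of the split/reassemble logic. This is not the actual Prithvi pre/post-processing; the padding strategy and the single-tensor image layout are assumptions for illustration, and only the 224×224 tile size comes from the description above.

```python
# Minimal sketch of the patch/reassemble flow described above.
# Assumes the image is already a tensor; the real pipeline reads/writes GeoTIFF.
from typing import List, Tuple

import torch

PATCH = 224


def split_into_patches(image: torch.Tensor) -> Tuple[List[torch.Tensor], Tuple[int, int]]:
    """Pad the image to a multiple of PATCH and cut it into PATCH x PATCH tiles."""
    h, w = image.shape[-2:]
    pad_h, pad_w = (-h) % PATCH, (-w) % PATCH
    padded = torch.nn.functional.pad(image, (0, pad_w, 0, pad_h))
    patches = [
        padded[..., i:i + PATCH, j:j + PATCH]
        for i in range(0, padded.shape[-2], PATCH)
        for j in range(0, padded.shape[-1], PATCH)
    ]
    return patches, (h, w)


def reassemble(patches: List[torch.Tensor], original_hw: Tuple[int, int]) -> torch.Tensor:
    """Stitch same-sized output tiles back together and crop to the original size."""
    h, w = original_hw
    rows = (h + PATCH - 1) // PATCH
    cols = (w + PATCH - 1) // PATCH
    grid = [torch.cat(patches[r * cols:(r + 1) * cols], dim=-1) for r in range(rows)]
    return torch.cat(grid, dim=-2)[..., :h, :w]
```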
The goal of this RFC is to enable non-text output and to demonstrate it with the above-mentioned model.
Why support models in vLLM that do not generate text? Because consolidating onto a single serving platform simplifies the software stack for those dealing with multiple types of models. Also, over time, models not targeting text might benefit from optimizations introduced by the vLLM community, similarly to what has been happening for Transformer-based causal models.
Proposed Change.
I propose a two-phase approach. In the first phase, integrate the model as a pooling model and pre/post-process input and output data outside of vLLM. In the second phase, perform a proper integration of the model, also taking care of processing the input image and generating the output one.
Phase 1: Basic enablement of Geospatial model in vLLM
Pre/post-processing of the input image is done outside of vLLM. The input image is broken down into patches (generic tensors), and all patches are fed into vLLM. Output tensors are collected, and post-processing re-creates the output image.
For this phase we could piggyback on the support available for pooling models (thanks @DarkLight1337 for the suggestion), where the hidden states of the model are returned as output.
Changes for phase 1:
Step 1
Extend the output type for pooling models, which currently only targets embeddings, to also support a generic output type. This output would then be post-processed outside of vLLM.
Step 2
Right now, the only two methods that pooling models can execute are `encode` and `score`. Would it make sense to define a third one, like `transform`? This would just be for the sake of not using `encode`. Also, would it make sense to create a new entrypoint class in addition to LLM, something like VisionModel or similar? This is again for the sake of completeness, since this is not a language model.
Step 3
Exploit the batching capabilities of vLLM and present all the image patches to the vLLM entrypoint as a list of generic tensors, similar to what is done now when presenting multiple prompts at a time (see the sketch below).
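For illustration, here is a rough sketch of the interface shape Steps 2 and 3 point at. `VisionModel`, `transform` and `TensorOutput` are placeholder names from this proposal, not existing vLLM APIs, and the stub below only demonstrates the calling convention.

```python
# Sketch of the entrypoint shape proposed in Steps 2-3; all names are placeholders.
from dataclasses import dataclass
from typing import Callable, List

import torch


@dataclass
class TensorOutput:
    """Generic (non-embedding) pooling-style output for one input patch."""
    tensors: List[torch.Tensor]


class VisionModel:
    """Stand-in for a possible non-LLM entrypoint alongside `LLM`."""

    def __init__(self, forward_fn: Callable[[torch.Tensor], torch.Tensor]):
        self._forward_fn = forward_fn

    def transform(self, patches: List[torch.Tensor]) -> List[TensorOutput]:
        # All patches are presented in one call so the engine can batch them,
        # analogous to passing multiple prompts to generate()/encode().
        return [TensorOutput(tensors=[self._forward_fn(p)]) for p in patches]


# Usage example with an identity "model" over four 224x224 patches.
model = VisionModel(forward_fn=lambda p: p)
outputs = model.transform([torch.zeros(6, 224, 224) for _ in range(4)])
print(len(outputs), outputs[0].tensors[0].shape)
```

The point is simply that all patches are handed over together, so the existing batching machinery can be reused instead of adding any new scheduling logic.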
Phase 2: Optimized integration of the model
Embed pre/post processing of images into vLLM and handle the recursive pattern for processing bigger images in vLLM.
(This phase might need to be updated/changed depending on the outcome of Phase 1)
Step 1
Integrate processing of the input with the already available multimodal input support. Among the things to be considered here is that an input image could be presented encoded as a string instead of being stored in a file (see the sketch below).
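As a small illustration of that requirement, a helper (purely illustrative, not a proposed API) that accepts either form:

```python
# Illustrative helper: accept an image either as a file path or as a
# base64-encoded string, as mentioned in Step 1. Uses only the stdlib.
import base64
import os


def load_image_bytes(image: str) -> bytes:
    """Return raw image bytes from a path or a base64-encoded payload."""
    if os.path.exists(image):
        with open(image, "rb") as f:
            return f.read()
    # Otherwise treat the string as base64 data (e.g. coming from an API request).
    return base64.b64decode(image)
```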
Step 2
Introduce the possibility of "installing" an output processor that generates images in the required format, in the same spirit as what is done for input processors via `@INPUT_REGISTRY.register_input_processor()`. The idea would be to create an output registry and enable models to register an output processor, so that all the output generated for a sequence can be converted into the proper image format for the specific model.
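One possible shape for such a registry, mirroring the decorator style of `@INPUT_REGISTRY.register_input_processor()`; `OUTPUT_REGISTRY` and its methods are sketched proposals, not existing vLLM code.

```python
# Hypothetical OUTPUT_REGISTRY sketch: mirrors the input-processor decorator
# pattern; none of these names exist in vLLM today.
from typing import Callable, Dict, List, Type

import torch

# An output processor turns the raw tensors produced for one request into
# the bytes of an image in the model's target format (e.g. GeoTIFF).
OutputProcessor = Callable[[List[torch.Tensor]], bytes]


class OutputRegistry:
    def __init__(self) -> None:
        self._processors: Dict[Type, OutputProcessor] = {}

    def register_output_processor(self, processor: OutputProcessor):
        """Class decorator associating a model with its output processor."""
        def wrapper(model_cls: Type) -> Type:
            self._processors[model_cls] = processor
            return model_cls
        return wrapper

    def get_output_processor(self, model_cls: Type) -> OutputProcessor:
        return self._processors[model_cls]


OUTPUT_REGISTRY = OutputRegistry()
```

A model would then opt in with `@OUTPUT_REGISTRY.register_output_processor(my_geotiff_processor)` on its class definition, where `my_geotiff_processor` stands for whatever converter the model author provides.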
Step 3
Create a new output class that allows the output to be presented in the form of an image. We could call it ImageOutput and ImageRequestOutput. Users would be able to either post-process the model output and return a string containing the path to the generated file, or return the raw image output for post-processing outside of vLLM (see the sketch below).
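A rough sketch of what these classes could carry, using the `image_data`/`image_path` fields mentioned in Step 4 below; all names and fields are tentative.

```python
# Tentative sketch of the proposed output classes; field names follow the
# text of this RFC (image_data / image_path) and may well change.
from dataclasses import dataclass, field
from typing import List, Optional

import torch


@dataclass
class ImageOutput:
    """Raw model output for one image, prior to any post-processing."""
    tensors: List[torch.Tensor] = field(default_factory=list)


@dataclass
class ImageRequestOutput:
    request_id: str
    # Raw output; optional when post-processing happened inside vLLM.
    image_data: Optional[ImageOutput] = None
    # Path to the image written by the registered output processor, if any.
    image_path: Optional[str] = None
```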
Step 4
Handle recursive processing of image patches within vLLM. Each image fed to vLLM is pre-processed and split into patches. All patches are run through the model, and all the output patches are handled by the output processor. Could we re-use some of the logic used for handling autoregressive queries? In this case we would already know how many times model inference should be executed (the number of image patches), and there is no need to append the output of one iteration to the input of the next; we just feed the next patch, and so on.
The output of the request in this case will still be of type `ImageRequestOutput`, with the `image_data` field optional and `image_path` populated with the path to the image generated during post-processing.
Feedback Period.
2 weeks
CC List.
@njhill @ywang96 @DarkLight1337 @robertgshaw2-neuralmagic
Any Other Things.
No response