[RFC]: Hidden states processor #12249

Open
DarkLight1337 opened this issue Jan 21, 2025 · 11 comments

@DarkLight1337 (Member) commented Jan 21, 2025

Motivation.

Since #10674, vLLM uses Pooler to extract hidden states from the model and convert them to embeddings, class probabilities, and so on. However, this is still not user-friendly enough.

Proposed Change.

Similar to LogitsProcessor (#1469), we can pass a custom HiddenStatesProcessor in SamplingParams and PoolingParams to postprocess the hidden states and return them in the output. This provides maximum flexibility and enables the same model to be used for different downstream tasks.

# Note that we can use a different processor each time we call `llm.generate`
outputs = llm.generate(..., sampling_params=SamplingParams(hidden_states_processor=...))
custom_outputs = outputs.hidden_states_processor_outputs

The interface of HiddenStatesProcessor is similar to VllmModelForTextGeneration.compute_logits and VllmModelForPooling.pooler:

import torch
# TypeVar defaults (PEP 696) require typing_extensions on Python < 3.13
from typing_extensions import Protocol, TypeVar

H = TypeVar("H", default=torch.Tensor)
R = TypeVar("R", default=torch.Tensor)

class HiddenStatesProcessor(Protocol[H, R]):
    def __call__(self, model: VllmModel[H], hidden_states: H) -> R:
        ...

The default poolers for each downstream task will be implemented as built-in HiddenStatesProcessor classes.

  • IdentityHiddenStatesProcessor: Returns hidden states directly (mainly for reward models)
  • NormalizeHiddenStatesProcessor: Applies normalization to hidden states (mainly for prompt embedding)
  • SoftmaxHiddenStatesProcessor: Applies softmax to hidden states (mainly for classification)
  • StepHiddenStatesProcessor: Applies step processing to hidden states (mainly for PRM models)

The existing pooling APIs (LLM.encode, LLM.embed, LLM.score, etc.) will be updated to use these HiddenStatesProcessors automatically.
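
For illustration, here is a minimal sketch of how two of these built-ins could implement the HiddenStatesProcessor protocol above (the exact implementations are assumptions, not settled code):

import torch
import torch.nn.functional as F

class NormalizeHiddenStatesProcessor:
    """L2-normalize the hidden states (mainly for prompt embedding)."""

    def __call__(self, model, hidden_states: torch.Tensor) -> torch.Tensor:
        # `model` is accepted to satisfy the protocol but is not needed here
        return F.normalize(hidden_states, p=2, dim=-1)

class SoftmaxHiddenStatesProcessor:
    """Convert hidden states to class probabilities (mainly for classification)."""

    def __call__(self, model, hidden_states: torch.Tensor) -> torch.Tensor:
        return torch.softmax(hidden_states, dim=-1)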

To get logits from the hidden states, we can have a new hidden states processor that references the LM head of the model:

class LogitsFromHiddenStatesProcessor(HiddenStatesProcessor):
    def __init__(self, lm_head_name: str = "lm_head") -> None:
        self.lm_head_name = lm_head_name

    def __call__(self, model: VllmModel, hidden_states: torch.Tensor) -> torch.Tensor:
        lm_head = getattr(model, self.lm_head_name)
        assert isinstance(lm_head, VocabParallelEmbedding)

        logits = lm_head.linear_method.apply(lm_head, hidden_states)
        return logits

With this design, we can also generate multi-modal outputs:

class ImageFromHiddenStatesProcessor(HiddenStatesProcessor[torch.Tensor, list[Image]]):
    def __init__(self, decoder_name: str = "image_decoder") -> None:
        self.decoder_name = decoder_name
        self._to_pil_image = torchvision.transforms.v2.ToPILImage()

    # Suppose hidden_states is the output of the model's encoder without calling the vision decoder
    def __call__(self, model: VllmModel, hidden_states: torch.Tensor) -> list[Image]:
        image_decoder = getattr(model, self.decoder_name)
        images = image_decoder(hidden_states)  # Shape: [N, C, H, W]
        return [self._to_pil_image(image) for image in images.cpu()]

(Note: This is just one potential approach to generate multi-modal outputs in vLLM. Other methods are still up for discussion.)

Some issues to be addressed:

  • How to handle TP/PP properly? Should the processor be aware of this?
  • The hidden states processor is not known at startup time, so it is excluded from model profiling. This may lead to OOM issues, especially if the hidden states processor calls a significant portion of the model.

Feedback Period.

Around 2 weeks? See when I have time to work on this...

CC List.

@simon-mo @youkaichao @robertgshaw2-redhat @ywang96 @Isotr0py @maxdebayser @flaviabeo @HwwwwwwwH

Any Other Things.

Since the regular model runner can also return hidden states, we should consider merging the functionality of PoolingModelRunner with the regular ModelRunner in V1 (#8779) to simplify our codebase. I think the only difference is that PoolingModelRunner uses dummy KV caches?

@youkaichao (Member) commented:

> we can pass a custom HiddenStatesProcessor in SamplingParams

Can we pass it as an argument to LLM? I don't think people would use a different processor for each request. It should be the same across one inference instance.
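
Concretely, an instance-level version might look something like the sketch below (the hidden_states_processor constructor argument is hypothetical, not an existing API):

# Hypothetical sketch: a single processor fixed at engine construction time.
llm = LLM(model="...", hidden_states_processor=SoftmaxHiddenStatesProcessor())

# Every request served by this engine then uses the same processor.
outputs = llm.generate(prompts, sampling_params=SamplingParams())
custom_outputs = outputs.hidden_states_processor_outputs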

@DarkLight1337 (Member, Author) commented Jan 21, 2025:

> Can we pass it as an argument to LLM? I don't think people would use a different processor for each request. It should be the same across one inference instance.

Some users want to use the same LLM engine for online generation and embedding, but not necessarily both in the same request (see #11905). It would be a waste of resources to run both in that case.

@youkaichao (Member) commented:

I don't think we need to support that. One engine should do one task; otherwise the code would be super complicated and difficult to optimize.

@youkaichao (Member) commented:

Complicated sampling parameters were a major factor in why vLLM became slower previously. We should be very careful about runtime costs that are incurred per-request. In general, I'm fine with adding engine-level features that do not affect per-request performance.

@DarkLight1337 (Member, Author) commented Jan 21, 2025:

I understand your concern. Keep in mind though that hidden states processors are optional and should not affect performance except when they are being used. Unless I'm mistaken, our Poolers aren't being optimized either, so the performance should be similar to what we have now.

@DarkLight1337 (Member, Author) commented Jan 21, 2025:

We can pass a dictionary of "allowed" hidden states processors to LLM at startup time so vLLM can profile and optimize them. Then at inference time, we only let the user select from these processors. Would that alleviate your concerns?

@youkaichao (Member) commented:

> Keep in mind though that hidden states processors are optional and should not affect performance except when they are being used.

Adding sampling parameters will in general slow down inference, because sometimes we need to pass the object across processes.

I think hidden states processors should only be instance-level: users should only be able to specify one processor per instance.

@youkaichao (Member) commented:

> We can pass a dictionary of "allowed" hidden states processors to LLM at startup time so vLLM can profile and optimize them. Then at inference time, we only let the user select from these processors.

That would be even worse, since you would need to validate the processor at runtime, per request.

@DarkLight1337 (Member, Author) commented:

> > We can pass a dictionary of "allowed" hidden states processors to LLM at startup time so vLLM can profile and optimize them. Then at inference time, we only let the user select from these processors.
>
> That would be even worse, since you would need to validate the processor at runtime, per request.

I mean that the user only passes the (string) key of the processor from the initial dictionary, without sending the actual object.
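
Something like the following sketch (all of the argument names here are hypothetical, just to illustrate the shape of the API):

# Hypothetical: processors are registered by name at startup so vLLM can profile them.
llm = LLM(
    model="...",
    hidden_states_processors={
        "softmax": SoftmaxHiddenStatesProcessor(),
        "normalize": NormalizeHiddenStatesProcessor(),
    },
)

# Per request, only the string key is sent; no processor object crosses process boundaries.
outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(hidden_states_processor="softmax"),
)
custom_outputs = outputs.hidden_states_processor_outputs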

@comaniac (Collaborator) commented:

I'm also in favor of setting it at startup time so that we can better profile and avoid OOM. In general, allowing one endpoint to support various pooling mechanisms seems uncommon. We could mark it as a limitation for now, and think about improvements in the future if there is high demand.

@youkaichao (Member) commented:

> OOM

Out-of-memory errors are also a valid concern. If the engine can serve multiple types of requests, then the memory profiling stage becomes super complicated, and I doubt it is even possible. Under high load, lots of edge cases can occur.

There are also scheduling challenges: generation and embedding can require different amounts of memory, and a single token budget would not be enough to quantify and bound the memory usage.
