[RFC]: Hidden states processor #12249
Comments
Can we pass it as an argument to
Some users want to use the same LLM engine for online generation and embedding, but not necessarily both in the same request (see #11905). It would be a waste of resources to run both in that case.
I don't think we need to support that. One engine should do one task; otherwise the code would be super complicated and difficult to optimize.
Complicated sampling parameters were a major factor in why vLLM became slower previously. We should be very careful about runtime costs that happen per-request, while in general I'm fine with adding engine-level features that do not affect per-request performance.
I understand your concern. Keep in mind, though, that hidden states processors are optional and should not affect performance except when they are being used. Unless I'm mistaken, our Poolers aren't being optimized either, so the performance should be similar to now.
We can pass a dictionary of "allowed" hidden states processors to
Adding sampling parameters will in general slow down inference, because sometimes we need to pass the object across processes. I think hidden states processors should only be instance-level: users can only specify one processor per instance.
That would be even worse, since you would need to validate the processor at runtime, per-request.
I mean that the user only passes the (string) key of the processor in the initial dictionary without sending the actual object.
I'm also in favor of setting it at startup time so that we can better profile and avoid OOM. In general, allowing one endpoint to support various pooling mechanisms doesn't seem that common a need. We could mark it as a limitation for now, and think about improvements in the future if there is high demand.
Out-of-memory errors are also a valid concern. If the engine can serve multiple types of requests, then the memory profiling stage would be super complicated, and I doubt it is even possible; under high load, lots of edge cases can occur. There are also scheduling challenges: generation and embedding can require different amounts of memory, and a single token budget would not be enough to quantify and bound the memory usage.
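The startup-time registry idea discussed above (processors registered once per engine instance, referenced per-request only by string key, so no objects cross process boundaries) could be sketched roughly as follows. This is a hypothetical illustration; none of these names are actual vLLM APIs.

```python
from typing import Callable, Dict

# Hypothetical processor type: maps hidden states to postprocessed values.
Processor = Callable[[list[float]], list[float]]


class ProcessorRegistry:
    """Hypothetical registry: the allowed processors are validated once at
    engine startup, and each request only sends a string key."""

    def __init__(self, allowed: Dict[str, Processor]) -> None:
        # Validated a single time, at startup; no per-request object transfer.
        self._allowed = dict(allowed)

    def get(self, key: str) -> Processor:
        # Per-request cost is just a dict lookup.
        if key not in self._allowed:
            raise KeyError(f"Unknown hidden states processor: {key!r}")
        return self._allowed[key]


# Engine startup: register the allowed processors once.
registry = ProcessorRegistry({"identity": lambda hs: hs})

# Per request: resolve the processor from its string key.
processor = registry.get("identity")
```

The design keeps per-request overhead to a dictionary lookup, addressing the cross-process serialization concern raised above.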
Motivation.
Since #10674, vLLM uses `Pooler` to extract hidden states from the model and convert them into embeddings, class probabilities, and so on. However, this is still not user-friendly enough: the pooler can only be configured via `--override-pooler-config`, which isn't that intuitive to use (e.g. [Usage]: Token Embeddings from LLMs/VLMs #12085). Even so, we still lack support for custom processing of hidden states (e.g. [RFC]: Adding support for Geospatial models #11065, [Feature]: Support sigmoid for classification models #11881, [New Model]: openbmb/MiniCPM-o-2_6 #12162).

Proposed Change.
Similar to `LogitsProcessor` (#1469), we can pass a custom `HiddenStatesProcessor` in `SamplingParams` and `PoolingParams` to postprocess the hidden states and return them in the output. This provides maximum flexibility and enables the same model to be used for different downstream tasks.

The interface of `HiddenStatesProcessor` is similar to `VllmModelForTextGeneration.compute_logits` and `VllmModelForPooling.pooler`:
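The original code block for this interface was not preserved in this copy of the issue; a minimal sketch of what it might look like, with one example built-in implementation, is shown below. All names and signatures here are assumptions, not the actual RFC code.

```python
from typing import Protocol

import torch


class HiddenStatesProcessor(Protocol):
    """Hypothetical interface: maps hidden states of shape
    (num_tokens, hidden_size) to a postprocessed output tensor."""

    def __call__(self, hidden_states: torch.Tensor) -> torch.Tensor:
        ...


class SoftmaxHiddenStatesProcessor:
    """Example built-in: applies softmax over the last dimension
    (mainly for classification heads)."""

    def __call__(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return torch.softmax(hidden_states, dim=-1)


# Usage: any callable with this signature satisfies the protocol.
processor: HiddenStatesProcessor = SoftmaxHiddenStatesProcessor()
probs = processor(torch.randn(2, 4))
```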
The default poolers for each downstream task will be implemented as built-in `HiddenStatesProcessor` classes:

- `IdentityHiddenStatesProcessor`: Returns hidden states directly (mainly for reward models)
- `NormalizeHiddenStatesProcessor`: Applies normalization to hidden states (mainly for prompt embedding)
- `SoftmaxHiddenStatesProcessor`: Applies softmax to hidden states (mainly for classification)
- `StepHiddenStatesProcessor`: Applies step processing to hidden states (mainly for PRM models)

The existing pooling APIs (`LLM.encode`, `LLM.embed`, `LLM.score`, etc.) will be updated to use these `HiddenStatesProcessor`s automatically.

To get logits from the hidden states, we can have a new hidden states processor that references the LM head of the model:
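The code block that followed here was not preserved. A hedged sketch of such an LM-head-referencing processor is below; the class name and constructor are hypothetical, and a toy `nn.Linear` stands in for a real model's LM head.

```python
import torch
from torch import nn


class LogitsFromHiddenStatesProcessor:
    """Hypothetical processor that projects hidden states through the
    model's existing LM head to recover vocabulary logits."""

    def __init__(self, lm_head: nn.Linear) -> None:
        # Keep a reference to the model's LM head; no weights are copied.
        self.lm_head = lm_head

    def __call__(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (num_tokens, hidden_size) -> (num_tokens, vocab_size)
        return self.lm_head(hidden_states)


# Usage with a toy LM head (hidden_size=8, vocab_size=16):
lm_head = nn.Linear(8, 16, bias=False)
processor = LogitsFromHiddenStatesProcessor(lm_head)
logits = processor(torch.randn(3, 8))
```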
With this design, we can also generate multi-modal outputs:
(Note: This is just one potential approach to generate multi-modal outputs in vLLM. Other methods are still up for discussion.)
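To make the note above concrete, one purely illustrative shape such a multi-modal processor could take is a decoder head that maps hidden states to image patches. Every name here is hypothetical, and the toy decoder stands in for whatever real image head a model ships with.

```python
import torch
from torch import nn


class ImageOutputProcessor:
    """Purely illustrative: decodes hidden states into image tensors via a
    model-provided decoder head (e.g. a VAE or projection head)."""

    def __init__(self, decoder: nn.Module) -> None:
        self.decoder = decoder

    def __call__(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.decoder(hidden_states)


# Toy decoder: hidden_size=8 -> 3x4x4 image patches per token.
decoder = nn.Sequential(nn.Linear(8, 48), nn.Unflatten(-1, (3, 4, 4)))
processor = ImageOutputProcessor(decoder)
patches = processor(torch.randn(2, 8))
```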
Some issues to be addressed:
Feedback Period.
Around 2 weeks? See when I have time to work on this...
CC List.
@simon-mo @youkaichao @robertgshaw2-redhat @ywang96 @Isotr0py @maxdebayser @flaviabeo @HwwwwwwwH
Any Other Things.
Since the regular model runner can also return hidden states, we should consider merging the functionality of `PoolingModelRunner` with the regular `ModelRunner` in V1 (#8779) to simplify our codebase. I think the only difference is that `PoolingModelRunner` uses dummy KV caches?