Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Support serving encoder/decoder models #7258

Merged
merged 35 commits into from
Aug 9, 2024

Conversation

DarkLight1337
Copy link
Member

@DarkLight1337 DarkLight1337 commented Aug 7, 2024

This PR cleans up the parsing logic introduced in #4942 to pass mypy, and updates the async engine to follow the same code structure, enabling encoder/decoder models to be served using the OpenAI-compatible server. I have added a basic test accordingly.

Other related changes:

  • Moved parsing-related code from vllm.inputs.data to a new module vllm.inputs.parse. To declutter import statements, these functions now have to be imported from vllm.inputs.parse explicitly rather than from vllm.inputs.
  • Most (but not all) of the warnings spammed by test_bart.py due to mismatched text have been fixed by padding the vLLM output string with BOS/EOS tokens.
  • Added is_list_of helper to vllm.utils (ported from #7126). This avoids the need for type casts in parse_and_batch_prompt.
  • Replaced all existing uses of TypeGuard with the newly-introduced TypeIs construct. The minimum versions of mypy and typing_extensions have been bumped accordingly.

cc @afeldman-nm

Copy link

github-actions bot commented Aug 7, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which consists a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of default ones by unblocking the steps in your fast-check build on Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

Copy link
Member Author

@DarkLight1337 DarkLight1337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some explanations

logit_bias: Dict[int, float],
token_ids: List[int],
logits: torch.Tensor,
) -> torch.Tensor:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error appeared after updating mypy so I had to fix it in this PR.

) -> List[Tuple[PromptInputs, PromptInputs]]:
return [(enc_dec_prompt['encoder_prompt'],
enc_dec_prompt['decoder_prompt'])
for enc_dec_prompt in enc_dec_prompts]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since vllm.inputs could import utils but these functions require vllm.inputs, I have moved them to vllm.inputs.parse to avoid circular imports.

raise ValueError("prompt must be a string, array of strings, "
"array of tokens, or array of token arrays")


Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These have been moved to vllm.inputs.parse.


raise ValueError(f"Invalid prompt {prompt}")


Copy link
Member Author

@DarkLight1337 DarkLight1337 Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer required as I have moved the parsing logic inside the engine class itself.

@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 7, 2024
@DarkLight1337 DarkLight1337 changed the title [Core] Refactor encoder/decoder prompt parsing logic [Core] Support encoder/decoder prompt in AsyncLLMEngine Aug 7, 2024
@DarkLight1337 DarkLight1337 changed the title [Core] Support encoder/decoder prompt in AsyncLLMEngine [Core] Support serving encoder/decoder models Aug 7, 2024
Copy link
Contributor

@afeldman-nm afeldman-nm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @DarkLight1337 these improvements are great. I just had a few questions/suggestions. Overall LGTM.

decoder_prompt, decoder_prompt_ids, decoder_mm_data = decoder_comps

if encoder_mm_data is not None or decoder_mm_data is not None:
raise ValueError("Multi-modal data is not supported for "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: currently the encoder/decoder infrastructure doesn't support multi-modal models (fixing this is a near-term goal.)

Perhaps "Multi-modal inputs are not supported for "
"encoder-decoder models"

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically the existing multi-modal models use a vision encoder, so I wanted to make it clear that "encoder" here refers to language encoder.

Copy link
Contributor

@afeldman-nm afeldman-nm Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point I am about to make it not critical, don't block merging on addressing this - but just for context, the limitation is really that encoder/decoder models with cross-attention do not currently support multi-modal.

So for example, a model like Llava has an encoder but not cross-attention, and obviously vLLM supports Llava (and did even before encoder/decoder model support was introduced.)

However, a vision model setup like the one shown in the diagram (taken from a blog post) requires cross attention between ViT and RoBERTa in order to decode a description of the image; even though this is a vision model, it would not currently be supported by vLLM because it requires both multi-modal support and encoder/decoder cross-attention, which is a currently-unsupported combination (for example, EncoderDecoderModelRunner has an assert which checks that multimodality is not enabled.)

image

Copy link
Contributor

@afeldman-nm afeldman-nm Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course this inability to have multimodal+encoder/decoder will need to be addressed by the time we add the Whisper model to vLLM, which is a near-term goal

Copy link
Member Author

@DarkLight1337 DarkLight1337 Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for providing more context, I see what you mean by encoder-decoder in multi-modal context now. Yeah, we will need to address this soon.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the error to say "multi-modal" encoder-decoder models are not supported yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saw this comment - yes like @afeldman-nm said, this is something we should start thinking about since Llama 3.1 multi-modal has exactly the same inference pattern like what Andrew described here.

Copy link
Contributor

@afeldman-nm afeldman-nm Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ywang96 FYI I just released an RFC giving an overview of next-steps for vLLM's encoder/decoder support, and referenced this discussion in the section on multimodality support: #7366 (comment)

Feel free to leave any feedback that you have

vllm/inputs/data.py Outdated Show resolved Hide resolved
Comment on lines +44 to +49
completion = await client.completions.create(
model=model_name,
prompt=[0, 0, 0, 0, 0],
max_tokens=5,
temperature=0.0,
)
Copy link
Contributor

@afeldman-nm afeldman-nm Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DarkLight1337

It is great that we support this now. Does it make sense maybe to have two different test prompts, one being a singleton encoder prompt, the other being an explicit encoder/decoder prompt? This would help confirm that the prompt processing pipeline works as it should for the async scenario (not that I can think of a reason it wouldn't.)

An example prompt could be

{
'encoder_prompt': 'The rain in spain',
'decoder_prompt': [0,0,0,0,0]
}

Copy link
Member Author

@DarkLight1337 DarkLight1337 Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this input format is specified in the OpenAI API, so users shouldn't run into problems unless they use the async engine directly. Since this PR is pretty big already, let's leave this for future work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay then LGTM.

@@ -334,17 +438,19 @@ async def add_request_async(
trace_headers: Optional[Mapping[str, str]] = None,
prompt_adapter_request: Optional[PromptAdapterRequest] = None,
) -> None:
"""Async version of :meth:`add_request`."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not change in this PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't quite get what you mean by this. Could you elaborate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this function needs changes in this PR - just a nit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - this is just to keep the order of arguments consistent with the new ordering of parameters in process_model_inputs_async (which has been updated alongside process_model_inputs).

@robertgshaw2-redhat
Copy link
Collaborator

I think this looks good.

Separately @DarkLight1337 (for follow up) perhaps we should create a new class called AsyncLLMEngineEncoderDecoder rather than branching inside AsyncLLMEngine - WDYT

@DarkLight1337
Copy link
Member Author

I think this looks good.

Separately @DarkLight1337 (for follow up) perhaps we should create a new class called AsyncLLMEngineEncoderDecoder rather than branching inside AsyncLLMEngine - WDYT

I think that it would be best to keep the same class structure as the base LLMEngine. If the goal is to slim down the engine classes, perhaps we could factor out both the sync and async versions of the parsing code.

request_id=request_id,
)

encoder_comps, decoder_comps = await asyncio.gather(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the order matter here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's just how asyncio.gather works.

@DarkLight1337 DarkLight1337 merged commit 7eb4a51 into vllm-project:main Aug 9, 2024
68 checks passed
@DarkLight1337 DarkLight1337 deleted the inputs-parser branch August 9, 2024 02:42
sfc-gh-mkeralapura pushed a commit to sfc-gh-mkeralapura/vllm that referenced this pull request Aug 12, 2024
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Aug 22, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants