Support beam search & parallel generation #7
Conversation
LGTM in general. Left some small comments.
    probs: torch.Tensor,
    p: torch.Tensor,
) -> torch.Tensor:
    # TODO(woosuk): Optimize.
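The hunk above shows only the signature of the top-p routine under review. For context, here is a minimal sketch of a straightforward (unoptimized) implementation matching that signature; the function name, and the assumptions that `probs` is already a softmax output and that `p` holds one top-p value per sequence, are guesses rather than details taken from the PR:

```python
import torch

def _apply_top_p(
    probs: torch.Tensor,  # (num_seqs, vocab_size), already softmax-ed
    p: torch.Tensor,      # (num_seqs,), per-sequence top-p thresholds
) -> torch.Tensor:
    # Hypothetical reconstruction; the PR's actual body is not shown in the hunk.
    # Sort probabilities so each row's nucleus becomes a prefix.
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cum_probs = sorted_probs.cumsum(dim=-1)
    # Drop tokens outside the nucleus: those whose cumulative mass,
    # excluding the token itself, already reaches p.
    mask = (cum_probs - sorted_probs) > p.unsqueeze(-1)
    sorted_probs = sorted_probs.masked_fill(mask, 0.0)
    # Renormalize the kept tokens and scatter back to vocab order.
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    return torch.empty_like(probs).scatter_(-1, sorted_idx, sorted_probs)
```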
Maybe it's faster to simply mask out the tokens whose cumulative probability is smaller than top_p (example code)?
Thanks for pointing me to the code. I don't think that implementation would be notably more efficient than ours, because it involves two softmax operations rather than one.
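For reference, a minimal sketch of the alternative being discussed, assuming standard PyTorch (this is not the linked example code verbatim): mask the logits by cumulative probability and re-apply softmax, which is where the second softmax comes from.

```python
import torch

def top_p_sample_via_masking(logits: torch.Tensor, p: float) -> torch.Tensor:
    # First softmax: probabilities used only to locate the nucleus.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cum_probs = sorted_probs.cumsum(dim=-1)
    # Mask out tokens whose preceding cumulative mass already reaches p.
    mask = (cum_probs - sorted_probs) > p
    sorted_logits = logits.gather(-1, sorted_idx).masked_fill(mask, float("-inf"))
    # Second softmax: renormalize over the surviving tokens, then sample.
    final_probs = torch.softmax(sorted_logits, dim=-1)
    sampled = torch.multinomial(final_probs, num_samples=1)
    # Map the sampled positions back to original vocabulary ids.
    return sorted_idx.gather(-1, sampled)
```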
This PR adds support for beam search and parallel generation (i.e., n > 1). NOTE: Correctness has only been checked for beam search, not for the random sampling methods.
Tested models:
Tested GPUs:
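For illustration, a hypothetical usage sketch of the two features. The `SamplingParams` fields shown (`n`, `use_beam_search`) match later public vLLM releases and are assumed here, not taken from this PR:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # model choice is arbitrary

# Parallel generation: n > 1 independent samples per prompt.
parallel = SamplingParams(n=4, temperature=0.8)

# Beam search over n beams (the path whose correctness was checked).
beam = SamplingParams(n=4, use_beam_search=True, temperature=0.0)

for params in (parallel, beam):
    for output in llm.generate(["The capital of France is"], params):
        for seq in output.outputs:
            print(seq.text)
```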