
[pull] main from vllm-project:main #5

Merged: 42 commits into opendatahub-io:main on Apr 26, 2024
Conversation

pull[bot] commented Apr 24, 2024

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

sighingnow and others added 25 commits April 22, 2024 09:19
Signed-off-by: Tao He <sighingnow@gmail.com>
Co-authored-by: Harry Mellor <hmellor@oxts.com>
Co-authored-by: Harry Mellor <hmellor@oxts.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
This PR is the first step towards fixing #3208

It implements dynamic per-tensor scaling (see #4118), so users do not need to compute activation scales on a calibration dataset or convert their model checkpoints; it is enough to pass the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Enable FP8 quantization with dynamic per-tensor scaling; Mixtral-8x7B is sharded across 2 GPUs.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
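
For intuition, here is a minimal sketch of what dynamic per-tensor FP8 scaling means, not the actual vLLM kernel: the function name `quantize_fp8_per_tensor` is illustrative, and it assumes a PyTorch build with `torch.float8_e4m3fn` support (>= 2.1).

```python
import torch

def quantize_fp8_per_tensor(x: torch.Tensor):
    """Illustrative sketch: the scale is derived from the tensor itself at runtime,
    so no calibration dataset or pre-computed checkpoint scales are needed."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3
    scale = x.abs().max().clamp(min=1e-12) / fp8_max
    x_fp8 = (x / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_fp8, scale

# The FP8 matmul then consumes x_fp8 and rescales its output by `scale`;
# for reference, de-quantization is approximately x_fp8.to(torch.float32) * scale.
```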

**Performance**: For this PR, the focus is on keeping the code clean (while still getting reasonable performance). There are a number of optimizations that we will submit in a follow-up PR and that significantly improve performance (similar to the numbers in #3954). With this PR, the results are as follows:

![Performance results (screenshot, 2024-04-21)](https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03)


**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the fp16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```
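
For reference, numbers like these are typically produced with lm-evaluation-harness. Below is a minimal sketch using its Python API (lm-eval >= 0.4); the exact `model_args` string, and the assumption that the harness's vLLM backend forwards `quantization` to `LLM`, are not taken from this PR.

```python
import lm_eval

# Sketch only: evaluate MMLU with the vLLM backend and FP8 quantization enabled.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=mistralai/Mixtral-8x7B-v0.1,tensor_parallel_size=2,quantization=fp8",
    tasks=["mmlu"],
    num_fewshot=5,
)
# Depending on the harness version, the aggregate appears under "results" or "groups".
print(results["results"].get("mmlu"))
```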

Happy hacking!
Fixes the fp8 interface, which broke in the AQLM merge.
[Core][Distributed] use existing torch.cuda.device context manager (#4318)
This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187.
The crash was caused by the inline PTX assembly that issued async_copy with streaming behavior. The fix is to use the more standard PTX for async_copy (without the fractional L2 "evict_first" policy). There is no performance difference between the standard async_copy PTX and the previous one.
openshift-ci bot commented Apr 24, 2024

Hi @pull[bot]. Thanks for your PR.

I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Co-authored-by: Simon Mo <simon.mo@hey.com>
z103cb commented Apr 26, 2024

/lgtm

z103cb commented Apr 26, 2024

/ok-to-test

openshift-ci bot removed the lgtm label Apr 26, 2024
z103cb commented Apr 26, 2024

/lgtm

openshift-ci bot added the lgtm label Apr 26, 2024
Xaenalt commented Apr 26, 2024

/approve

openshift-ci bot commented Apr 26, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pull[bot], Xaenalt

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

z103cb merged commit df73690 into opendatahub-io:main on Apr 26, 2024
13 of 14 checks passed