[ CI ] Upstream sync to `v0.4.3` branch #377

robertgshaw2-neuralmagic · 2024-07-14T21:46:41Z

SUMMARY:

Upstream sync to v0.4.3 of vLLM.
git cherry-pick f68470e803df575f294e67167b4b83adfe004cfa..1197e02141df1a7442f21ff6922c98ec0bba153e
vllm-project@f68470e
vllm-project@1197e02 (corresponds to upstream v0.4.3)
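
The cherry-pick range in the summary uses Git's two-dot syntax, so the commit named on the left (f68470e) is itself excluded; only the commits after it, up to and including 1197e02, are applied. A minimal sketch of that behavior in a throwaway repository is shown here. The branch name `downstream`, the `commit` helper, and the commit names A, B, C are illustrative assumptions and are not part of this PR.

```shell
#!/bin/sh
set -eu

# Throwaway repository so the sketch is self-contained.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "Dev"
git config commit.gpgsign false

# Hypothetical helper: each commit adds exactly one file named after it.
commit() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

commit base
git branch downstream          # fork point of the downstream branch

# "Upstream" history continues on the initial branch.
commit A
commit B
commit C
first=$(git rev-parse HEAD~2)  # commit A: left side of the range
tip=$(git rev-parse HEAD)      # commit C: right side of the range

# Apply first..tip onto downstream: A is excluded, B and C are picked.
git checkout -q downstream
git cherry-pick "$first..$tip" >/dev/null

# Subjects now on downstream, newest first: C, B, base (no A).
git log --format=%s
```

If the left-hand commit also needs to be applied, the range can be written as `first^..tip` instead.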


robertgshaw2-neuralmagic and others added 30 commits July 14, 2024 21:22

fixed version

37d144e

[Kernel] Add marlin_24 unit tests (vllm-project#4901)

c0191ae

[Kernel] Add flash-attn back (vllm-project#4907)

b25e7e8

[Model] LLaVA model refactor (vllm-project#4910)

1e71c78

Remove marlin warning (vllm-project#4918)

f7ed26b

[Misc]: allow user to specify port in distributed setting (vllm-project#4914)

bcb951d

[Build/CI] Enabling AMD Entrypoints Test (vllm-project#4834)

33643a4

Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>

[Bugfix] Fix dummy weight for fp8 (vllm-project#4916)

247dd03

Allow dummy load format for fp8, torch.uniform_ doesn't support FP8 at the moment

Co-authored-by: Mor Zusman <morz@ai21.com>

[Core] Sharded State Loader download from HF (vllm-project#4889)

ca46064

[Doc]Add documentation to benchmarking script when running TGI (vllm-project#4920)

38a41b9

[Core] Fix scheduler considering "no LoRA" as "LoRA" (vllm-project#4897)

3066421

[Model] add rope_scaling support for qwen2 (vllm-project#4930)

88bc88b

[Model] Add Phi-2 LoRA support (vllm-project#4886)

a5cd7df

[Docs] Add acknowledgment for sponsors (vllm-project#4925)

610f6a1

[CI/Build] Codespell ignore build/ directory (vllm-project#4945)

f270b9c

[Bugfix] Fix flag name for max_seq_len_to_capture (vllm-project#4935)

42abcff

Signed-off-by: kerthcet <kerthcet@gmail.com>

[Bugfix][Kernel] Add head size check for attention backend selection (vllm-project#4944)

5451fa4

[Frontend] Dynamic RoPE scaling (vllm-project#4638)

2989ade

[CI/Build] Enforce style for C++ and CUDA code with clang-format (vllm-project#4722)

5ccd7ce

[misc] remove comments that were supposed to be removed (vllm-project#4977)

09ba2c0

[Kernel] Fixup for CUTLASS kernels in CUDA graphs (vllm-project#4954)

bb60970

Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs

[Misc] Load FP8 kv-cache scaling factors from checkpoints (vllm-project#4893)

db09329

The 2nd PR for vllm-project#4532. This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).

[Model] LoRA gptbigcode implementation (vllm-project#3949)

d9e9332

[Core] Eliminate parallel worker per-step task scheduling overhead (vllm-project#4894)

9f3dac0

[Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig (vllm-project#4991)

2cffbda

[Misc] Take user preference in attention selector (vllm-project#4960)

6360c9c

Marlin 24 prefill performance improvement (about 25% better on average) (vllm-project#4983)

b4e50de

[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined (vllm-project#5009)

afe8526

[Core][1/N] Support send/recv in PyNCCL Groups (vllm-project#4988)

7306301

Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>

[Kernel] Initial Activation Quantization Support (vllm-project#4525)

db8e4c4

Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

youkaichao and others added 29 commits July 14, 2024 21:40

[Core][Optimization] remove vllm-nccl (vllm-project#5091)

bf29f34

[Bugfix] Fix arguments passed to Sequence in stop checker test (vllm-project#5092)

c420496

[Core][Distributed] improve p2p access check (vllm-project#4992)

3986c3e

[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (vllm-project#4837)

de49140

[Doc]Replace deprecated flag in readme (vllm-project#4526)

874a8e7

[Bugfix][CI/Build] Fix test and improve code for `merge_async_iterators` (vllm-project#5096)

365a276

[Bugfix][CI/Build] Fix codespell failing to skip files in git diff (vllm-project#5097)

491f240

[Core] Avoid the need to pass None values to Sequence.inputs (vllm-project#5099)

ce2c120

[Bugfix] logprobs is not compatible with the OpenAI spec vllm-project#4795 (vllm-project#5031)

627199b

[Doc][Build] update after removing vllm-nccl (vllm-project#5103)

ec82368

Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

[Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter (vllm-project#5108)

1da22e6

[BUGFIX] [FRONTEND] Correct chat logprobs (vllm-project#5029)

ed71c6b

Co-authored-by: Breno Faria <breno.faria@intrafind.com>

[Bugfix] Automatically Detect SparseML models (vllm-project#5119)

fbab69a

[CI/Build] increase wheel size limit to 200 MB (vllm-project#5130)

e33970a

[Misc] remove duplicate definition of seq_lens_tensor in model_runner.py (vllm-project#5129)

60e63d5

[Doc] Use intersphinx and update entrypoints docs (vllm-project#5125)

b726245

add doc about serving option on dstack (vllm-project#3074)

572316b

Co-authored-by: Roger Wang <ywang@roblox.com>

Bump version to v0.4.3 (vllm-project#5046)

ed9e1ee

[Build] Disable sm_90a in cu11 (vllm-project#5141)

55da1ff

[Bugfix] Avoid Warnings in SparseML Activation Quantization (vllm-project#5120)

9480a2a

[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (vllm-project#5136)

1324c62

Fix cutlass sm_90a vesrion in CMakeList

67fd5f0

[Model] Support MAP-NEO model (vllm-project#5081)

48d5cea

Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>

Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" (vllm-project#5149)

5fe5e47

[Misc]: optimize eager mode host time (vllm-project#4196)

fe4fd55

Co-authored-by: xuhao <xuhao@cambricon.com>

[Model] Enable FP8 QKV in MoE and refine kernel tuning script (vllm-project#5039)

a38e490

[Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support (vllm-project#5171)

852b763

[Build] Guard against older CUDA versions when building CUTLASS 3.x kernels (vllm-project#5168)

7244a18

robertgshaw2-neuralmagic closed this Aug 8, 2024
