
Fix a bug in tying OPT embeddings #1

Merged
WoosukKwon merged 1 commit into main from fix-opt on Feb 25, 2023

Conversation

WoosukKwon
Collaborator

This PR fixes a bug in the support for OPT-350m/OPT-6.7b/OPT-13b and the OPT-IML models.

The bug occurred because our model code did not implement the methods required to tie the input and output embeddings.
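For context, "tying" here means the output projection (lm_head) reuses the input embedding matrix rather than keeping a separate copy, and loaders typically rely on accessor methods such as get_input_embeddings/get_output_embeddings to re-tie the weights after loading. A minimal PyTorch sketch of the idea (illustrative only, not the vLLM model code):

```python
import torch
import torch.nn as nn


class TinyOPTLikeLM(nn.Module):
    """Illustrative-only model showing tied input/output embeddings."""

    def __init__(self, vocab_size: int = 32, hidden_size: int = 16):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        # Tie the weights: the output projection shares the embedding matrix.
        self.lm_head.weight = self.embed_tokens.weight

    # Accessors like these are what loaders use to (re)tie weights after loading.
    def get_input_embeddings(self) -> nn.Embedding:
        return self.embed_tokens

    def get_output_embeddings(self) -> nn.Linear:
        return self.lm_head

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed_tokens(input_ids)  # real models run transformer layers here
        return self.lm_head(hidden)


if __name__ == "__main__":
    model = TinyOPTLikeLM()
    # Both modules point at the same parameter storage.
    assert model.lm_head.weight.data_ptr() == model.embed_tokens.weight.data_ptr()
```

OPT-350m additionally projects between the embedding width and the hidden width; the sketch ignores that detail.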

WoosukKwon merged commit cbf8779 into main on Feb 25, 2023
WoosukKwon deleted the fix-opt branch on February 25, 2023 00:29
CZT0 referenced this pull request in semedia-tech/vllm Sep 11, 2023
orangetin referenced this pull request in togethercomputer/vllm-ttgi Sep 14, 2023
add rope scaling as a cli arg so openai server can load rope scaled models
xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 18, 2023
bigPYJ1151 added a commit to bigPYJ1151/vllm that referenced this pull request Oct 30, 2023
l1cacheDell added a commit to CaspianFang/vllm that referenced this pull request Nov 15, 2023
hongxiayang referenced this pull request in hongxiayang/vllm Feb 13, 2024
ilya-lavrenov referenced this pull request in ilya-lavrenov/vllm Feb 19, 2024
Deterministic OpenVINO inference
Spycsh pushed a commit to Spycsh/vllm that referenced this pull request Feb 27, 2024
* Porting vllm to HPU

* add hpu cache allocate

* move slot_mapping to cpu and add is_prompt in cache_ops.reshape_and_cache

* add bucket to input metadata

* 1. limit max block number for lazy mode (TODO)
2. set some input metadata from cuda to cpu

* remove bucket for block tables

* add run bash script and change benchmark config

* 1. modify kv cache structure to tensors
2. update hpu paged attention API (for hpu graph compatibility)

* add attention mask for generation

* add multi_query_kv_attention attn_bias

* Temp commit

* Integrate fused kernels for RMSNorm and RoPE

* Resolve merge conflicts

* Minor Gaudi workarounds, add debugging to stock vLLM API server

* Fix post-merge pinned memory segfaults

* Re-enable sequence decode

* Maintain GPU compatibility in cache_engine

* Adjust HPU RoPE for non-query runs

* Integrate HPU primitive implementations

* Add xops bindings

* Cast paged attention inputs to bfloat16

* Remove leftover debug calls

* Update comments on HPU ops

* Restoring NVIDIA compatibility in setup.py

* vllm.hpu cleanup

* Added HPU-specific requirements

* Restored full functionality on NVIDIA

* vllm.core cleanup

* vllm init cleanup

* vllm.hpu cleanup

* vllm.benchmarks cleanup

* vllm.entrypoint cleanup

* Changed is_hpu logic (see the sketch after this commit log)

* vllm.benchmark cleanup

* Fixed importing condition

* tests cleanup

* removed dummy printings

* Update test_api_server.py

* restored attention and logprobs tests functionality on Nvidia

* throughput benchmark cleanup

* Changed Habana copyright header

* Restored alibi in bloom

* Added BSD license header

---------

Co-authored-by: Xiaotong Chen <xchen@habana.ai>
Co-authored-by: Jinyan Chen <jychen@habana.ai>
Co-authored-by: Mikhail Dvoretckii <mdvoretckii@habana.ai>
Co-authored-by: Sebastian Urwan <surwan@habana.ai>
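
The commit log above mentions changing the is_hpu logic and fixing an importing condition. A minimal sketch of how such a capability check is often written, assuming the Habana stack ships a `habana_frameworks` package (an assumption for illustration, not necessarily what this port actually does):

```python
import importlib.util
from functools import lru_cache


@lru_cache(maxsize=None)
def is_hpu() -> bool:
    """Return True if an HPU software stack appears to be importable.

    Illustrative only: checks for the presence of the `habana_frameworks`
    package without importing it eagerly, so NVIDIA-only environments
    are unaffected.
    """
    return importlib.util.find_spec("habana_frameworks") is not None


def default_device() -> str:
    # Fall back to CUDA when no HPU stack is found.
    return "hpu" if is_hpu() else "cuda"
```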
mujjingun added a commit to gmlwns2000/vllm-timber that referenced this pull request Apr 15, 2024
mzusman pushed a commit to mzusman/vllm that referenced this pull request Apr 16, 2024
BA-78554: Jurassic 2.5

* worked on the jurassic2.5 configuration file, updated the jurassic2_5 modeling file to support alternating experts/attn layers (see the sketch after this commit log)

* finished working the forward pass of jurassic3.py

* finished working the forward pass of jurassic3.py

* finished working the forward pass of jurassic3.py

* jurassic_3 modeling file works, uses dummy weights initialized by "dummy" flag. Tokenizer raises issues, for now copying the mixtral tokenizer

* changed default tokenizer vocab values, loading of custom .pt weight files works.

* removed notebook

* merging master to jurassic-2.5 to reset head

* Merge branch 'master' into jurassic-2.5

* align to master

Approved-by: Tomer Asida
Approved-by: Mor Zusman
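
The first item above describes alternating expert and attention layers. A rough sketch of how such an interleaved stack could be assembled (the function names and the `expert_every` spacing are illustrative assumptions, not the actual Jurassic modeling code):

```python
import torch.nn as nn


def build_alternating_stack(num_layers: int,
                            make_attention_layer,
                            make_expert_layer,
                            expert_every: int = 2) -> nn.ModuleList:
    """Interleave attention blocks with mixture-of-experts blocks.

    Every `expert_every`-th layer is an expert (MoE) block; the rest are
    plain attention blocks. Purely illustrative of the layout described
    in the commit message.
    """
    layers = []
    for i in range(num_layers):
        if (i + 1) % expert_every == 0:
            layers.append(make_expert_layer(i))
        else:
            layers.append(make_attention_layer(i))
    return nn.ModuleList(layers)


if __name__ == "__main__":
    stack = build_alternating_stack(
        num_layers=4,
        make_attention_layer=lambda i: nn.Identity(),  # placeholder attention block
        make_expert_layer=lambda i: nn.Identity(),     # placeholder MoE block
    )
    print(len(stack))  # 4 layers, alternating attention / expert
```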
Bellk17 added a commit to Bellk17/vllm that referenced this pull request May 10, 2024
ykim362 referenced this pull request in ykim362/vllm Jun 17, 2024
Group Gemm Version
alixiaodi mentioned this pull request Aug 2, 2024
robinren03 added a commit to robinren03/vllm that referenced this pull request Sep 10, 2024
ZhijieWang pushed a commit to ZhijieWang/vllm that referenced this pull request Oct 19, 2024
* feat: powv per token

* feat: add justfile

* fix: justfile

* fix: missing link in powv pass

* fix: powv calculation

* ref: powv to separate function

* fix: move to parent class

* feat: initial verify endpoint

* feat: initial verify endpoint

* fix: actually add as route

* feat(WIP): verify endpoint

* fix: sequence of ints instead of list for chat completion

* fix: loosen restrictions on verify chat completion

* fix: verifychatcompletion for get_powv

* fix: using wrong field

* fix: add verify into rpc layer

* fix: await verify

* fix: non-async fields

* fix: async handling

* fix: no more destruct

* feat: return powv to the top

* fix: send back via socket

* feat: add endpoint for completion

* feat: add version guards
Jeffwan pushed a commit to Jeffwan/vllm that referenced this pull request Oct 21, 2024
* Enable vineyard llm kv cache in vLLM

Based on another version of vllm: sighingnow@d347dab

Cherry-pick from commit d347dab

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
(cherry picked from commit 1545f6bf7edcd667e305d3fbcadd913066f04747)

resolving vllm update diff

temporarily comment out torch.distributed for single node env

add VineyardCacheConfig with https://github.com/v6d-io/v6d/blob/ebe8f077e3d3780a27d49238c501854b6b8e29df/modules/llm-cache/ds/kv_cache_block.cc#L163 commented out; cache_ops fix

remove CacheConfig from argument (configure through ENV; see the sketch after this commit log)

v6d: fix integration w/ v1 APIs

Signed-off-by: Haiyang Shi <haiyang.shi@bytedance.com>

Change model_runner to latest version

cherry pick model_runner from d347dab source sighingnow@d347dab

fix reshape_and_cache_flash argument

add cache prefetch/update to work_base

clean up

Fix after rebase to 029c71d

remove tensor copy from cache managed address to pin memory

clean up

* Add fixes to address comments

---------

Co-authored-by: Tao He <linzhu.ht@alibaba-inc.com>
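
One item above notes that the cache configuration was moved from a constructor argument to environment variables. A small sketch of that pattern (the variable names are hypothetical, not the ones this integration defines):

```python
import os
from dataclasses import dataclass


@dataclass
class KVCacheEnvConfig:
    """Illustrative env-driven config; the variable names are made up."""
    enabled: bool
    chunk_size: int
    socket_path: str

    @classmethod
    def from_env(cls) -> "KVCacheEnvConfig":
        # Read everything from the process environment with safe defaults,
        # so no config object has to be threaded through constructors.
        return cls(
            enabled=os.environ.get("KV_CACHE_ENABLED", "0") == "1",
            chunk_size=int(os.environ.get("KV_CACHE_CHUNK_SIZE", "16")),
            socket_path=os.environ.get("KV_CACHE_SOCKET", "/tmp/kv_cache.sock"),
        )
```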
Xaenalt pushed a commit to Xaenalt/vllm that referenced this pull request Dec 9, 2024