Fix a bug in tying OPT embeddings #1
Merged
orangetin
referenced
this pull request
in togethercomputer/vllm-ttgi
Sep 14, 2023
add rope scaling as a cli arg so openai server can load rope scaled models
xiangyuT
pushed a commit
to xiangyuT/vllm
that referenced
this pull request
Oct 18, 2023
bigPYJ1151
added a commit
to bigPYJ1151/vllm
that referenced
this pull request
Oct 30, 2023
Fix key cache block shape.
l1cacheDell
added a commit
to CaspianFang/vllm
that referenced
this pull request
Nov 15, 2023
hongxiayang
referenced
this pull request
in hongxiayang/vllm
Feb 13, 2024
ilya-lavrenov
referenced
this pull request
in ilya-lavrenov/vllm
Feb 19, 2024
Deterministic OpenVINO inference
Spycsh
pushed a commit
to Spycsh/vllm
that referenced
this pull request
Feb 27, 2024
* Porting vllm to HPU
* add hpu cache allocate
* move slot_mapping to cpu and add is_prompt in cache_ops.reshape_and_cache
* add bucket to input metadata
* 1. limit max block number for lazy mode (TODO) 2. set some inpu metadata from cuda to cpu
* remove bucket for block tables
* add run bash script and change benchmark config
* 1. modify kv cache structure to tensors 2. update hpu paged attention API (for hpu graph compatibility)
* add attention mask for generation
* add multi_query_kv_attention attn_bias
* Temp commit
* Integrate fused kernels for RMSNorm and RoPE
* Resolve merge conflicts
* Minor Gaudi workarounds, add debugging to stock vLLM API server
* Fix post-merge pinned memory segfaults
* Re-enable sequence decode
* Maintain GPU compatibility in cache_engine
* Adjust HPU RoPE for non-query runs
* Integrate HPU primitive implementations
* Add xops bindings
* Cast paged attention inputs to bfloat16
* Remove leftover debug calls
* Update comments on HPU ops
* Restoring NVIDIA compatibility in setup.py
* vllm.hpu cleanup
* Added HPU-specific requirements
* Restored full functionality on NVIDIA
* vllm.core cleanup
* vllm init cleanup
* vllm.hpu cleanup
* vllm.benchmarks cleanup
* vllm.entrypoint cleanup
* Changed is_hpu logic
* vllm.benchmark cleanup
* Fixed importing condition
* tests cleanup
* removed dummy printings
* Update test_api_server.py
* restored attention and logprobs tests functionality on Nvidia
* throughput benchmark cleanup
* Changed Habana copyright header
* Restored alibi in bloom
* Added BSD license header
---------
Co-authored-by: Xiaotong Chen <xchen@habana.ai>
Co-authored-by: Jinyan Chen <jychen@habana.ai>
Co-authored-by: Mikhail Dvoretckii <mdvoretckii@habana.ai>
Co-authored-by: Sebastian Urwan <surwan@habana.ai>
daniel-geon-park
added a commit
to gmlwns2000/vllm-timber
that referenced
this pull request
Apr 15, 2024
mzusman
pushed a commit
to mzusman/vllm
that referenced
this pull request
Apr 16, 2024
BA-78554: Jurassic 2.5
* worked on jurasic2.5 configuration file, updated jurassic2_5 modeling file to support alternating experts/attn layers
* finished working the forward pass of jurassic3.py
* finished working the forward pass of jurassic3.py
* finished working the forward pass of jurassic3.py
* jurassic_3 modeling file works, uses dummy weights initialized by "dummy" flag. Tokenizer raises issues, for now copying the mixtral tokenizer
* changed default tokenizer vocab values, loading of custom .pt weight files works.
* removed notebook
* merging master to jurassic-2.5 to reset head
* Merge branch 'master' into jurassic-2.5
* align to master
Approved-by: Tomer Asida
Approved-by: Mor Zusman
Bellk17
added a commit
to Bellk17/vllm
that referenced
this pull request
May 10, 2024
Triton compilation fix
robinren03
added a commit
to robinren03/vllm
that referenced
this pull request
Sep 10, 2024
ZhijieWang
pushed a commit
to ZhijieWang/vllm
that referenced
this pull request
Oct 19, 2024
* feat: powv per token
* feat: add justfile
* fix: justfile
* fix: missing link in powv pass
* fix: powv calculation
* ref: powv to separate function
* fix: move to parent class
* feat: initial verify endpoint
* feat: initial verify endpoint
* fix: actually add as route
* feat(WIP): verfiy endpoint
* fix: sequence of ints instead of list for chat completion
* fix: loosen restrictions on verify chat completion
* fix: verifychatcompletion for get_powv
* fix: using wrong field
* fix: add very into rpc layer
* fix: await verify
* fix: non-async fields
* fix: async handling
* fix: no more destruct
* feat: return powv to the top
* fix: send back via socket
* feat: add endpoint for completion
* feat: add version guards
Jeffwan
pushed a commit
to Jeffwan/vllm
that referenced
this pull request
Oct 21, 2024
* Enable vineyard llm kv cache in vLLM Based on another version of vllm: sighingnow@d347dab Cherry-pick from commit d347dab Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com> (cherry picked from commit 1545f6bf7edcd667e305d3fbcadd913066f04747) resolving vllm update diff temporarily comment out torch.distributed for single node env add VineyardCacheConfig with https://github.com/v6d-io/v6d/blob/ebe8f077e3d3780a27d49238c501854b6b8e29df/modules/llm-cache/ds/kv_cache_block.cc#L163 commented out; cache_ops fix remove CacheConfig from argument (configure through ENV) v6d: fix integration w/ v1 APIs Signed-off-by: Haiyang Shi <haiyang.shi@bytedance.com> Change model_runner to latest version cherry pick model_runner from d347dab source sighingnow@d347dab fix reshape_and_cache_flash argument add cache prefetch/update to work_base clean up Fix after rebase to 029c71d remove tensor copy from cache managed address to pin memory clean up * Add fixes to address comments --------- Co-authored-by: Tao He <linzhu.ht@alibaba-inc.com>
Xaenalt
pushed a commit
to Xaenalt/vllm
that referenced
this pull request
Dec 9, 2024
Add OWNER file
This PR fixes a bug that broke support for the OPT-350m, OPT-6.7b, OPT-13b, and OPT-IML models.
The bug occurred because our model code lacked some of the methods required to tie the input and output embeddings.
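
To illustrate the failure mode described above, here is a minimal, dependency-free sketch of how a missing accessor method can cause weight tying to be skipped silently. It is not the code from this PR: `Layer`, `BaseModel`, `UntiedToyModel`, and `TiedToyModel` are hypothetical names, and the `get_input_embeddings` / `get_output_embeddings` / `tie_weights` pattern is assumed here by analogy with how some model base classes locate the layers to tie.

```python
class Layer:
    """Hypothetical stand-in for a layer that owns a weight matrix."""

    def __init__(self, weight):
        self.weight = weight


class BaseModel:
    """Hypothetical base class: tie_weights() depends on two accessor methods."""

    def get_input_embeddings(self):
        return None  # subclasses are expected to override this

    def get_output_embeddings(self):
        return None  # subclasses are expected to override this

    def tie_weights(self):
        """Make the output layer share the input embedding's weight object."""
        src = self.get_input_embeddings()
        dst = self.get_output_embeddings()
        if src is None or dst is None:
            return False  # tying is skipped without raising an error
        dst.weight = src.weight
        return True


class UntiedToyModel(BaseModel):
    """Model code that forgot the accessors, so tying never happens."""

    def __init__(self):
        self.embed_tokens = Layer([[0.1, 0.2], [0.3, 0.4]])
        self.lm_head = Layer([[0.0, 0.0], [0.0, 0.0]])


class TiedToyModel(UntiedToyModel):
    """Same model with the accessors added, so tie_weights() can find both layers."""

    def get_input_embeddings(self):
        return self.embed_tokens

    def get_output_embeddings(self):
        return self.lm_head


broken = UntiedToyModel()
fixed = TiedToyModel()
broken_tied = broken.tie_weights()  # accessors missing: nothing is tied
fixed_tied = fixed.tie_weights()    # accessors present: weights are shared
```

In this sketch the untied model keeps an independent output weight that never receives the embedding values, which is the kind of mismatch that would affect checkpoints relying on tied embeddings.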