[PREFIX CACHING FOLLOW UP] OrderedDict-based evictor #3431
Conversation
cc @cadedaniel
This PR addresses the performance issues in the initial version of automatic prefix caching (APC), which focused on correctness. Specifically, in the initial implementation, APC suffered from poor performance after the server had been warmed up, due to slow eviction from the cache evictor. This issue does not arise until the server is properly warmed up (i.e. once all the blocks are marked as cached). This PR updates the eviction logic to use an ordered dict, which dramatically improves performance.

@ElizaWszola I added some more datasets to the benchmark scripts to properly run the server benchmark analysis.

The analysis below looks at Mistral-7b on an A100-80GB. The input shapes and request rate result in medium concurrency, with ~10 active requests at a time.

ShareGPT

First, we look at the ShareGPT dataset, which has ~no opportunity for prefix caching since none of the prompts are repeated. The goal is to have as little overhead from automatic prefix caching as possible in this case.

Current Main

On the A100-80GB, the eviction logic has a big negative impact on performance.

Prefix Caching Off

Prefix Caching On

^ note: this requires warming up the server with ~1000 requests so that the eviction logic is triggered.

This PR

This PR resolves the performance issues on main, driving the overhead down to ~3.5%.

Prefix Caching Off

Prefix Caching On

Sonnet Dataset

Next, we look at the Sonnet dataset, where we repeatedly send requests with the exact same prompt. This is a best-case scenario for APC, and we should expect to see performance gains.

Note: ignore the results below --- they need to be re-run with Llama, or with the sliding window turned off for Mistral. Zephyr is Mistral-based, so it ignores prefix caching on the forward pass due to the sliding window.

This PR

Prefix Caching Off

Prefix Caching On
Re-create Results

Launch Server
python3 -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-beta --disable-log-requests --max-model-len 4096
python3 -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-beta --disable-log-requests --max-model-len 4096 --enable-prefix-caching

Launch Client
python3 benchmark_serving_new.py --model HuggingFaceH4/zephyr-7b-beta --dataset ultrachat --num-prompts 1000 --backend openai --endpoint /v1/completions
python3 benchmark_serving_new.py --model HuggingFaceH4/zephyr-7b-beta --dataset sharegpt --request-rate 2.5 --num-prompts 1000 --backend openai --endpoint /v1/completions
python3 benchmark_serving_new.py --model HuggingFaceH4/zephyr-7b-beta --dataset sonnet --request-rate 2.5 --num-prompts 500 --backend openai --endpoint /v1/completions
Some initial comments I had before our offline chat. Let me know when this PR is ready for review again :)
benchmarks/sonnet.txt (Outdated)
Let's move this file to benchmarks/data/sonnet.txt to make the directory look cleaner?
Will do it in a separate PR.
benchmarks/benchmark_serving_new.py (Outdated)
What's the difference between this and the old benchmark_serving.py? If it's just WIP code it's fine. Otherwise, let's include this in another PR?
I'll make a separate PR that rewrites benchmark_serving.py a bit.
@zhuohan123 I've nuked the benchmark stuff and will move it to a different PR - this PR is ready for a re-review :)
for _, block in self.free_table.items():
    if evicted_block.last_accessed < block.last_accessed:
        break
    if evicted_block.num_hashed_tokens < block.num_hashed_tokens:
This doesn't seem correct to me? Say a block's last_accessed time is changed: its relative position in the OrderedDict will stay the same and will not be updated. Then we might evict a block that was accessed recently but was added to the dict early.
The evictor contains only blocks with ref count zero. If a block is allocated, it will be popped (the allocate() function in CachedBlockAllocator). So when we access a block, we first pop it, then update its access time and process it, and then push it again if its ref count goes down to zero again.
Oh I got it! This is smart lol
LGTM! Thanks for the clever fix!
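For readers following the thread, here is a minimal sketch of the pattern described above, assuming a simplified Block dataclass and evictor class invented for illustration (this is not the actual vLLM implementation):

from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Block:
    block_hash: int
    last_accessed: float
    num_hashed_tokens: int


class OrderedDictEvictor:
    """Holds only blocks whose ref count has dropped to zero."""

    def __init__(self) -> None:
        self.free_table: "OrderedDict[int, Block]" = OrderedDict()

    def add(self, block: Block) -> None:
        # Called when a block is freed; new entries land at the end,
        # so insertion order tracks last-access order for free blocks.
        self.free_table[block.block_hash] = block

    def remove(self, block_hash: int) -> Block:
        # Called when a cached block is re-allocated. The caller updates
        # last_accessed while it holds the block and adds it back later,
        # which is what keeps the ordering consistent.
        return self.free_table.pop(block_hash)

    def evict(self) -> Block:
        if not self.free_table:
            raise ValueError("No usable cache memory left")
        # The oldest (least recently freed) blocks sit at the front.
        # Among blocks that tie on last_accessed, prefer evicting the
        # one with the most hashed tokens, then stop early.
        evicted = next(iter(self.free_table.values()))
        for block in self.free_table.values():
            if evicted.last_accessed < block.last_accessed:
                break
            if evicted.num_hashed_tokens < block.num_hashed_tokens:
                evicted = block
        del self.free_table[evicted.block_hash]
        return evicted

In the flow described in the comments above, allocate() calls remove(), and freeing the block (ref count back to zero) calls add() with an updated last_accessed, so the OrderedDict never needs to be re-sorted.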
* upstream/main:
  [Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config (vllm-project#3551)
  [Misc][Log] Add log for tokenizer length not equal to vocabulary size (vllm-project#3500)
  [🚀 Ready to be merged] Added support for Jais models (vllm-project#3183)
  Fix 1D query issue from `_prune_hidden_states` (vllm-project#3539)
  [PREFIX CACHING FOLLOW UP] OrderedDict-based evictor (vllm-project#3431)
  [BugFix] Hot fix in setup.py for neuron build (vllm-project#3537)
  Migrate `logits` computation and gather to `model_runner` (vllm-project#3233)
  [1/n][Chunked Prefill] Refactor input query shapes (vllm-project#3236)
  [1/n] Triton sampling kernel (vllm-project#3186)
  [Bugfix] Fix ROCm support in CMakeLists.txt (vllm-project#3534)
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Luka <luka@paperspace>
Make the evictor based on OrderedDict rather than Dict, so we obtain faster eviction without damaging the performance of adding and removing PhysicalTokenBlocks. This lets us make the gap between the cached and uncached runtime smaller in throughput benchmarks.
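For contrast, here is a rough sketch of what a plain Dict-based evictor has to do on each eviction; the function name and the simplified Block fields are illustrative, not the removed vLLM code:

from dataclasses import dataclass
from typing import Dict


@dataclass
class Block:
    block_hash: int
    last_accessed: float
    num_hashed_tokens: int


def evict_with_full_scan(free_table: Dict[int, Block]) -> Block:
    # Dict-based approach: nothing keeps the free table in recency order,
    # so every eviction scans all free blocks to find the least recently
    # accessed one (breaking ties toward more hashed tokens). Once the
    # server is warm and the free table holds most of the cache, this
    # O(n) scan runs on nearly every allocation.
    victim = min(
        free_table.values(),
        key=lambda b: (b.last_accessed, -b.num_hashed_tokens),
    )
    del free_table[victim.block_hash]
    return victim

With the OrderedDict, the free table is already in last-access order for free blocks, so eviction only needs to look at the first few entries instead of scanning everything.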
Results of python benchmark_throughput_cache.py --backend vllm --model huggyllama/llama-7b --dataset ../data/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 500 (10 runs each):

OrderedDict-based Evictor (this PR)
Throughput: 8.42 requests/s, 4122.29 tokens/s
Throughput: 8.45 requests/s, 4137.79 tokens/s
Throughput: 8.46 requests/s, 4141.09 tokens/s
Throughput: 8.48 requests/s, 4154.49 tokens/s
Throughput: 8.48 requests/s, 4153.40 tokens/s
Throughput: 8.50 requests/s, 4161.52 tokens/s
Throughput: 8.52 requests/s, 4174.35 tokens/s
Throughput: 8.53 requests/s, 4175.47 tokens/s
Throughput: 8.53 requests/s, 4175.94 tokens/s
Throughput: 8.56 requests/s, 4192.30 tokens/s
Dict-based Evictor (old)
Throughput: 8.25 requests/s, 4040.61 tokens/s
Throughput: 8.34 requests/s, 4085.23 tokens/s
Throughput: 8.26 requests/s, 4045.35 tokens/s
Throughput: 8.27 requests/s, 4049.02 tokens/s
Throughput: 8.32 requests/s, 4075.07 tokens/s
Throughput: 8.26 requests/s, 4043.52 tokens/s
Throughput: 8.25 requests/s, 4038.19 tokens/s
Throughput: 8.34 requests/s, 4082.67 tokens/s
Throughput: 8.17 requests/s, 3998.23 tokens/s
Throughput: 8.24 requests/s, 4034.38 tokens/s
No prefix caching (with improvements from PR #3357)
Throughput: 8.50 requests/s, 4163.69 tokens/s
Throughput: 8.54 requests/s, 4183.18 tokens/s
Throughput: 8.56 requests/s, 4193.75 tokens/s
Throughput: 8.57 requests/s, 4198.46 tokens/s
Throughput: 8.60 requests/s, 4211.83 tokens/s
Throughput: 8.63 requests/s, 4228.16 tokens/s
Throughput: 8.67 requests/s, 4247.58 tokens/s
Throughput: 8.69 requests/s, 4253.06 tokens/s
Throughput: 8.75 requests/s, 4286.50 tokens/s
Throughput: 8.78 requests/s, 4296.93 tokens/s
Results of running benchmarks from https://github.com/neuralmagic/nm-vllm/pull/102/files

with --enable-prefix-caching:

without --enable-prefix-caching:

New Sonnet Dataset results with the huggyllama/llama-7b model:

(See @robertgshaw2-neuralmagic's comment below for more info)
This PR
Prefix Caching Off
Prefix Caching On