[PREFIX CACHING FOLLOW UP] OrderedDict-based evictor #3431
Conversation
cc @cadedaniel
This PR addresses the performance issues in the initial version of automatic prefix caching (APC), which focused on correctness. Specifically, in the initial implementation, APC suffered from poor performance after the server had been warmed up, due to slow eviction from the cache evictor. This issue does not arise until the server is properly warmed up (i.e. once all the blocks are marked as cached). This PR updates the eviction logic to use an ordered dict, which dramatically improves performance.

@ElizaWszola I added some more datasets to the benchmark scripts to properly run the server benchmark analysis.

The analysis below looks at Mistral-7b on an A100-80GB. The input shapes and request rate result in medium concurrency, with ~10 active requests at a time.

ShareGPT

First, we look at the ShareGPT dataset, which has ~no opportunity for prefix caching since none of the prompts are repeated. The goal is to have as little overhead from automatic prefix caching as possible in this case.

Current Main

On the A100-80GB, the eviction logic has a big negative impact on performance.

Prefix Caching Off

Prefix Caching On

^ note: this requires warming up the server with ~1000 requests so that the eviction logic is triggered.

This PR

This PR resolves the performance issues on main, driving the overhead down to ~3.5%.

Prefix Caching Off

Prefix Caching On

Sonnet Dataset

Next, we look at the Sonnet dataset, where we repeatedly send requests with the exact same prompt. This is a best-case scenario for APC, and we should expect to see performance gains.

Note: ignore the results below --- they need to be re-run with Llama, or with the sliding window turned off for Mistral. Zephyr is Mistral-based, so it ignores prefix caching on the forward pass due to the sliding window.

This PR

Prefix Caching Off

Prefix Caching On
Re-create Results

Launch Server
python3 -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-beta --disable-log-requests --max-model-len 4096
python3 -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-beta --disable-log-requests --max-model-len 4096 --enable-prefix-caching

Launch Client
python3 benchmark_serving_new.py --model HuggingFaceH4/zephyr-7b-beta --dataset ultrachat --num-prompts 1000 --backend openai --endpoint /v1/completions
python3 benchmark_serving_new.py --model HuggingFaceH4/zephyr-7b-beta --dataset sharegpt --request-rate 2.5 --num-prompts 1000 --backend openai --endpoint /v1/completions
python3 benchmark_serving_new.py --model HuggingFaceH4/zephyr-7b-beta --dataset sonnet --request-rate 2.5 --num-prompts 500 --backend openai --endpoint /v1/completions
Some initial comments I had before our offline chat. Let me know when this PR is ready for review again :)
benchmarks/sonnet.txt (Outdated)
Let's move this file to benchmarks/data/sonnet.txt to make the directory look cleaner?
Will do it in a separate PR.
benchmarks/benchmark_serving_new.py (Outdated)
What's the difference between this and the old benchmark_serving.py? If it's just WIP code it's fine. Otherwise, let's include this in another PR?
I'll make a separate PR that rewrites benchmark_serving.py a bit.
@zhuohan123 I've nuked the benchmark stuff and will move it to a different PR - this PR is ready for a re-review :)
for _, block in self.free_table.items():
    if evicted_block.last_accessed < block.last_accessed:
        break
    if evicted_block.num_hashed_tokens < block.num_hashed_tokens:
This doesn't seem correct to me? Say a block's last_accessed time is changed: its relative position in the OrderedDict will stay the same and will not be updated. Then we might evict a block that was accessed recently but was added to the dict early.
The evictor contains only blocks with ref count zero. If a block is allocated, it will be popped (the allocate() function in CachedBlockAllocator). So when we access a block, we first pop it, then update its access time and process it, and then push it again if its ref count goes down to zero again.
Oh I got it! This is smart lol
LGTM! Thanks for the clever fix!
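For readers following the thread, here is a minimal sketch of the pattern described above, assuming a simplified Block dataclass and evictor class invented for illustration (this is not the actual vLLM implementation):

from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Block:
    block_hash: int
    last_accessed: float
    num_hashed_tokens: int


class OrderedDictEvictor:
    """Holds only blocks whose ref count has dropped to zero."""

    def __init__(self) -> None:
        self.free_table: "OrderedDict[int, Block]" = OrderedDict()

    def add(self, block: Block) -> None:
        # Called when a block is freed; new entries land at the end,
        # so insertion order tracks last-access order for free blocks.
        self.free_table[block.block_hash] = block

    def remove(self, block_hash: int) -> Block:
        # Called when a cached block is re-allocated. The caller updates
        # last_accessed while it holds the block and adds it back later,
        # which is what keeps the ordering consistent.
        return self.free_table.pop(block_hash)

    def evict(self) -> Block:
        if not self.free_table:
            raise ValueError("No usable cache memory left")
        # The oldest (least recently freed) blocks sit at the front.
        # Among blocks that tie on last_accessed, prefer evicting the
        # one with the most hashed tokens, then stop early.
        evicted = next(iter(self.free_table.values()))
        for block in self.free_table.values():
            if evicted.last_accessed < block.last_accessed:
                break
            if evicted.num_hashed_tokens < block.num_hashed_tokens:
                evicted = block
        del self.free_table[evicted.block_hash]
        return evicted

In the flow described in the comments above, allocate() calls remove(), and freeing the block (ref count back to zero) calls add() with an updated last_accessed, so the OrderedDict never needs to be re-sorted.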
* upstream/main:
  [Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config (vllm-project#3551)
  [Misc][Log] Add log for tokenizer length not equal to vocabulary size (vllm-project#3500)
  [🚀 Ready to be merged] Added support for Jais models (vllm-project#3183)
  Fix 1D query issue from `_prune_hidden_states` (vllm-project#3539)
  [PREFIX CACHING FOLLOW UP] OrderedDict-based evictor (vllm-project#3431)
  [BugFix] Hot fix in setup.py for neuron build (vllm-project#3537)
  Migrate `logits` computation and gather to `model_runner` (vllm-project#3233)
  [1/n][Chunked Prefill] Refactor input query shapes (vllm-project#3236)
  [1/n] Triton sampling kernel (vllm-project#3186)
  [Bugfix] Fix ROCm support in CMakeLists.txt (vllm-project#3534)
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Luka <luka@paperspace>
Make the evictor based on OrderedDict rather than Dict, so we obtain faster eviction without damaging the performance of adding and removing PhysicalTokenBlocks. This lets us make the gap between the cached and uncached runtime smaller in throughput benchmarks.
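For contrast, here is a rough sketch of what a plain Dict-based evictor has to do on each eviction; the function name and the simplified Block fields are illustrative, not the removed vLLM code:

from dataclasses import dataclass
from typing import Dict


@dataclass
class Block:
    block_hash: int
    last_accessed: float
    num_hashed_tokens: int


def evict_with_full_scan(free_table: Dict[int, Block]) -> Block:
    # Dict-based approach: nothing keeps the free table in recency order,
    # so every eviction scans all free blocks to find the least recently
    # accessed one (breaking ties toward more hashed tokens). Once the
    # server is warm and the free table holds most of the cache, this
    # O(n) scan runs on nearly every allocation.
    victim = min(
        free_table.values(),
        key=lambda b: (b.last_accessed, -b.num_hashed_tokens),
    )
    del free_table[victim.block_hash]
    return victim

With the OrderedDict, the free table is already in last-access order for free blocks, so eviction only needs to look at the first few entries instead of scanning everything.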
Results of python benchmark_throughput_cache.py --backend vllm --model huggyllama/llama-7b --dataset ../data/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 500 (10 runs each):

OrderedDict-based Evictor (this PR)
Throughput: 8.42 requests/s, 4122.29 tokens/s
Throughput: 8.45 requests/s, 4137.79 tokens/s
Throughput: 8.46 requests/s, 4141.09 tokens/s
Throughput: 8.48 requests/s, 4154.49 tokens/s
Throughput: 8.48 requests/s, 4153.40 tokens/s
Throughput: 8.50 requests/s, 4161.52 tokens/s
Throughput: 8.52 requests/s, 4174.35 tokens/s
Throughput: 8.53 requests/s, 4175.47 tokens/s
Throughput: 8.53 requests/s, 4175.94 tokens/s
Throughput: 8.56 requests/s, 4192.30 tokens/s
Dict-based Evictor (old)
Throughput: 8.25 requests/s, 4040.61 tokens/s
Throughput: 8.34 requests/s, 4085.23 tokens/s
Throughput: 8.26 requests/s, 4045.35 tokens/s
Throughput: 8.27 requests/s, 4049.02 tokens/s
Throughput: 8.32 requests/s, 4075.07 tokens/s
Throughput: 8.26 requests/s, 4043.52 tokens/s
Throughput: 8.25 requests/s, 4038.19 tokens/s
Throughput: 8.34 requests/s, 4082.67 tokens/s
Throughput: 8.17 requests/s, 3998.23 tokens/s
Throughput: 8.24 requests/s, 4034.38 tokens/s
No prefix caching (with improvements from PR #3357)
Throughput: 8.50 requests/s, 4163.69 tokens/s
Throughput: 8.54 requests/s, 4183.18 tokens/s
Throughput: 8.56 requests/s, 4193.75 tokens/s
Throughput: 8.57 requests/s, 4198.46 tokens/s
Throughput: 8.60 requests/s, 4211.83 tokens/s
Throughput: 8.63 requests/s, 4228.16 tokens/s
Throughput: 8.67 requests/s, 4247.58 tokens/s
Throughput: 8.69 requests/s, 4253.06 tokens/s
Throughput: 8.75 requests/s, 4286.50 tokens/s
Throughput: 8.78 requests/s, 4296.93 tokens/s
Results of running benchmarks from https://github.com/neuralmagic/nm-vllm/pull/102/files

with --enable-prefix-caching:

without --enable-prefix-caching:

New Sonnet Dataset results with the huggyllama/llama-7b model:

(See @robertgshaw2-neuralmagic's comment below for more info)
This PR
Prefix Caching Off
Prefix Caching On