
server : Smart selection of available slot using Longest Common Prefix #7728

Merged — 6 commits merged Jun 8, 2024

Conversation

sasha0552
Contributor

@sasha0552 sasha0552 commented Jun 4, 2024

In the current implementation, an available slot is selected using LRU (Least Recently Used). This PR adds slot selection by the ~~LCS (Longest Common Substring)~~ LCP (Longest Common Prefix) algorithm, selecting a slot whose cached prompt has at least n% similarity to the requested prompt. This reduces prompt processing in multi-user scenarios.

Additionally, this PR:

  1. Defers processing if a slot is unavailable and the user has explicitly requested a specific slot.
  2. Fixes the erase/save/load slot endpoints. In the current implementation, these endpoints fall back to an LRU-selected slot if the requested slot is unavailable, which is incorrect behavior.
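The selection heuristic described above can be sketched roughly as follows (a standalone illustration; the names, the similarity check, and the `-1` fallback convention are all hypothetical, not the PR's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Length of the longest common prefix between two token sequences.
static size_t common_prefix_len(const std::vector<int> & a, const std::vector<int> & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}

struct slot_info {
    int id;
    std::vector<int> cache_tokens; // tokens of the prompt last processed by this slot
};

// Pick the slot whose cached prompt shares the longest prefix with the new
// prompt, but only if the match covers at least `min_similarity` of the new
// prompt; otherwise return -1 (the caller would then fall back to e.g. LRU).
static int pick_slot_by_lcp(const std::vector<slot_info> & slots,
                            const std::vector<int> & prompt,
                            float min_similarity) {
    if (prompt.empty()) {
        return -1;
    }
    int    best_id  = -1;
    size_t best_len = 0;
    for (const auto & s : slots) {
        const size_t len = common_prefix_len(s.cache_tokens, prompt);
        if (len > best_len && (float) len / prompt.size() >= min_similarity) {
            best_len = len;
            best_id  = s.id;
        }
    }
    return best_id;
}
```

The threshold keeps a slot from being claimed by a barely related prompt; below it, the server would keep its existing LRU behavior.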

Contributor

github-actions bot commented Jun 4, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 522 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8935.67ms p(95)=22613.33ms fails=, finish reason: stop=462 truncated=60
  • Prompt processing (pp): avg=98.01tk/s p(95)=403.94tk/s
  • Token generation (tg): avg=36.92tk/s p(95)=45.07tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=use-slot-by-lcs commit=a8842fdf56dc725b69c19332d46dc8bbf612069e

[chart: llamacpp:prompt_tokens_seconds — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 522 iterations]
[chart: llamacpp:predicted_tokens_seconds — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 522 iterations]

Details

[chart: llamacpp:kv_cache_usage_ratio — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 522 iterations]
[chart: llamacpp:requests_processing — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 522 iterations]

@mofosyne mofosyne added the “Review Complexity : Medium” label (generally requires more time to grok, but manageable by beginner to medium expertise level) on Jun 5, 2024
Owner

@ggerganov ggerganov left a comment


Need to rebase to latest master and we can merge

@sasha0552
Contributor Author

I'll test again and then mark the PR as ready for review.

@sasha0552
Contributor Author

By the way, should this be on by default? Or is it better to leave it off as it is now?

@sasha0552 sasha0552 marked this pull request as ready for review June 5, 2024 09:17
@ggerganov
Owner

> This PR adds slot selection by LCS (Longest Common Substring) algorithm to select a slot with a prompt that has at least n% similarity to the requested prompt. This reduces prompt processing in multi-user scenarios.

The LCS algorithm is overkill for this purpose. All you need to look for is the longest common prefix, which is much simpler to compute.
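The cost difference is easy to see side by side (a standalone sketch over strings; both function names are illustrative, not from the codebase):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Longest common substring length: O(n*m) time and memory via a DP table.
static size_t lcs_len(const std::string & a, const std::string & b) {
    std::vector<std::vector<size_t>> dp(a.size() + 1, std::vector<size_t>(b.size() + 1, 0));
    size_t best = 0;
    for (size_t i = 1; i <= a.size(); i++) {
        for (size_t j = 1; j <= b.size(); j++) {
            if (a[i - 1] == b[j - 1]) {
                dp[i][j] = dp[i - 1][j - 1] + 1;
                best = std::max(best, dp[i][j]);
            }
        }
    }
    return best;
}

// Longest common prefix length: a single O(min(n, m)) scan that stops at
// the first mismatch.
static size_t lcp_len(const std::string & a, const std::string & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}
```

The LCS needs a quadratic dynamic-programming table, while the LCP is one linear pass — and since cached KV entries are only reusable from the start of the sequence anyway, the prefix is the quantity that actually matters here.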

@sasha0552
Contributor Author

sasha0552 commented Jun 5, 2024

As far as I know, the server can reuse not only the prompt prefix, but also the suffix (the llama_kv_cache_seq_rm & llama_kv_cache_seq_add sequence, also known as context shifting):

// apply context-shift if needed
// TODO: simplify and improve
for (server_slot & slot : slots) {
    if (slot.ga_n == 1) {
        if (slot.is_processing() && (int) system_tokens.size() + slot.n_past >= slot.n_ctx - 1) {
            // Shift context
            const int n_keep    = slot.params.n_keep + add_bos_token;
            const int n_left    = (int) system_tokens.size() + slot.n_past - n_keep;
            const int n_discard = slot.params.n_discard ? slot.params.n_discard : (n_left / 2);

            LOG_INFO("slot context shift", {
                {"id_slot",         slot.id},
                {"id_task",         slot.id_task},
                {"n_keep",          n_keep},
                {"n_left",          n_left},
                {"n_discard",       n_discard},
                {"n_ctx",           n_ctx},
                {"n_past",          slot.n_past},
                {"n_system_tokens", system_tokens.size()},
                {"n_cache_tokens",  slot.cache_tokens.size()}
            });

            llama_kv_cache_seq_rm (ctx, slot.id + 1, n_keep            , n_keep + n_discard);
            llama_kv_cache_seq_add(ctx, slot.id + 1, n_keep + n_discard, system_tokens.size() + slot.n_past, -n_discard);

            if (slot.params.cache_prompt) {
                for (size_t i = n_keep + n_discard; i < slot.cache_tokens.size(); i++) {
                    slot.cache_tokens[i - n_discard] = slot.cache_tokens[i];
                }
                slot.cache_tokens.resize(slot.cache_tokens.size() - n_discard);
            }

            slot.n_past -= n_discard;
            slot.truncated = true;
        }
    }
}
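The n_keep/n_left/n_discard arithmetic in the snippet above can be illustrated in isolation (a hypothetical helper with made-up numbers; the real code operates on the slot state directly):

```cpp
#include <cassert>

// n_keep tokens stay pinned at the start of the context, the remaining
// n_left tokens are shiftable, and by default half of them are discarded
// to free space for new tokens.
struct shift_plan {
    int n_keep;
    int n_left;
    int n_discard;
};

// n_system: system tokens; n_past: tokens already cached for this slot;
// n_keep: tokens to pin; n_discard_param: user override (0 = use the default).
static shift_plan plan_context_shift(int n_system, int n_past, int n_keep, int n_discard_param) {
    const int n_left    = n_system + n_past - n_keep;
    const int n_discard = n_discard_param ? n_discard_param : (n_left / 2);
    return { n_keep, n_left, n_discard };
}
```

For example, with 4096 tokens cached and 256 pinned, 3840 tokens are shiftable and 1920 of them are dropped by default, after which the survivors are shifted down by llama_kv_cache_seq_add.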

@ggerganov
Owner

Although the llama interface allows various operations with the KV cache, such as shifting the tokens' position and removing tokens, the server only looks for a common prefix when the prompt cache is enabled:

// reuse any previously computed tokens that are common with the new prompt
slot.n_past = common_part(slot.cache_tokens, prompt_tokens);

So, for now, it is better for the slot selection logic to follow the prompt caching logic and look only at the prefix.