
server : Smart selection of available slot using Longest Common Prefix #7728

Merged — 6 commits merged Jun 8, 2024

Conversation

sasha0552
Contributor

@sasha0552 sasha0552 commented Jun 4, 2024

In the current implementation, an available slot is selected using LRU (Least Recently Used). This PR adds slot selection by the ~~LCS (Longest Common Substring)~~ LCP (Longest Common Prefix) algorithm, selecting a slot whose cached prompt has at least n% similarity to the requested prompt. This reduces prompt processing in multi-user scenarios.

Additionally, this PR:

  1. Defers processing if a slot is unavailable and the user has explicitly requested a specific slot.
  2. Fixes the erase/save/load slot endpoints. In the current implementation, these endpoints fall back to an LRU-selected slot if the requested slot is unavailable, which is incorrect behavior.
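The selection heuristic described above can be sketched roughly as follows (a standalone illustration; the names, the similarity check, and the `-1` fallback convention are all hypothetical, not the PR's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Length of the longest common prefix between two token sequences.
static size_t common_prefix_len(const std::vector<int> & a, const std::vector<int> & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}

struct slot_info {
    int id;
    std::vector<int> cache_tokens; // tokens of the prompt last processed by this slot
};

// Pick the slot whose cached prompt shares the longest prefix with the new
// prompt, but only if the match covers at least `min_similarity` of the new
// prompt; otherwise return -1 (the caller would then fall back to e.g. LRU).
static int pick_slot_by_lcp(const std::vector<slot_info> & slots,
                            const std::vector<int> & prompt,
                            float min_similarity) {
    if (prompt.empty()) {
        return -1;
    }
    int    best_id  = -1;
    size_t best_len = 0;
    for (const auto & s : slots) {
        const size_t len = common_prefix_len(s.cache_tokens, prompt);
        if (len > best_len && (float) len / prompt.size() >= min_similarity) {
            best_len = len;
            best_id  = s.id;
        }
    }
    return best_id;
}
```

The threshold keeps a slot from being claimed by a barely related prompt; below it, the server would keep its existing LRU behavior.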

Contributor

github-actions bot commented Jun 4, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 522 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8935.67ms p(95)=22613.33ms fails=, finish reason: stop=462 truncated=60
  • Prompt processing (pp): avg=98.01tk/s p(95)=403.94tk/s
  • Token generation (tg): avg=36.92tk/s p(95)=45.07tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=use-slot-by-lcs commit=a8842fdf56dc725b69c19332d46dc8bbf612069e

[chart: llamacpp:prompt_tokens_seconds — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 522 iterations]
[chart: llamacpp:predicted_tokens_seconds — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 522 iterations]

Details

[chart: llamacpp:kv_cache_usage_ratio — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 522 iterations]
[chart: llamacpp:requests_processing — bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 522 iterations]

@mofosyne mofosyne added the “Review Complexity : Medium” label (generally requires more time to grok, but manageable by beginner to medium expertise level) on Jun 5, 2024
Owner

@ggerganov ggerganov left a comment


Need to rebase to latest master and we can merge

@sasha0552
Contributor Author

I'll test again and then mark the PR as ready for review.

@sasha0552
Contributor Author

By the way, should this be on by default? Or is it better to leave it off as it is now?

@sasha0552 sasha0552 marked this pull request as ready for review June 5, 2024 09:17
@ggerganov
Owner

> This PR adds slot selection by LCS (Longest Common Substring) algorithm to select a slot with a prompt that has at least n% similarity to the requested prompt. This reduces prompt processing in multi-user scenarios.

The LCS algorithm is overkill for this purpose. All you need to look for is the longest common prefix, which is much simpler to compute.
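The cost difference is easy to see side by side (a standalone sketch over strings; both function names are illustrative, not from the codebase):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Longest common substring length: O(n*m) time and memory via a DP table.
static size_t lcs_len(const std::string & a, const std::string & b) {
    std::vector<std::vector<size_t>> dp(a.size() + 1, std::vector<size_t>(b.size() + 1, 0));
    size_t best = 0;
    for (size_t i = 1; i <= a.size(); i++) {
        for (size_t j = 1; j <= b.size(); j++) {
            if (a[i - 1] == b[j - 1]) {
                dp[i][j] = dp[i - 1][j - 1] + 1;
                best = std::max(best, dp[i][j]);
            }
        }
    }
    return best;
}

// Longest common prefix length: a single O(min(n, m)) scan that stops at
// the first mismatch.
static size_t lcp_len(const std::string & a, const std::string & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}
```

The LCS needs a quadratic dynamic-programming table, while the LCP is one linear pass — and since cached KV entries are only reusable from the start of the sequence anyway, the prefix is the quantity that actually matters here.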

@sasha0552
Contributor Author

sasha0552 commented Jun 5, 2024

As far as I know, the server can reuse not only the prompt prefix, but also the suffix (the llama_kv_cache_seq_rm & llama_kv_cache_seq_add sequence, also known as context shifting):

// apply context-shift if needed
// TODO: simplify and improve
for (server_slot & slot : slots) {
    if (slot.ga_n == 1) {
        if (slot.is_processing() && (int) system_tokens.size() + slot.n_past >= slot.n_ctx - 1) {
            // Shift context
            const int n_keep    = slot.params.n_keep + add_bos_token;
            const int n_left    = (int) system_tokens.size() + slot.n_past - n_keep;
            const int n_discard = slot.params.n_discard ? slot.params.n_discard : (n_left / 2);

            LOG_INFO("slot context shift", {
                {"id_slot",         slot.id},
                {"id_task",         slot.id_task},
                {"n_keep",          n_keep},
                {"n_left",          n_left},
                {"n_discard",       n_discard},
                {"n_ctx",           n_ctx},
                {"n_past",          slot.n_past},
                {"n_system_tokens", system_tokens.size()},
                {"n_cache_tokens",  slot.cache_tokens.size()}
            });

            llama_kv_cache_seq_rm (ctx, slot.id + 1, n_keep            , n_keep + n_discard);
            llama_kv_cache_seq_add(ctx, slot.id + 1, n_keep + n_discard, system_tokens.size() + slot.n_past, -n_discard);

            if (slot.params.cache_prompt) {
                for (size_t i = n_keep + n_discard; i < slot.cache_tokens.size(); i++) {
                    slot.cache_tokens[i - n_discard] = slot.cache_tokens[i];
                }
                slot.cache_tokens.resize(slot.cache_tokens.size() - n_discard);
            }

            slot.n_past -= n_discard;
            slot.truncated = true;
        }
    }
}
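The n_keep/n_left/n_discard arithmetic in the snippet above can be illustrated in isolation (a hypothetical helper with made-up numbers; the real code operates on the slot state directly):

```cpp
#include <cassert>

// n_keep tokens stay pinned at the start of the context, the remaining
// n_left tokens are shiftable, and by default half of them are discarded
// to free space for new tokens.
struct shift_plan {
    int n_keep;
    int n_left;
    int n_discard;
};

// n_system: system tokens; n_past: tokens already cached for this slot;
// n_keep: tokens to pin; n_discard_param: user override (0 = use the default).
static shift_plan plan_context_shift(int n_system, int n_past, int n_keep, int n_discard_param) {
    const int n_left    = n_system + n_past - n_keep;
    const int n_discard = n_discard_param ? n_discard_param : (n_left / 2);
    return { n_keep, n_left, n_discard };
}
```

For example, with 4096 tokens cached and 256 pinned, 3840 tokens are shiftable and 1920 of them are dropped by default, after which the survivors are shifted down by llama_kv_cache_seq_add.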

@ggerganov
Owner

Although the llama interface allows various operations with the KV cache, such as shifting the tokens' position and removing tokens, the server only looks for a common prefix when the prompt cache is enabled:

// reuse any previously computed tokens that are common with the new prompt
slot.n_past = common_part(slot.cache_tokens, prompt_tokens);

So, for now, it is better for the slot selection logic to follow the prompt caching logic and look only at the prefix.