
# Guide on Hyperparameter Tuning

## Achieving Peak Throughput

Achieving a large batch size is the most important factor in attaining high throughput.

When the server is running at full load, look for the following line in the log:

```
[gpu_id=0] Decode batch. #running-req: 233, #token: 370959, token usage: 0.82, gen throughput (token/s): 4594.01, #queue-req: 417
```
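For reference, a minimal launch sketch that produces logs like the one above; the model path and port are placeholders, and the tuning flags discussed below are passed at launch time in the same way:

```bash
# Minimal launch sketch: model path and port are placeholders.
# Watch stdout for the "Decode batch" lines while sending load.
python -m sglang.launch_server \
  --model-path <model> \
  --port 30000
```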

## Tune Your Request Submission Speed

`#queue-req` indicates the number of requests in the queue. If you frequently see `#queue-req == 0`, it suggests you are bottlenecked by the request submission speed; a healthy range for `#queue-req` is 100 to 3000. A load-generation sketch follows.
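To keep the queue populated, submit requests concurrently rather than one at a time. A minimal sketch, assuming the server exposes its default HTTP `/generate` endpoint on port 30000 (the prompt and request count are placeholders):

```bash
# Load-generation sketch: fire many requests in parallel so #queue-req
# stays in the healthy 100-3000 range instead of draining to 0.
for i in $(seq 1 500); do
  curl -s http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "Write a haiku about GPUs.", "sampling_params": {"max_new_tokens": 64}}' \
    > /dev/null &
done
wait  # block until all background requests finish
```

In practice you would use an async client or a benchmarking script, but the point is the same: keep enough requests in flight that the scheduler never starves.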

## Tune `--schedule-conservativeness`

`token usage` indicates the KV cache memory utilization of the server. `token usage > 0.9` means good utilization. If you frequently see `token usage < 0.9` and `#queue-req > 0`, the server is too conservative about admitting new requests, and you can decrease `--schedule-conservativeness` to a value like 0.3. The server tends to be too conservative when users send many requests with a large `max_new_tokens` but the requests stop early due to EOS or stop strings.

On the other hand, if you see very high `token usage` together with frequent warnings like `decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.
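Both adjustments are launch-time flags. A sketch of the two cases (model path is a placeholder):

```bash
# Too conservative (token usage < 0.9 while #queue-req > 0): admit more requests.
python -m sglang.launch_server --model-path <model> --schedule-conservativeness 0.3

# Too aggressive (frequent "decode out of memory" retractions): admit fewer.
python -m sglang.launch_server --model-path <model> --schedule-conservativeness 1.3
```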

## Tune `--dp-size` and `--tp-size`

Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism over tensor parallelism.
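For example, on an 8-GPU node a model that fits on a single GPU can run with pure data parallelism, while a larger model can combine both. The model paths below are placeholders:

```bash
# 8 GPUs, model fits on one GPU: pure data parallelism, best throughput.
python -m sglang.launch_server --model-path <small-model> --dp-size 8 --tp-size 1

# 8 GPUs, model needs 4 GPUs' worth of memory: tensor parallelism within a
# replica, data parallelism across replicas (4 x 2 = 8 GPUs total).
python -m sglang.launch_server --model-path <large-model> --tp-size 4 --dp-size 2
```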

## (Minor) Tune `--max-prefill-tokens`, `--mem-fraction-static`, `--max-running-requests`

If you see out of memory (OOM) errors, you can decrease these parameters:

- If OOM happens during prefill, decrease `--max-prefill-tokens`.
- If OOM happens during decoding, decrease `--max-running-requests`.
- You can also decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps with both prefill and decoding.
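A sketch of a more conservative configuration to try when hitting OOM; the numeric values here are illustrative starting points, not recommendations, and the model path is a placeholder:

```bash
# OOM mitigation sketch: lower all three knobs, then raise them back one at
# a time while watching memory. Values are illustrative only.
python -m sglang.launch_server --model-path <model> \
  --max-prefill-tokens 8192 \
  --max-running-requests 128 \
  --mem-fraction-static 0.8
```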

## (Minor) Tune `--schedule-heuristic`

If you have many shared prefixes, use the default `--schedule-heuristic lpm` (`lpm` stands for longest prefix match). If you have no shared prefixes at all, or you always send requests with shared prefixes together, you can try `--schedule-heuristic fcfs` (`fcfs` stands for first come, first served).
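For example, a workload of fully independent requests with no prompt overlap might launch like this (model path is a placeholder):

```bash
# No shared prefixes: fcfs avoids the prefix-matching reordering done by lpm.
python -m sglang.launch_server --model-path <model> --schedule-heuristic fcfs
```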