TGI benchmark with llmperf #564
Conversation
Compare b6696a4 to b8f310f
Just left a few questions and nits. LGTM
@@ -0,0 +1,29 @@
#!/bin/bash

model=${1:-NousResearch/Llama-2-7b-chat-hf}
Do we really want a default model for the benchmark?
It also serves as an explanation of the args that can be passed.
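For illustration, a minimal sketch of that pattern: a positional argument with a default both accepts an override and documents the expected input (the second argument is hypothetical, added only to show the shape):

#!/bin/bash
# arg 1: model to benchmark (defaults to the Llama-2 7B chat checkpoint)
model=${1:-NousResearch/Llama-2-7b-chat-hf}
# arg 2: where to write results (hypothetical, for illustration)
results_dir=${2:-tgi_bench_results}

echo "Benchmarking ${model}, writing results to ${results_dir}"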
import glob
import json

import pandas as pd

# Gather the llmperf summary files produced by each benchmark run
filenames = glob.glob("tgi_bench_results/*/*summary.json")

results = []

for filename in filenames:
    with open(filename) as f:
        summary = json.load(f)
    # Keep only the metrics reported in the final CSV
    d = {
        "model_id": summary["model"],
        "concurrent requests": summary["num_concurrent_requests"],
        "throughput (t/s)": summary["results_mean_output_throughput_token_per_s"],
        "Time-to-first-token @ P50 (s)": summary["results_ttft_s_quantiles_p50"],
        "average latency (ms)": summary["results_inter_token_latency_s_quantiles_p50"] * 1000,
    }
    results.append(pd.DataFrame.from_dict(d, orient="index").transpose())

# One row per run, ordered by concurrency level
df = pd.concat(results).sort_values(by="concurrent requests")
df.to_csv("tgi-results.csv", index=False)
nit: I would just guard that with an if __name__ == "__main__" check.
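As a minimal sketch of the suggested guard (the main function name and the elided body are conventions, not something the review prescribes):

def main():
    # The aggregation code above moves here, so importing this module
    # no longer reads files or writes the CSV as a side effect.
    ...

if __name__ == "__main__":
    main()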
MODEL_ID='NousResearch/Llama-2-7b-chat-hf'
HF_BATCH_SIZE=32
HF_SEQUENCE_LENGTH=4096
HF_AUTO_CAST_TYPE='fp16'
Is it fp16 or bf16?
fp16
Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
What does this PR do?
This PR adds scripts to benchmark TGI deployments that run several TGI servers on the same host behind a load balancer to achieve Data Parallelism. The test client is llmperf.
It also includes results for Llama 7b and Mistral v2 deployed on an inf2.48xlarge instance in a DP3 TP8 configuration (3 data-parallel replicas, each using tensor parallelism of degree 8).
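For orientation, a minimal sketch of what such a setup can look like (the launcher invocation, ports, and replica count here are illustrative assumptions, not necessarily what the PR's scripts do):

# Start 3 TGI replicas (DP=3) on consecutive ports; each replica is
# separately configured for tensor parallelism across 8 cores (TP8).
# A load balancer (e.g. nginx round-robin over ports 8080-8082) then
# spreads llmperf traffic across the replicas.
for i in 0 1 2; do
  text-generation-launcher \
    --model-id NousResearch/Llama-2-7b-chat-hf \
    --port $((8080 + i)) &
done
wait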