TGI benchmark with llmperf #564
Conversation
Compare b6696a4 to b8f310f
Just left a few questions and nits. LGTM
@@ -0,0 +1,29 @@
#!/bin/bash

model=${1:-NousResearch/Llama-2-7b-chat-hf}
Do we really want a default model for the benchmark?
It also serves as an explanation of the args that can be passed.
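For illustration, a minimal sketch of that pattern: a positional argument with a default both accepts an override and documents the expected input (the second argument is hypothetical, added only to show the shape):

#!/bin/bash
# arg 1: model to benchmark (defaults to the Llama-2 7B chat checkpoint)
model=${1:-NousResearch/Llama-2-7b-chat-hf}
# arg 2: where to write results (hypothetical, for illustration)
results_dir=${2:-tgi_bench_results}

echo "Benchmarking ${model}, writing results to ${results_dir}"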
import glob
import json

import pandas as pd

# Gather the llmperf summary files produced by each benchmark run
filenames = glob.glob("tgi_bench_results/*/*summary.json")

results = []

for filename in filenames:
    with open(filename) as f:
        summary = json.load(f)
    # Keep only the metrics reported in the final CSV
    d = {
        "model_id": summary["model"],
        "concurrent requests": summary["num_concurrent_requests"],
        "throughput (t/s)": summary["results_mean_output_throughput_token_per_s"],
        "Time-to-first-token @ P50 (s)": summary["results_ttft_s_quantiles_p50"],
        "average latency (ms)": summary["results_inter_token_latency_s_quantiles_p50"] * 1000,
    }
    results.append(pd.DataFrame.from_dict(d, orient="index").transpose())

# One row per run, ordered by concurrency level
df = pd.concat(results).sort_values(by="concurrent requests")
df.to_csv("tgi-results.csv", index=False)
nit: I would just guard that with an if __name__ == "__main__" check.
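As a minimal sketch of the suggested guard (the main function name and the elided body are conventions, not something the review prescribes):

def main():
    # The aggregation code above moves here, so importing this module
    # no longer reads files or writes the CSV as a side effect.
    ...

if __name__ == "__main__":
    main()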
MODEL_ID='NousResearch/Llama-2-7b-chat-hf'
HF_BATCH_SIZE=32
HF_SEQUENCE_LENGTH=4096
HF_AUTO_CAST_TYPE='fp16'
Is it fp16 or bf16?
fp16
Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
What does this PR do?
This PR adds scripts to benchmark TGI deployments that run several TGI servers on the same host behind a load balancer to achieve Data Parallelism. The test client is llmperf.
It also includes results for Llama 7b and Mistral v2 deployed on an inf2.48xlarge instance in a DP3 TP8 configuration (3 data-parallel replicas, each using tensor parallelism of degree 8).
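For orientation, a minimal sketch of what such a setup can look like (the launcher invocation, ports, and replica count here are illustrative assumptions, not necessarily what the PR's scripts do):

# Start 3 TGI replicas (DP=3) on consecutive ports; each replica is
# separately configured for tensor parallelism across 8 cores (TP8).
# A load balancer (e.g. nginx round-robin over ports 8080-8082) then
# spreads llmperf traffic across the replicas.
for i in 0 1 2; do
  text-generation-launcher \
    --model-id NousResearch/Llama-2-7b-chat-hf \
    --port $((8080 + i)) &
done
wait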