
TGI benchmark with llmperf #564

Merged
merged 5 commits into main on Apr 12, 2024
Conversation

dacorvo
Collaborator

@dacorvo dacorvo commented Apr 11, 2024

What does this PR do?

This adds scripts to benchmark TGI deployments that run several TGI servers on the same host behind a load-balancer to achieve data parallelism.

The test client is llmperf.

It also includes results for Llama 7b and Mistral v2 deployed on an inf2.48xlarge in a DP3 TP8 configuration (three data-parallel replicas, each using tensor parallelism of degree 8).
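The PR's scripts define the actual topology; as an illustration only, a DP3 setup like this is commonly fronted by a reverse proxy that spreads requests over the three TGI replicas. A minimal nginx sketch, with purely hypothetical port numbers:

```nginx
# Illustrative only: ports and balancing policy are assumptions,
# not taken from this PR's scripts.
upstream tgi_backends {
    least_conn;               # send each request to the least-busy replica
    server 127.0.0.1:8081;    # TGI replica 1 (TP=8)
    server 127.0.0.1:8082;    # TGI replica 2 (TP=8)
    server 127.0.0.1:8083;    # TGI replica 3 (TP=8)
}

server {
    listen 8080;              # single endpoint the benchmark client targets
    location / {
        proxy_pass http://tgi_backends;
    }
}
```

The benchmark client then only ever sees one endpoint, so concurrency sweeps measure the aggregate DP3 throughput.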

@dacorvo dacorvo force-pushed the tgi-benchmark-with-llmperf branch from b6696a4 to b8f310f Compare April 12, 2024 07:02
Member

@michaelbenayoun michaelbenayoun left a comment


Just left a few questions and nits. LGTM

benchmark/text-generation-inference/README.md (two outdated review comments, both resolved)
@@ -0,0 +1,29 @@
#!/bin/bash

model=${1:-NousResearch/Llama-2-7b-chat-hf}
Member


Do we really want a default model for the benchmark?

Collaborator Author


It also serves as an explanation of the args that can be passed.
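The `${1:-default}` form used in the script both supplies a fallback and documents the expected positional argument. A minimal sketch of how the expansion behaves:

```shell
#!/bin/bash
# If no first positional argument is given, fall back to the default model id;
# otherwise $1 overrides it.
model=${1:-NousResearch/Llama-2-7b-chat-hf}
echo "$model"
```

Run without arguments it prints the default model id; any argument passed to the script replaces it.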

Comment on lines 7 to 24
filenames = glob.glob("tgi_bench_results/*/*summary.json")

results = []

for filename in filenames:
with open(filename) as f:
summary = json.load(f)
d = {
"model_id": summary["model"],
"concurrent requests": summary["num_concurrent_requests"],
"throughput (t/s)": summary["results_mean_output_throughput_token_per_s"],
"Time-to-first-token @ P50 (s)": summary["results_ttft_s_quantiles_p50"],
"average latency (ms)": summary["results_inter_token_latency_s_quantiles_p50"] * 1000,
}
results.append(pd.DataFrame.from_dict(d, orient="index").transpose())

df = pd.concat(results).sort_values(by="concurrent requests")
df.to_csv("tgi-results.csv", index=False)
Member


nit: I would just guard that with an if __name__ == "__main__" block.
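The suggested guard keeps the aggregation from running when the module is imported. A minimal sketch of the refactor, assuming the same summary-file layout as the PR; the function name `aggregate` is illustrative, not from the PR:

```python
import glob
import json

import pandas as pd

COLUMNS = ["model_id", "concurrent requests", "throughput (t/s)"]


def aggregate(pattern="tgi_bench_results/*/*summary.json"):
    """Collect llmperf summary files into one DataFrame, sorted by concurrency."""
    rows = []
    for filename in glob.glob(pattern):
        with open(filename) as f:
            summary = json.load(f)
        rows.append(
            {
                "model_id": summary["model"],
                "concurrent requests": summary["num_concurrent_requests"],
                "throughput (t/s)": summary["results_mean_output_throughput_token_per_s"],
            }
        )
    # Passing columns= keeps the frame well-formed even when no files match.
    return pd.DataFrame(rows, columns=COLUMNS).sort_values(by="concurrent requests")


if __name__ == "__main__":
    aggregate().to_csv("tgi-results.csv", index=False)
```

Importing this module now has no side effects; the CSV is only written when the file is executed as a script.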

MODEL_ID='NousResearch/Llama-2-7b-chat-hf'
HF_BATCH_SIZE=32
HF_SEQUENCE_LENGTH=4096
HF_AUTO_CAST_TYPE='fp16'
Member


Is it fp16 or bf16?

Collaborator Author


fp16

dacorvo and others added 2 commits April 12, 2024 11:51
Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
@dacorvo dacorvo merged commit eacf343 into main Apr 12, 2024
@dacorvo dacorvo deleted the tgi-benchmark-with-llmperf branch April 12, 2024 09:56