[Misc][Benchmarking] Enable benchmarks to create request from file #3530
Conversation
```python
per_token_latencies.append(
    (outputs[i].latency - outputs[i].ttft) / output_len)
```
Does this change make sense?
Yep! In #3277 what I did was:

```python
if output_len > 1:
    tpots.append((outputs[i].latency - outputs[i].ttft) / (output_len - 1))
```

I've also been thinking of making
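For reference, a self-contained sketch of that calculation (the helper name and the example numbers are illustrative, not from either PR). The point of the `output_len - 1` denominator is that the first token's generation time is already counted in the TTFT:

```python
from typing import Optional

def tpot(latency: float, ttft: float, output_len: int) -> Optional[float]:
    """Time per output token: the first token is covered by ttft, so only
    the remaining (output_len - 1) tokens count toward inter-token latency."""
    if output_len > 1:
        return (latency - ttft) / (output_len - 1)
    return None  # undefined for single-token outputs

print(tpot(latency=2.5, ttft=0.5, output_len=101))  # -> 0.02 s/token
```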
Hey @ElizaWszola! Thank you for the PR, and we actually shared a lot of similar ideas! Some changes in this PR have already been addressed (e.g. the more accurate TPOT calculation). I also personally really like the idea of the data registry from #3431, but we can probably discuss it later when we really need it.
```python
system_message = {
    "content": "You are a chatbot with the explicit goal of "
    "helping the user as best as possible",
    "role": "system",
}
```
FYI this actually won't work with Mixtral since it doesn't have a "system" role.
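As a hedged aside, one common workaround for chat templates that reject a "system" role is to fold the system text into the first user turn. A minimal sketch (the helper is hypothetical, not part of this PR):

```python
def fold_system_into_user(messages):
    """For chat templates without a "system" role (e.g. Mixtral's), prepend
    the system content to the first user message instead."""
    if not messages or messages[0]["role"] != "system":
        return messages
    system, rest = messages[0], list(messages[1:])
    if rest and rest[0]["role"] == "user":
        rest[0] = {
            "role": "user",
            "content": system["content"] + "\n\n" + rest[0]["content"],
        }
        return rest
    return [{"role": "user", "content": system["content"]}] + rest

msgs = [
    {"role": "system", "content": "You are a chatbot with the explicit goal "
                                  "of helping the user as best as possible"},
    {"role": "user", "content": "Hello!"},
]
print(fold_system_into_user(msgs))
```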
```python
break

# Sample num_requests from the list.
assert dataset_args.num_samples <= len(dataset)
```
In this case it won't be possible to run certain long-duration or high request-rate sessions if the dataset is too small, but we probably don't want to modify the dataset itself either. Perhaps we should do sampling with replacement instead (i.e. `random.choices()`)?
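A minimal sketch of the difference, with an illustrative stand-in dataset (the variable names are assumptions, not the PR's actual code):

```python
import random

dataset = [f"prompt-{i}" for i in range(100)]  # stand-in for the real dataset
num_samples = 500  # more requests than the dataset has entries

# Without replacement, the sample size is capped by the dataset size:
# random.sample(dataset, num_samples) would raise ValueError here.

# With replacement, num_samples can exceed len(dataset), which is what
# long-duration / high request-rate sessions need:
sampled = random.choices(dataset, k=num_samples)
assert len(sampled) == num_samples
```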
@ywang96 Thanks for pointing this out! The benchmarking changes did travel from your PR, to the ordered-dict evictor testing, to this one, and I somehow missed it. I think it makes more sense to close this PR and work on yours. What do you think?
Yep of course! Feel free to review my PR, thanks!
@ywang96 Alright, I'm closing this now. Sorry for the confusion on my side!
Enable benchmarks that read a text file and repeatedly send its contents to the server as requests. Benchmarking this way yields a best-case / upper-bound scenario for request caching.
This PR was split from #3431 and was used to create the Sonnet dataset results.
To create requests from a text file, use the arguments `--dataset path/to/dataset --request-from-text`. This will make the benchmark repeatedly send the text file to the server, resulting in a best-case caching scenario. Example:
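(A sketch of such an invocation, assuming vLLM's serving benchmark script; the script path, the model placeholder, and any flags other than `--dataset` and `--request-from-text` are assumptions, not taken from this PR.)

```bash
python benchmarks/benchmark_serving.py \
    --model <model-name> \
    --dataset path/to/sonnet.txt \
    --request-from-text
```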