Free generation utils in Eleuther #975
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/975
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 67b3fcf with merge base f3611e5.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Codecov Report

Additional details and impacted files:

@@           Coverage Diff            @@
##             main     #975       +/-   ##
===========================================
- Coverage   67.10%   26.74%   -40.36%
===========================================
  Files         174      174
  Lines        7423     7451       +28
===========================================
- Hits         4981     1993     -2988
- Misses       2442     5458     +3016

View full report in Codecov by Sentry.
with context.device:
    self._model.setup_caches(batch_size=curr_batch_size, dtype=self._dtype)

temperature = generation_kwargs.get("temperature", 0.0)
Is this a reasonable default for temperature?
This is the most common default for temperature because it greedily takes the best (highest-probability) token.
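For illustration, here is a minimal sketch of how a temperature of 0.0 typically maps to greedy decoding. The helper below is hypothetical and not part of this PR or torchtune's API:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.0) -> torch.Tensor:
    """Illustrative sampling step. logits has shape [batch_size, vocab_size]."""
    if temperature == 0.0:
        # Greedy decoding: always pick the highest-probability token.
        return torch.argmax(logits, dim=-1)
    # Otherwise, scale logits by the temperature and sample from the softmax.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```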
@@ -53,13 +54,19 @@ def __init__(
     *,
     device: torch.device,
     max_seq_length: int = 4096,
-    batch_size: int = 32,
+    batch_size: int = 8,
Why this change? Is it for memory reasons?
yup
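For a rough sense of why a smaller batch size helps with memory, here is a back-of-the-envelope KV-cache estimate. The model dimensions (Llama2-7B-like) and bf16 dtype are assumptions for illustration, not values taken from this PR:

```python
def kv_cache_bytes(batch_size, max_seq_len=4096, n_layers=32, n_kv_heads=32,
                   head_dim=128, bytes_per_elem=2):  # 2 bytes per element for bf16
    # Keys and values are each [batch, n_kv_heads, max_seq_len, head_dim] per layer,
    # hence the leading factor of 2.
    return 2 * n_layers * batch_size * n_kv_heads * max_seq_len * head_dim * bytes_per_elem

print(f"batch_size=32: {kv_cache_bytes(32) / 2**30:.1f} GiB")  # ~64 GiB
print(f"batch_size=8:  {kv_cache_bytes(8) / 2**30:.1f} GiB")   # ~16 GiB
```

Under these assumed dimensions, dropping the default from 32 to 8 at max_seq_length 4096 cuts the cache footprint from roughly 64 GiB to 16 GiB.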
@@ -31,6 +31,7 @@ seed: 217
 tasks: ["truthfulqa_mc2"]
 limit: null
 max_seq_length: 4096
+batch_size: 2
?
Yeah, sorry, this should match the default.
Context

In order to support free generation tasks like GSM8K - which is part of the suite comprising the OpenLLM Leaderboard - we need to implement our own _model_generate function. This PR adds that capability and tests it on GSM8K.

Why do you set up caches before each generate call? Doesn't that somewhat defeat the purpose?

Great question, dear reader! A totally valid approach would be to call setup_caches right after model instantiation and then call reset_caches after each batched generate call. However, this would also mean that the default model would have KV caches enabled, and these would be used for the normal _model_call. This is overkill and also requires some additional logic, because if the batch size changes (common for the last batch in a dataset), we have to call setup_caches again. Therefore, it's simpler in code to just call setup_caches before each generate call. This still provides a performance benefit, since we are not recomputing the keys and values across every sequence for every batch.

Why is there a discrepancy between Eleuther's Phi3 eval on GSM8K and ours?

We already acknowledged there were slight computational differences between our Phi3 implementation and HF's due to the split QKV dimension. I think this is the closest we can get, but I'll check that it looks reasonable to the Eleuther team. If this is a common thing, we should maybe consider defaulting to a delta calculation between the original model and the user's finetuned model. This would be the most accurate way to represent how the finetuning went and wouldn't confuse users as to why the numbers don't match Eleuther or the OpenLLM Leaderboard. The downside is that this would take twice the amount of time. Curious to hear thoughts from @ebsmothers and @kartikayk.
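To make the cache handling concrete, here is a minimal sketch of the pattern described above. It is illustrative only: the function and parameter names, the bf16 default, and the injected generate_fn callable are assumptions, not the exact interface added in this PR.

```python
import torch

def model_generate(model, context, *, max_gen_tokens, generate_fn,
                   dtype=torch.bfloat16, **generation_kwargs):
    """Sketch of one batched free-generation call.

    context: [curr_batch_size, prompt_len] tensor of already-encoded prompts.
    generate_fn: whichever decoding utility actually produces the tokens.
    """
    curr_batch_size = context.size(0)
    # Temperature of 0.0 means greedy decoding (see the discussion above).
    temperature = generation_kwargs.get("temperature", 0.0)

    # Set up KV caches for exactly this batch size, right before generating.
    # Doing it per call keeps the normal _model_call path cache-free and
    # avoids extra reset/re-setup bookkeeping when the final batch is smaller.
    with context.device:
        model.setup_caches(batch_size=curr_batch_size, dtype=dtype)

    with torch.no_grad():
        return generate_fn(model, context,
                           max_generated_tokens=max_gen_tokens,
                           temperature=temperature)
```

The trade-off is a small amount of repeated setup work per batch in exchange for keeping the forward-only _model_call path completely unaware of caches.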
Changelog

- Adds a _model_generate function
- Adds a tok_batch_encode function, which is necessary because _model_generate takes in batched generate requests (see the sketch below)
- Adds a batch_size param to the config
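For reference, a minimal sketch of what a batched tokenization helper along these lines might look like. This is illustrative only: the padding side, return types, and tokenizer interface are assumptions, not the exact implementation in this PR.

```python
from typing import List, Tuple
import torch

def tok_batch_encode_sketch(
    tokenizer,                # assumed to expose encode(str, add_bos, add_eos) -> List[int]
    prompts: List[str],
    pad_id: int = 0,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Encode a batch of prompts into a left-padded [batch, max_len] token tensor.

    Left padding keeps the prompt's final tokens adjacent to the generated ones,
    which is the usual convention for batched decoder-only generation.
    """
    encoded = [tokenizer.encode(p, add_bos=True, add_eos=False) for p in prompts]
    max_len = max(len(toks) for toks in encoded)

    tokens = torch.full((len(encoded), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros((len(encoded), max_len), dtype=torch.long)
    for i, toks in enumerate(encoded):
        tokens[i, max_len - len(toks):] = torch.tensor(toks, dtype=torch.long)
        mask[i, max_len - len(toks):] = 1
    return tokens, mask
```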
Test plan

Eleuther Eval results (in 27m 28s):

Our results (in 25m 10s):