help understand HELM results #1823
Sorry, I left an earlier comment that was an incorrect diagnosis. It looks like your model's output text contains the prompt and the generated text concatenated together. This is not the behavior HELM expects; the client should return only the newly generated text. To debug this, you can look at the raw output of the run.
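For later readers, a minimal sketch of the fix discussed here, stripping the echoed prompt before returning the completion to HELM; the helper below is hypothetical and not part of HELM's API:

```python
def strip_prompt(prompt: str, output_text: str) -> str:
    """Return only the newly generated text when the backend echoes the prompt.

    Hypothetical helper for a custom client; HELM itself does not ship this.
    """
    if output_text.startswith(prompt):
        return output_text[len(prompt):]
    return output_text


# Example: a backend that returns the prompt and completion concatenated together.
prompt = "Question: Who wrote Hamlet?\nAnswer:"
raw_output = prompt + " William Shakespeare"
print(repr(strip_prompt(prompt, raw_output)))  # ' William Shakespeare'
```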
Thank you @yifanmai, I was able to fix this by removing the prompt from the output. Another question: is there a benchmark result for an open-source model that we can use to cross-check our implementation's accuracy?
For Llama 2 (7B) on NaturalQuestions (1000 evaluation instances, 1 random seed), our internal results are:

Exact Match: 0.28

We're hopefully publishing this very soon! In the meantime, let me know if there are any more numbers you'd like to look at. I would expect your results to be slightly different depending on which implementation you use (we use a half-precision implementation).
Hi @yifanmai, thank you for the numbers. May I ask if you have scores for a smaller model like Pythia 1B, or for a smaller scenario that allows quick iterations (NaturalQuestions took more than 5 hours to evaluate)? Also, would it be possible for you to share the run_spec configuration you used?
I evaluated Llama-2 7B on QuAC and BoolQ. For QuAC, the score I get does not match what I expected (it should have been around 39.7 based on published benchmarks).
Could you send me that output? The very likely cause is that the end-of-sequence token is not being handled correctly. I don't have Pythia (1B) results yet, but I can do a run on a few scenarios and send you the results; let me know which scenarios you are interested in. I'd also suggest using MMLU for debugging and replicating previously published numbers, using HELM's selected five subject subsets.
This is because MMLU tends to be a lot cheaper and faster to run, due to having short questions and answers. The average MMLU score will usually be within a few percentage points of the Hugging Face leaderboard despite the differences in evaluation protocol. You can run the Hugging Face version locally using HELM's Hugging Face Hub integration and compare it against the Lit-GPT version, which will let you compare both the aggregate results and the raw generations.
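As a rough sketch of that comparison, one might diff the raw generations from the two runs like this; the output paths and JSON field names below (scenario_state.json, request_states, completions) are assumptions about HELM's run output layout, so adjust them to whatever your HELM version actually writes:

```python
import json
from pathlib import Path

# Assumed layout: each HELM run writes a scenario_state.json under
# benchmark_output/runs/<suite>/<run_name>/. Both the paths and the JSON
# field names here are assumptions; adjust them to your HELM version.
HF_RUN = Path("benchmark_output/runs/hf-suite/mmlu-run/scenario_state.json")
LITGPT_RUN = Path("benchmark_output/runs/litgpt-suite/mmlu-run/scenario_state.json")


def load_generations(path: Path) -> list[str]:
    """Extract the first completion text from each request state."""
    state = json.loads(path.read_text())
    return [rs["result"]["completions"][0]["text"] for rs in state["request_states"]]


hf_outputs = load_generations(HF_RUN)
litgpt_outputs = load_generations(LITGPT_RUN)

# Show the instances where the two implementations disagree.
for i, (a, b) in enumerate(zip(hf_outputs, litgpt_outputs)):
    if a != b:
        print(f"instance {i}:\n  HF:      {a!r}\n  Lit-GPT: {b!r}")
```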
Thanks for your input @yifanmai! I ran the Hugging Face version locally on MMLU and compared it with Lit-GPT. The numbers match closely, so I think the implementation is almost there. I will also attach the output for reference.
Closing due to staleness; feel free to reopen if you have further questions. For the record, if the model runs into this issue, I would recommend manually truncating the response in the client, for example by string matching on the end-of-sequence token.
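A minimal sketch of that kind of manual truncation inside a custom client; the stop strings used here are illustrative assumptions, so substitute whatever end-of-sequence token or stop sequences your model actually emits:

```python
def truncate_response(text: str, stop_strings: list[str]) -> str:
    """Cut the generated text at the first occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Example stop strings (assumptions): Llama's EOS marker and a blank line.
print(truncate_response("Paris</s>Question: What is ...", ["</s>", "\n\n"]))  # Paris
```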
I am running the Llama-2 7B base model on TruthfulQA, and on inspecting the prediction outputs I see results reported as "unmapped", which I don't understand. Any help here would be deeply appreciated. (I am using the Neurips client with Lit-GPT for evaluation.)