help understand HELM results #1823

Closed
aniketmaurya opened this issue Sep 5, 2023 · 10 comments
Labels: competition (Support for the NeurIPS Large Language Model Efficiency Challenge), user question

Comments

@aniketmaurya
Contributor

I am running the Llama-2 7B base model on TruthfulQA and get the following results. On inspecting the prediction outputs, I see the results are "unmapped", and I don't understand what that means. Any help here would be deeply appreciated. (I am using the NeurIPS client with Lit-GPT for evaluation.)

[screenshots: TruthfulQA results and prediction outputs]
@msaroufim added the competition (Support for the NeurIPS Large Language Model Efficiency Challenge) label Sep 6, 2023
@yifanmai
Collaborator

yifanmai commented Sep 6, 2023

Sorry, I left an earlier comment that was an incorrect diagnosis.

It looks like your model's output text contains the prompt and the generated text concatenated together. This is not the behavior HELM expects.

When the echo_prompt field is set to true, the output text should contain the prompt and the generated text concatenated together. When echo_prompt is set to false, the output text should contain only the model's generated text, and should not contain the prompt.

In this case, echo_prompt is set to false, but the output text still contains the prompt and the generated text concatenated together. This prevents HELM from extracting the multiple-choice answer from the output text.

To debug this, you can open the output scenario_state.json file and look under request_states[0].result.completions[0].text
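For reference, here is a minimal sketch of how that field can be inspected (the path to the run directory is a placeholder, and the request.prompt field is assumed based on the usual scenario_state.json layout):

import json

# Placeholder path; point this at your own run's output directory.
with open("benchmark_output/runs/my-suite/my-run/scenario_state.json") as f:
    scenario_state = json.load(f)

# Check the first few raw completions to see whether the prompt is echoed back.
for request_state in scenario_state["request_states"][:3]:
    prompt = request_state["request"]["prompt"]
    text = request_state["result"]["completions"][0]["text"]
    print("completion:", repr(text[:200]))
    print("prompt echoed back?", text.startswith(prompt))
    print("-" * 40)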

@aniketmaurya
Contributor Author

Thank you @yifanmai, I was able to fix this by removing the prompt from the output. Another question: is there a benchmark for an open-source model that we can use to cross-check the accuracy of our implementation?
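In case it helps others, a hypothetical sketch of this kind of fix (not the actual NeurIPS client code):

def strip_echoed_prompt(prompt: str, output_text: str, echo_prompt: bool) -> str:
    # If the backend always returns prompt + generation, drop the prompt prefix
    # before handing the text back to HELM (only when echo_prompt is false).
    if not echo_prompt and output_text.startswith(prompt):
        return output_text[len(prompt):]
    return output_text

# Example:
# strip_echoed_prompt("Q: Is water wet?\nA:", "Q: Is water wet?\nA: Yes", echo_prompt=False) -> " Yes"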

@yifanmai
Collaborator

yifanmai commented Sep 7, 2023

For Llama 2 (7B) on NaturalQuestions (1000 evaluation instances, 1 random seed), our internal results are:

Exact Match: 0.28
Exact Match (Robustness): 0.229
Exact Match (Robustness): 0.219

We're hoping to publish this very soon! In the meantime, let me know if there are any more numbers you'd like to look at.

I would expect your results to be slightly different depending on which implementation you use (we use a half-precision implementation).

@aniketmaurya
Contributor Author

hi @yifanmai, thank you for the numbers. May I ask if you have scores for a smaller model like Pythia 1B, or for a smaller scenario that allows quick iterations? (NaturalQuestions took more than 5 hours to evaluate.)

Also, would it be possible for you to share the run_spec configuration, e.g. whether you used data_augmentation?

@aniketmaurya
Contributor Author

aniketmaurya commented Sep 10, 2023

I evaluated Llama-2 7B on QuAC and BoolQ.

For QuAC I get the following (it should have been around 39.7 based on benchmarks):
[screenshot: QuAC results]

@aniketmaurya
Contributor Author

aniketmaurya commented Sep 10, 2023

For BoolQ I get a score of 0, and on checking the BoolQ raw results, it seems like there is still some issue with the scoring.

[screenshots: BoolQ results and raw predictions]

@yifanmai
Collaborator

yifanmai commented Sep 12, 2023

Could you send me the scenario_state.json file from the BoolQ run? If you are able to send me the JSON of the first few elements of the request_states field as a Gist, that would be even better.

The very likely cause is that the end of sequence token </s> was appended to the generated text, which causes exact match to fail.
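If that is the cause, a minimal sketch of the kind of cleanup the client could do before returning the text (the "</s>" literal is just the common Llama EOS string; adjust it for your tokenizer):

def strip_eos(text: str, eos_token: str = "</s>") -> str:
    # HELM compares the returned text against the reference, so a trailing
    # end-of-sequence marker makes otherwise-correct answers fail exact match.
    text = text.rstrip()
    if text.endswith(eos_token):
        text = text[: -len(eos_token)].rstrip()
    return text

# Example: strip_eos("Yes</s>") -> "Yes"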

I don't have Pythia (1B) results yet, but I can do a run on a few scenarios and send you the results. Let me know which scenarios you are interested in.

I'd also suggest using MMLU for debugging and for replicating previously published numbers, using the five MMLU subjects that HELM selects:

entries: [
  {description: "mmlu:model=lightningai/lit-gpt,subject=abstract_algebra", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=college_chemistry", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=computer_security", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=econometrics", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=us_foreign_policy", priority: 2}
]

This is because MMLU tends to be a lot cheaper and faster to run, due to having short questions and answers.

The average MMLU score will usually be within a percentage point of the Hugging Face leaderboard despite the evaluation protocol differences. You can run the Hugging Face version locally using HELM's Hugging Face Hub integration and compare it against the Lit-GPT version, which lets you compare both the aggregate results and the raw generations.
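As a rough way to line the two runs up, here is a sketch that diffs the raw generations from the two scenario_state.json files (the paths are placeholders, and the field names follow the scenario_state.json layout mentioned above):

import json

def load_completions(path):
    # Map each prompt to the first completion text for that request.
    with open(path) as f:
        state = json.load(f)
    return {
        rs["request"]["prompt"]: rs["result"]["completions"][0]["text"]
        for rs in state["request_states"]
    }

hf = load_completions("runs/mmlu-hf/scenario_state.json")          # placeholder path
litgpt = load_completions("runs/mmlu-litgpt/scenario_state.json")  # placeholder path

# Print the prompts where the two implementations disagree.
for prompt, hf_text in hf.items():
    if prompt in litgpt and litgpt[prompt] != hf_text:
        print("prompt   :", repr(prompt[-80:]))
        print("  HF     :", repr(hf_text[:80]))
        print("  Lit-GPT:", repr(litgpt[prompt][:80]))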

@aniketmaurya
Contributor Author

Thanks for your input @yifanmai! I ran HF locally on MMLU and compared it with Lit-GPT. The numbers match closely, so I think the implementation is almost there. I will also attach the scenario_state.json soon.

HF
[screenshot: MMLU results with the Hugging Face implementation]

Lit-GPT
[screenshot: MMLU results with the Lit-GPT implementation]

@aniketmaurya
Contributor Author

aniketmaurya commented Sep 15, 2023

hi @yifanmai, please find the attached link to the scenario_state.json file for BoolQ.
I see one issue: the stop token \n does not break the prediction loop, since the LLM produces \n\n, which has a different token value from a single \n.
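To see what the tokenizer actually does with the stop sequence, a quick check like this (the checkpoint name is only an ungated example; swap in the tokenizer your Lit-GPT model uses):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # example only; use your model's tokenizer

# With some BPE vocabularies "\n\n" is a single merged token rather than two
# "\n" tokens, so a stop check that compares individual generated token ids
# against the id of "\n" never fires.
print(tok.encode("\n"))
print(tok.encode("\n\n"))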

@yifanmai
Collaborator

Closing due to staleness; feel free to reopen if you have further questions.

For the record, if the model runs into this issue, I would recommend manually truncating the response in the client, using either string matching or truncate_and_tokenize_response_text() (example).
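For anyone else landing here, a minimal sketch of the string-matching variant (this is not HELM's truncate_and_tokenize_response_text() implementation, just the idea):

def truncate_at_stop_sequences(text: str, stop_sequences: list[str]) -> str:
    # Cut the response at the earliest occurrence of any stop sequence.
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Example: truncate_at_stop_sequences("Yes\n\nQuestion: ...", ["\n"]) -> "Yes"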
