help understand HELM results #1823

Closed
aniketmaurya opened this issue Sep 5, 2023 · 10 comments
Labels: competition (Support for the NeurIPS Large Language Model Efficiency Challenge), user question

Comments

@aniketmaurya
Contributor

I am running the Llama-2 7B base model on TruthfulQA and get the following results. On inspecting the prediction outputs, I see the results are "unmapped", and I don't understand what that means. Any help here would be deeply appreciated. (I am using the NeurIPS client with Lit-GPT for evaluation.)

[screenshots: TruthfulQA results and prediction outputs]
@msaroufim added the competition (Support for the NeurIPS Large Language Model Efficiency Challenge) label Sep 6, 2023
@yifanmai
Collaborator

yifanmai commented Sep 6, 2023

Sorry, I left an earlier comment that was an incorrect diagnosis.

It looks like your model's output text contains the prompt and the generated text concatenated together. This is not the behavior HELM expects.

When the echo_prompt field is set to true, the output text should contain the prompt and the generated text concatenated together. When echo_prompt is set to false, the output text should contain only the model's generated text, and should not contain the prompt.

In this case, echo_prompt is set to false, but the output text still contains the prompt and the generated text concatenated together. This prevents HELM from extracting the multiple-choice answer from the output text.

To debug this, you can open the output scenario_state.json file and look under request_states[0].result.completions[0].text
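For reference, here is a minimal sketch of how that field can be inspected (the path to the run directory is a placeholder, and the request.prompt field is assumed based on the usual scenario_state.json layout):

import json

# Placeholder path; point this at your own run's output directory.
with open("benchmark_output/runs/my-suite/my-run/scenario_state.json") as f:
    scenario_state = json.load(f)

# Check the first few raw completions to see whether the prompt is echoed back.
for request_state in scenario_state["request_states"][:3]:
    prompt = request_state["request"]["prompt"]
    text = request_state["result"]["completions"][0]["text"]
    print("completion:", repr(text[:200]))
    print("prompt echoed back?", text.startswith(prompt))
    print("-" * 40)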

@aniketmaurya
Contributor Author

Thank you @yifanmai, I was able to fix this by removing the prompt from the output. Another question: is there a benchmark for an open-source model that we can use to cross-check the accuracy of our implementation?
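In case it helps others, a hypothetical sketch of this kind of fix (not the actual NeurIPS client code):

def strip_echoed_prompt(prompt: str, output_text: str, echo_prompt: bool) -> str:
    # If the backend always returns prompt + generation, drop the prompt prefix
    # before handing the text back to HELM (only when echo_prompt is false).
    if not echo_prompt and output_text.startswith(prompt):
        return output_text[len(prompt):]
    return output_text

# Example:
# strip_echoed_prompt("Q: Is water wet?\nA:", "Q: Is water wet?\nA: Yes", echo_prompt=False) -> " Yes"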

@yifanmai
Collaborator

yifanmai commented Sep 7, 2023

For Llama 2 (7B) on NaturalQuestions (1000 evaluation instances, 1 random seed), our internal results are:

Exact Match: 0.28
Exact Match (Robustness): 0.229
Exact Match (Robustness): 0.219

We're hoping to publish this very soon! In the meantime, let me know if there are any more numbers you'd like to look at.

I would expect your results to be slightly different depending on which implementation you use (we use a half-precision implementation).

@aniketmaurya
Contributor Author

hi @yifanmai, thank you for the numbers. May I ask if you have scores for a smaller model like Pythia 1B, or for a smaller scenario that allows quick iterations? (NaturalQuestions took more than 5 hours to evaluate.)

Also, would it be possible for you to share the run_spec configuration, e.g. whether you used data_augmentation?

@aniketmaurya
Contributor Author

aniketmaurya commented Sep 10, 2023

I evaluated Llama-2 7B on QuAC and BoolQ.

For QuAC I get the following (it should have been around 39.7 based on benchmarks):
[screenshot: QuAC results]

@aniketmaurya
Contributor Author

aniketmaurya commented Sep 10, 2023

For BoolQ I get a score of 0, and on checking the BoolQ raw results, it seems like there is still some issue with the scoring.

[screenshots: BoolQ results and raw predictions]

@yifanmai
Collaborator

yifanmai commented Sep 12, 2023

Could you send me the scenario_state.json file from the BoolQ run? If you are able to send me the JSON of the first few elements of the request_states field as a Gist, that would be even better.

The very likely cause is that the end of sequence token </s> was appended to the generated text, which causes exact match to fail.
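If that is the cause, a minimal sketch of the kind of cleanup the client could do before returning the text (the "</s>" literal is just the common Llama EOS string; adjust it for your tokenizer):

def strip_eos(text: str, eos_token: str = "</s>") -> str:
    # HELM compares the returned text against the reference, so a trailing
    # end-of-sequence marker makes otherwise-correct answers fail exact match.
    text = text.rstrip()
    if text.endswith(eos_token):
        text = text[: -len(eos_token)].rstrip()
    return text

# Example: strip_eos("Yes</s>") -> "Yes"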

I don't have Pythia (1B) results yet, but I can do a run on a few scenarios and send you the results. Let me know which scenarios you are interested in.

I'd also suggest using MMLU for debugging and for replicating previously published numbers, using the five MMLU subjects that HELM selects:

entries: [
  {description: "mmlu:model=lightningai/lit-gpt,subject=abstract_algebra", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=college_chemistry", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=computer_security", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=econometrics", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=us_foreign_policy", priority: 2}
]

This is because MMLU tends to be a lot cheaper and faster to run, due to having short questions and answers.

The average MMLU score will usually be within a percentage point of the Hugging Face leaderboard despite the evaluation protocol differences. You can run the Hugging Face version locally using HELM's Hugging Face Hub integration and compare it against the Lit-GPT version, which lets you compare both the aggregate results and the raw generations.
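As a rough way to line the two runs up, here is a sketch that diffs the raw generations from the two scenario_state.json files (the paths are placeholders, and the field names follow the scenario_state.json layout mentioned above):

import json

def load_completions(path):
    # Map each prompt to the first completion text for that request.
    with open(path) as f:
        state = json.load(f)
    return {
        rs["request"]["prompt"]: rs["result"]["completions"][0]["text"]
        for rs in state["request_states"]
    }

hf = load_completions("runs/mmlu-hf/scenario_state.json")          # placeholder path
litgpt = load_completions("runs/mmlu-litgpt/scenario_state.json")  # placeholder path

# Print the prompts where the two implementations disagree.
for prompt, hf_text in hf.items():
    if prompt in litgpt and litgpt[prompt] != hf_text:
        print("prompt   :", repr(prompt[-80:]))
        print("  HF     :", repr(hf_text[:80]))
        print("  Lit-GPT:", repr(litgpt[prompt][:80]))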

@aniketmaurya
Contributor Author

Thanks for your input @yifanmai! I ran HF locally on MMLU and compared it with Lit-GPT. The numbers match closely, so I think the implementation is almost there. I will also attach the scenario_state.json soon.

HF
[screenshot: MMLU results with the Hugging Face implementation]

Lit-GPT
[screenshot: MMLU results with the Lit-GPT implementation]

@aniketmaurya
Contributor Author

aniketmaurya commented Sep 15, 2023

hi @yifanmai, please find the attached link to the scenario_state.json file for BoolQ.
I see one issue: the stop token \n does not break the prediction loop, since the LLM produces \n\n, which has a different token value from a single \n.
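To see what the tokenizer actually does with the stop sequence, a quick check like this (the checkpoint name is only an ungated example; swap in the tokenizer your Lit-GPT model uses):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # example only; use your model's tokenizer

# With some BPE vocabularies "\n\n" is a single merged token rather than two
# "\n" tokens, so a stop check that compares individual generated token ids
# against the id of "\n" never fires.
print(tok.encode("\n"))
print(tok.encode("\n\n"))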

@yifanmai
Collaborator

Closing due to staleness; feel free to reopen if you have further questions.

For the record, if the model runs into this issue, I would recommend manually truncating the response in the client, using either string matching or truncate_and_tokenize_response_text() (example).
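For anyone else landing here, a minimal sketch of the string-matching variant (this is not HELM's truncate_and_tokenize_response_text() implementation, just the idea):

def truncate_at_stop_sequences(text: str, stop_sequences: list[str]) -> str:
    # Cut the response at the earliest occurrence of any stop sequence.
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Example: truncate_at_stop_sequences("Yes\n\nQuestion: ...", ["\n"]) -> "Yes"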
