It is important to evaluate the performance of your LLM using a single metric. This allows you to quickly iterate on prompts, RAG configurations, and fine-tuned models.
Create a test set by selecting a set of prompts and their corresponding answers. A good test set has between 20 and 100 examples. More examples increase the accuracy of the evaluation, but a larger set takes longer for a human to write and review.
We include an example test set at ./golden_test_set.jsonl.
{"ticker": "CENT", "date": "Aug 4, 2021, 4:35 p.m. ET", "q": "2021-Q3", "question": "What is the optimal leverage range for the company in the event of M&A", "answer": " The optimal leverage range for the company in the event of M&A is between 3 to 3.5 times. For the right deal, the company would be willing to lever up into the low 4s, and then quickly deliver back down to that three to 3.5 range.", "has_value": true, "value": 3.5, "units": "times"}
This test set checks that the answer generated by the model is correct by comparing the value and units extracted from the answer against the reference. It also uses another LLM to check whether the text answer is similar to the reference answer.
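A minimal sketch of those two checks is shown below. The field names match the JSONL entry above; `llm_judge` is a hypothetical helper standing in for whatever LLM-as-a-judge call eval.py actually makes:

```python
# Sketch of the two checks, assuming the JSONL fields shown above.
import math

def check_value(example: dict, predicted_value: float, predicted_units: str) -> bool:
    """Exact check: the numeric value and units must match the reference."""
    if not example["has_value"]:
        return True  # nothing to compare for qualitative questions
    return (
        math.isclose(predicted_value, example["value"], rel_tol=1e-3)
        and predicted_units.strip().lower() == example["units"].strip().lower()
    )

def check_similarity(example: dict, predicted_answer: str) -> bool:
    """Fuzzy check: ask a judge LLM whether the two answers say the same thing."""
    prompt = (
        "Do these two answers convey the same information? Reply YES or NO.\n"
        f"Reference: {example['answer']}\nCandidate: {predicted_answer}"
    )
    # llm_judge is a hypothetical function wrapping whichever model you use as a judge.
    return llm_judge(prompt).strip().upper().startswith("YES")
```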
In addition to a custom test set, you can also use standard metrics to evaluate your model.
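For example, a common standard metric for question answering is token-level F1 between the generated answer and the reference answer (as used in SQuAD-style evaluation). A minimal sketch:

```python
# Sketch: SQuAD-style token-level F1 between a prediction and a reference answer.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("between 3 and 3.5 times", "3 to 3.5 times"))
```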
Run on the example test set:
cd 02_eval
python3 eval.py
You can view the detailed results at data/results/earnings_meta-llama_Meta-Llama-3-8B-Instruct_results.json
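If you want to post-process the results yourself, the JSON file can be loaded directly. The per-example `correct` field below is an assumption about the results schema; adjust it to whatever eval.py actually writes:

```python
# Sketch: summarizing the results file. The "correct" key is an assumed field name.
import json

path = "data/results/earnings_meta-llama_Meta-Llama-3-8B-Instruct_results.json"
with open(path) as f:
    results = json.load(f)

# Assuming results is a list of per-example records with a boolean "correct" field.
accuracy = sum(r["correct"] for r in results) / len(results)
print(f"{accuracy:.1%} of {len(results)} examples answered correctly")
```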
HELM (Holistic Evaluation of Language Models) is a popular benchmark for evaluating the performance of large language models (LLMs) on natural language processing tasks. It is a suite of tests that assess a model's language understanding and generation capabilities.
The HELM benchmark includes tasks that evaluate a model's ability to:
- Answer questions: The model is asked to answer questions based on a given passage or text.
- Generate text: The model is asked to generate text based on a prompt or topic.
- Summarize text: The model is asked to summarize a given passage or text.
- Fill in the blanks: The model is asked to fill in the blanks in a sentence or paragraph with the most likely word or phrase.
HELM is designed to test a model's ability to understand and generate human-like language, and it is widely used in the natural language processing (NLP) community to evaluate and compare LLMs.
The benchmark is backed by the open-source HELM framework from Stanford CRFM, which provides the tooling for running these evaluations, making it a valuable way for researchers and developers to measure their models against others in the field.