Evaluation
The HumanEval benchmark is used as the evaluation set in the work Evaluating Large Language Models Trained on Code. It comprises 164 hand-written programming problems. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem. It aims to evaluate functional correctness, unlike the standard BLEU or CodeBLEU metrics, which compare the generated code to the ground-truth code syntactically and semantically. We benchmark our models on this evaluation set below. OpenAI evaluates generated code against the provided unit tests with the pass@k metric: k samples are generated per problem, a problem counts as solved if any of the k samples passes all of its unit tests, and pass@k is the fraction of problems solved. We follow the same procedure as OpenAI and use the same temperature (0.8) and top-p value for nucleus sampling (0.95). However, due to compute and time constraints, we only evaluate the generated code with k set to 1, 2, 5, and 10.
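As a reference, the unbiased pass@k estimator described in that paper computes, per problem, the probability that at least one of k samples drawn from n generated samples is correct, given that c of the n samples passed all unit tests. A minimal sketch (the function name is ours):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k), computed in a
    # numerically stable product form.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 10 samples per problem, 3 of which passed, estimating pass@5.
print(pass_at_k(n=10, c=3, k=5))  # about 0.917

The reported score is this quantity averaged over all 164 problems.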
The performance of each model we evaluated is shown on the Models page.
To reproduce our evaluation on the HumanEval benchmark, please follow the installation instructions for HumanEval, clone our repository, and install the packages in requirements.txt. Then, run the following command:
cd evaluation
python evaluate.py --model_name_or_path=<model_name_or_path> --human_eval_path=<path/to/human-eval/data/HumanEval.jsonl.gz> --out_path=./model_results
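Depending on your setup, the generated samples can also be scored directly with HumanEval's own harness, assuming they are written out in its standard samples.jsonl format of task_id/completion records:

evaluate_functional_correctness <path/to/samples.jsonl>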
The APPS benchmark is similar to the HumanEval benchmark, but its problems are sourced from programming competitions rather than being hand-crafted. The format also differs: each problem has a description written in plain natural language (i.e. no docstrings) along with some input/output examples, and may include starter code, such as an initial function or part of a function, to help solve the problem. The model is expected to generate a function, or finish the starter code, to solve the problem. The generated code is then run against several unit tests and checked for correctness. The benchmark reports several metrics, but the most important is the accuracy of the generated code, i.e. the fraction of problems for which all tests passed.
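As an illustration (our own simplified sketch, not the official APPS harness), checking a standard-input problem amounts to running the candidate program on each test input and comparing its output to the expected output:

import subprocess

def passes_all_tests(solution_path, tests, timeout=4.0):
    # tests is a list of (stdin_text, expected_stdout) pairs.
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                ["python", solution_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.stdout.strip() != expected.strip():
            return False
    return True

# Hypothetical usage with a generated solution and a single test case:
# passes_all_tests("generated_solution.py", [("1 2\n", "3\n")])

The official harness additionally supports call-based problems and reports finer-grained, per-test-case metrics.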
We also attempted to evaluate the models we trained on the APPS benchmark. However, we ran into issues with validating functional correctness. We hope to fix this and evaluate our models on the APPS benchmark in the future.