Aggregation of Reasoning (AoR): A Hierarchical Framework for Enhancing Answer Selection in Large Language Models (LREC-COLING 2024)
This repository contains the code and data related to the paper "Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models". In this paper, we introduce AoR (Aggregation of Reasoning), a hierarchical reasoning aggregation framework designed to enhance the reasoning capabilities of Large Language Models (LLMs) in complex tasks. Unlike traditional methods that rely on answer frequency from multiple reasoning chains, AoR evaluates reasoning chains themselves to select the most accurate answers. Additionally, AoR employs dynamic sampling to adjust the number of reasoning chains based on task complexity.
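At a high level, AoR samples a set of reasoning chains, scores the chains within each answer group (local scoring), compares the best representatives across groups (global evaluation), and samples additional chains only when no answer is convincing enough (dynamic sampling). The sketch below illustrates that control flow only; the helper functions, data layout, and termination criterion are simplifications rather than the implementation in code/aor.py.

from collections import defaultdict

def aor_sketch(question, sample_chains, score_locally, evaluate_globally,
               initial_sample_size=20, max_sample_size=40,
               additional_sample_size=5, representative_count=2,
               termination_threshold=9.0):
    # Each reasoning chain is assumed to be a dict with "reasoning" and "answer" keys.
    chains = sample_chains(question, initial_sample_size)
    while True:
        # Local scoring: group chains by their final answer and keep the
        # highest-scoring representatives of each answer group.
        groups = defaultdict(list)
        for chain in chains:
            groups[chain["answer"]].append(chain)
        representatives = []
        for group in groups.values():
            group.sort(key=lambda c: score_locally(question, c), reverse=True)
            representatives.extend(group[:representative_count])
        # Global evaluation: compare representatives of different answers against
        # each other and pick the most convincing chain.
        scored = [(evaluate_globally(question, c, representatives), c) for c in representatives]
        best_score, best_chain = max(scored, key=lambda pair: pair[0])
        # Dynamic sampling: stop once the winner clears the termination threshold
        # or the sampling budget is exhausted; otherwise request more chains.
        if best_score >= termination_threshold or len(chains) >= max_sample_size:
            return best_chain["answer"]
        chains = chains + sample_chains(question, additional_sample_size)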
Please ensure you have the following requirements installed:
openai >= 1.14.0
backoff
tenacity
jsonlines
ipdb
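These can be installed with pip, for example:

pip install "openai>=1.14.0" backoff tenacity jsonlines ipdb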
Our dataset originates from Large Language Models are Zero-Shot Reasoners, generously shared by Takeshi Kojima. We employ prompts from Chain-of-Thought Prompting Elicits Reasoning in Large Language Models to guide models in generating initial reasoning processes.
Additionally, we share some of the outputs from the Local Scoring, Global Evaluation, and Dynamic Sampling phases through Google Drive.
We provide a reference startup script in main.py. Below is a comprehensive explanation of the arguments for launching the AoR script, allowing for customized runs.
To run the GSM8K dataset with AoR, use the following script:
python aor.py \
--task GSM8K \
--data-path aor_outputs/GSM8K_CoT_gpt-3.5-turbo-0301.jsonl \
--record-path aor_records/GSM8K_AoR_log_gpt-3.5-turbo-0301.jsonl \
--inference-model gpt-35-turbo-0301
You'll need to replace the paths with your own directories:
--data-path: Path to your initial reasoning data in JSONL format.
--record-path: Path where you want to save the output results.
When running aor.py, you can customize AoR with the following input arguments:
--task: The reasoning task AoR needs to complete, e.g., GSM8K, AQuA, CSQA.
--data-path: The storage path for initial reasoning outcomes, in JSONL format, requiring question, answer, and response_list fields (see the example record after this list).
--record-path: The storage location for output results.
--inference-model: The model used for inference; the default is gpt-35-turbo-0301.
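For reference, a minimal record in the data file could be created as follows (the question and responses are illustrative; only the question, answer, and response_list fields are required):

import jsonlines

example = {
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "answer": "72",
    "response_list": [
        "In April she sold 48 clips. In May she sold 48 / 2 = 24 clips. 48 + 24 = 72. The answer is 72.",
        "Half of 48 is 24, so she sold 48 + 24 = 72 clips in total. The answer is 72.",
    ],
}
# Each line of the JSONL file holds one such record.
with jsonlines.open("aor_outputs/GSM8K_CoT_example.jsonl", mode="w") as writer:
    writer.write(example)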
Hyperparameters:
--initial-sample-size: Initial sample size for reasoning chains.
--max-sample-size: Upper limit for dynamic sampling of reasoning chains.
--batch-size: Batch size for sampling reasoning chains.
--representative-count: Representative count for the local scoring phase.
--scoring-threshold: Threshold for scoring during the local scoring phase.
--termination-threshold: Threshold for dynamic sampling termination.
--additional-sample-size: Number of additional reasoning chains sampled in each iteration.
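For example, a full run with explicit hyperparameters might look like this (the values shown are illustrative, not the defaults):

python aor.py \
    --task GSM8K \
    --data-path aor_outputs/GSM8K_CoT_gpt-3.5-turbo-0301.jsonl \
    --record-path aor_records/GSM8K_AoR_log_gpt-3.5-turbo-0301.jsonl \
    --inference-model gpt-35-turbo-0301 \
    --initial-sample-size 20 \
    --max-sample-size 40 \
    --batch-size 5 \
    --representative-count 2 \
    --scoring-threshold 8 \
    --termination-threshold 9 \
    --additional-sample-size 5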
- API Configuration:
  - Please fill in your API keys in code/inference.py. Our code supports both OpenAI and Azure APIs. We recommend using the Azure API due to observed instability with the OpenAI API.
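  - As a rough illustration of the two client setups with the openai >= 1.x SDK (the key, endpoint, API version, and deployment name below are placeholders; see code/inference.py for how the keys are actually read):

    from openai import AzureOpenAI, OpenAI

    # OpenAI API (placeholder key).
    openai_client = OpenAI(api_key="sk-...")

    # Azure OpenAI API (placeholder endpoint, version, and key); the model argument
    # must match your Azure deployment name, e.g. gpt-35-turbo-0301.
    azure_client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com/",
        api_version="2024-02-01",
        api_key="<your-azure-key>",
    )
    response = azure_client.chat.completions.create(
        model="gpt-35-turbo-0301",
        messages=[{"role": "user", "content": "What is 2 + 2?"}],
    )
    print(response.choices[0].message.content)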
- Prompt Customization:
  - In code/prompt.py, we provide general prompt templates for evaluation. We observed that evaluation criteria significantly affect model performance. We strongly recommend crafting specific evaluation rules for particular tasks to ensure reasonable and accurate assessments. Including appropriate demonstrations can help models better understand the evaluation criteria. You can add your custom prompts in code/prompt.py and specify them in the prompt_dict in code/aor.py corresponding to your desired task.
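  - Purely as an illustration of task-specific evaluation rules with a fixed answer format (the template text, placeholder names, and the assumption that prompt_dict maps task names to prompt templates are all hypothetical, not the repository's actual layout):

    # code/prompt.py (illustrative template, not the one shipped with the repository)
    GSM8K_LOCAL_SCORING_PROMPT = (
        "You are grading candidate solutions to a grade-school math word problem.\n"
        "Score each solution from 0 to 10 for correct arithmetic, sound step-by-step\n"
        "logic, and whether the final answer follows from the steps.\n\n"
        "Question: {question}\n"
        "Candidate solutions:\n{responses}\n\n"
        "Answer with one line per solution in the form 'Solution i: <score> - <reason>'."
    )

    # code/aor.py (illustrative): register the template for the GSM8K task.
    prompt_dict = {
        "GSM8K": GSM8K_LOCAL_SCORING_PROMPT,
    }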
- Parsing Methods:
  - We offer two parsing methods for evaluating reasoning chains: regular expression extraction and model-based automatic extraction. We found that regular expressions might not adapt well to the diverse outputs of large models. We recommend using the model-based automatic extraction method (automated_parse_local_scoring_response and automated_parse_global_evaluation_response). In our experiments, model-based parsing proved to be more stable. You can switch between methods based on your needs.
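  - As a rough sketch of the difference between the two approaches (the functions and prompt below are illustrative simplifications, not the repository's automated_parse_* implementations):

    import re

    def regex_parse_scores(evaluation_text):
        # Regex extraction: expects lines like "Solution 3: 8 - ...".
        # Brittle when the model deviates from this exact format.
        return {int(i): int(s)
                for i, s in re.findall(r"Solution\s+(\d+):\s*(\d+)", evaluation_text)}

    def model_parse_scores(evaluation_text, client, model="gpt-35-turbo-0301"):
        # Model-based extraction: ask the model to restate the scores in a fixed
        # "index: score" format, then parse that short, predictable reply.
        prompt = ("Extract the score assigned to each solution in the text below. "
                  "Reply only with lines of the form 'index: score'.\n\n" + evaluation_text)
        reply = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        ).choices[0].message.content
        return {int(i): int(s) for i, s in re.findall(r"(\d+)\s*:\s*(\d+)", reply)}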
Happy researching and coding! 🚀
In metric.py, we provide evaluation methods for various tasks, utilizing accuracy to assess the model's reasoning performance. We offer the process_pred function for evaluating a single reasoning chain and the process_pred_list function for evaluating multiple reasoning chains.
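Conceptually, chain-level accuracy amounts to the following sketch (the answer-extraction helper is illustrative and not the repository's implementation; metric.py defines the actual per-task logic):

def extract_final_answer(chain):
    # Illustrative helper: take the text after the last "The answer is" marker.
    marker = "The answer is"
    return chain.rsplit(marker, 1)[-1].strip(" .") if marker in chain else chain.strip()

def chain_accuracy(chains, gold_answer):
    # Fraction of reasoning chains whose extracted final answer matches the gold answer.
    return sum(extract_final_answer(c) == gold_answer for c in chains) / len(chains)

print(chain_accuracy(["48 + 24 = 72. The answer is 72.", "The answer is 70."], "72"))  # 0.5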
If you have any suggestions or questions, feel free to reach out via email at yinzhangyue@126.com. If you encounter any issues while using the code or find any bugs, please open a new issue on GitHub. We appreciate any constructive feedback that could help us improve. Thank you for your attention!
If you find our work useful, please cite our paper:
@inproceedings{yin-etal-2024-aggregation,
title = "Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models",
author = "Yin, Zhangyue and
Sun, Qiushi and
Guo, Qipeng and
Zeng, Zhiyuan and
Li, Xiaonan and
Sun, Tianxiang and
Chang, Cheng and
Cheng, Qinyuan and
Wang, Ding and
Mou, Xiaofeng and
Qiu, Xipeng and
Huang, Xuanjing",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.53",
pages = "609--625",
}