This repository contains the code, dataset, and models in our paper: ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting. We release:
- The 44K CoT data generated based on our proposed CoTGenius framework.
- The code for generating the data.
- The code for fine-tuning.
- The code for evaluating the model.
- The code for CoT debating.
CoTGenius is a Chain-of-Thought improvement framework for synthesizing more complicated, diverse, and detailed CoT rationales. In this framework, we introduce three evolution strategies for improving CoT, i.e., complicate, diversify, and specify. Following CoTGenius, we generate a large-scale CoT dataset containing 44,335 samples covering commonsense reasoning, mathematical reasoning, scientific reasoning, and symbolic reasoning. Furthermore, we fine-tune open-source LLMs (i.e., Llama 2-Chat 7B and 13B) on our evolved CoT data, yielding models we call ChainLM, and compare ChainLM to existing popular LLMs on 9 complex reasoning datasets. Finally, based on our ChainLM model, we propose a CoT reasoning strategy, step-level debating.
The Overall Framework of CoTGenius
The directory `data` contains 44K CoT samples generated after 4 rounds based on CoTGenius.
- `train_data.json`: all the improved CoT data from the 4 rounds.
- `no_cs.json`: the data after removing the commonsense reasoning category.
- `no_math.json`: the data after removing the mathematical reasoning category.
- `no_sci.json`: the data after removing the scientific reasoning category.
- `no_sym.json`: the data after removing the symbolic reasoning category.
- `seed.json`: the seed dataset used for generation.
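The released files are plain JSON and can be loaded with a short helper. Note that the exact schema (field names) of the samples is not documented here, so inspect it before use:

```python
import json

def load_cot_data(path):
    """Load one of the released CoT JSON files and return its samples."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Example (run from the repo root):
# samples = load_cot_data("data/train_data.json")
# print(len(samples), samples[0].keys())  # inspect the schema before use
```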
Our data generation process is a combination of three pipelines.
- Complicate: First, we use the complication strategy to complicate the questions in the original data. Second, we conduct evolutionary success judgement based on the complexity of the new questions. Then we generate answers to the new questions. Finally, we conduct correctness verification on the new <question, CoT> samples.
- Diversify: Similar to the complicate pipeline, but with diversification methods guiding question generation.
- Specify: We first rewrite the CoTs in the seed dataset and then conduct evolutionary success judgement.
To run the generation process with CoTGenius, three scripts (`complicate.sh`, `diversify.sh`, `specify.sh`) are provided in `generate/`.
cd generate
bash complicate.sh
bash diversify.sh
bash specify.sh
We fine-tune the Llama 2-Chat 7B and 13B models on our dataset and call the resulting CoT fine-tuned models ChainLM. The fine-tuning code is adapted from Alpaca.
cd fine-tune
bash run.sh
We evaluate on 9 datasets that are independent of the seed dataset and report the performance.
cd evaluate
bash test.sh
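Evaluation ultimately reduces to comparing extracted final answers against gold labels. A minimal accuracy computation (our illustration, not the repo's exact metric code) could look like:

```python
def accuracy(predictions, references):
    """Fraction of predictions whose extracted final answer matches the
    reference. Assumes both sides are already normalized answer strings."""
    assert len(predictions) == len(references)
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(predictions)
```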
Based on our ChainLM models, we propose the Step-level CoT Debating strategy. To evaluate with CoT debating:
cd debate
bash run.sh
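As a rough illustration of step-level debating, multiple debater models can each propose the next reasoning step, with a judge selecting one before the debate continues. The interface below is hypothetical; the actual implementation is invoked via `debate/run.sh`.

```python
def step_level_debate(question, debaters, judge, max_steps=8):
    """Sketch of step-level CoT debating (hypothetical interface).

    debaters: callables (question, steps_so_far) -> proposed next step
    judge:    callable  (question, steps_so_far, candidates) -> chosen step
    """
    steps = []
    for _ in range(max_steps):
        # Each debater proposes a candidate for the next reasoning step.
        candidates = [debater(question, steps) for debater in debaters]
        # The judge selects one candidate to append to the shared chain.
        chosen = judge(question, steps, candidates)
        steps.append(chosen)
        # Stop once a chosen step commits to a final answer.
        if "final answer" in chosen.lower():
            break
    return steps
```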