This is the repository for our paper PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents.
Please make sure the openai package is installed and the API key has been exported to the environment variable OPENAI_API_KEY_OAI.
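As a quick sanity check before running any script, the snippet below verifies that the key is visible to Python. It is a minimal sketch: the helper name get_api_key is hypothetical and not part of this repository; only the variable name OPENAI_API_KEY_OAI comes from the instructions above.

```python
import os


def get_api_key(env_var: str = "OPENAI_API_KEY_OAI") -> str:
    """Return the API key stored in env_var, or raise a clear error.

    Hypothetical helper for illustration; the repository scripts may
    read the variable differently.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "export it before running the scripts."
        )
    return key
```

Calling get_api_key() once at the top of a session fails fast with a readable message instead of a later authentication error.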
- Download QuALITY data and unzip to the ./data/raw folder
- Run python data_preproc.py. This step produces two files in the data/processed folder: quality_dev_q.csv and quality_train_q.csv
PEARL mines actions from data of similar distribution (in this repo, the training data of the QuALITY dataset) instead of assuming a pre-defined action space. To mine the actions from the training set, run
bash ./script.sh action_mining
The above command writes the file ./output/mined_actions_init.txt, which stores the actions in the following format:
ANALYZE(CTX, X, Y) #Analyze the relationship, attitude, or feelings between X and Y, or the character, language, tone, or symbolism of X given the input CTX.
Note that we find the generation process is not entirely deterministic even after setting both temperature and top_p to 0. We provide examples of mined actions in output/mined_actions_init_example.txt.
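To inspect the mined actions programmatically, a line in the format shown above can be split into its name, arguments, and description. The following is a minimal sketch, assuming every action sits on a single line of the form NAME(ARG1, ARG2, ...) #description; the MinedAction type and parse_action_line function are hypothetical names, not part of this repository.

```python
import re
from typing import List, NamedTuple, Optional


class MinedAction(NamedTuple):
    name: str
    args: List[str]
    description: str


# Assumed line shape: NAME(ARG1, ARG2, ...) #free-text description
ACTION_RE = re.compile(r"^\s*([A-Z_][A-Z0-9_]*)\s*\(([^)]*)\)\s*#\s*(.*)$")


def parse_action_line(line: str) -> Optional[MinedAction]:
    """Parse one mined-action line; return None if it does not match."""
    match = ACTION_RE.match(line)
    if match is None:
        return None
    name, raw_args, description = match.groups()
    args = [a.strip() for a in raw_args.split(",") if a.strip()]
    return MinedAction(name, args, description.strip())
```

Lines that do not match the assumed shape return None, so they can be counted or logged rather than silently dropped.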
In our experiments, the total length of all mined actions exceeds the maximum context length of GPT-4-8k, so we added a step to simplify the mined actions:
bash ./script.sh action_simplification
Example actions simplified from output/mined_actions_init_example.txt are provided in output/mined_actions_simplified_example.txt. The number of actions can be adjusted by running multiple rounds of action simplification. More details are included in Section 4.1 of our paper.
We evaluate PEARL on a subset of QuALITY questions that are annotated as requiring long context to answer. For both the baselines and PEARL, the output is stored in the ./output folder, following the format {prompt_type}_out.{split}.{ctx_type}.csv. {split} and {ctx_type} are placeholders for, respectively, the original QuALITY split (train or dev) from which the example is extracted and the context size required to answer the question.
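The naming scheme above can be built and taken apart with a few lines of Python. This is a sketch under the assumption that none of the three placeholders contains a dot; output_path and parse_output_name are hypothetical helpers, not functions from this repository.

```python
from pathlib import Path
from typing import Tuple


def output_path(prompt_type: str, split: str, ctx_type: str,
                out_dir: str = "./output") -> Path:
    """Build {out_dir}/{prompt_type}_out.{split}.{ctx_type}.csv."""
    return Path(out_dir) / f"{prompt_type}_out.{split}.{ctx_type}.csv"


def parse_output_name(path: str) -> Tuple[str, str, str]:
    """Recover (prompt_type, split, ctx_type) from an output file name."""
    parts = Path(path).name.split(".")
    if len(parts) != 4 or parts[3] != "csv" or not parts[0].endswith("_out"):
        raise ValueError(f"Unexpected output file name: {path}")
    prompt_type = parts[0][: -len("_out")]
    return prompt_type, parts[1], parts[2]
```

For example, parse_output_name("./output/baseline_mcq_out.dev.ctx_eval_long.csv") yields the prompt type baseline_mcq, the split dev, and the context type ctx_eval_long.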
To run the multiple-choice question baseline, run
bash ./script.sh baseline_mcq
We provide output in ./output/baseline_mcq_out.{split}.{ctx_type}.csv
To run the free-form answer baseline, run
bash ./script.sh baseline_gqa
We provide output in ./output/baseline_gqa_out.{split}.{ctx_type}.csv
To run PEARL on the challenge subset of QuALITY, run
bash ./script.sh pearl
For PEARL, two files are generated:
- The .csv file that contains the plan, the answer, and the mapped answer, with the following fields:
  - qid: the question ID
  - plan: the generated plan
  - open-answer: the free-form answer generated by executing the plan
  - map-answer: the letter choice selected by GPT-4 based on the open-answer
  - gold: the gold label
- The output .pkl file that stores the intermediate output, where the keys are the output variables in the plan and the values are the executed results assigned to those output variables.
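Both files can be read with the Python standard library. The sketch below assumes the .csv has a header row with the five field names listed above and that the .pkl unpickles to plain Python containers; the exact nesting of the pickled object (for instance, whether it is grouped per question) is not specified here and should be checked on a real file. The function names are hypothetical.

```python
import csv
import pickle
from typing import Any, Dict, List

# Field names taken from the list above.
FIELDS = ["qid", "plan", "open-answer", "map-answer", "gold"]


def load_pearl_csv(csv_path: str) -> List[Dict[str, str]]:
    """Read a PEARL output .csv into a list of row dictionaries."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = [c for c in FIELDS if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"Missing expected columns: {missing}")
        return list(reader)


def load_intermediate(pkl_path: str) -> Any:
    """Unpickle the intermediate outputs (only load files you trust)."""
    with open(pkl_path, "rb") as f:
        return pickle.load(f)
```

Because unpickling can execute arbitrary code, load_intermediate should only be used on .pkl files produced by your own runs or obtained from a trusted source.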
We provide example .csv output of two runs with the gpt-4-0314 checkpoint in ./output/pearl_out.{split}.{ctx_type}.csv, as well as the intermediate output of one run in a .pkl file.
To see the intermediate step output, run the command in ./script.sh with --debug. Example output for executing one action is shown below: first the parsed action, then the executed output.
{'action': 'FIND_RELATION', 'args': ['CTX', '"Ro"', '"mother"'], 'output_var': 'ro_mother', 'detailed_action': 'Find and summarize the relationship between Ro and his mother in the input article'}
In the input article, Ro is a young Martian who has returned to his home ... The relationship between Ro and his mother seems to be one of respect and learning, as he remembers her words and uses them to navigate the challenges he faces.
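The parsed action above suggests how plan execution can chain steps: each argument is either the article (CTX), the output variable of an earlier step, or a quoted literal, and the result is stored under output_var. The following is a minimal sketch of that idea, not the repository's implementation; resolve_args, run_action, and the execute callable (a stand-in for the model call) are all hypothetical.

```python
from typing import Callable, Dict, List


def resolve_args(action: Dict, memory: Dict[str, str], ctx: str) -> List[str]:
    """Replace CTX and earlier output variables with their values."""
    resolved = []
    for arg in action["args"]:
        if arg == "CTX":
            resolved.append(ctx)             # the input article
        elif arg in memory:
            resolved.append(memory[arg])     # result of an earlier step
        else:
            resolved.append(arg.strip('"'))  # quoted literal such as '"Ro"'
    return resolved


def run_action(action: Dict, memory: Dict[str, str], ctx: str,
               execute: Callable[[str, List[str]], str]) -> str:
    """Execute one parsed action and store its result under output_var."""
    inputs = resolve_args(action, memory, ctx)
    result = execute(action["detailed_action"], inputs)
    memory[action["output_var"]] = result
    return result
```

After every step of a plan has run, memory maps each output variable to its executed result, which matches the key and value description of the .pkl file.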
Note that the code currently uses the provided examples in the prompt_bank for plan generation. To generate demonstrations with GPT-4 along with self-refinement, run
bash ./script.sh refine
The generated demonstrations will be printed out and can later be incorporated into prompt_bank/plan_gen.txt.
To compute the mapped answer accuracy for each method, run
python comp_acc.py baseline_mcq_out
# File: ./output/baseline_mcq_out.dev.ctx_eval_long.csv, accuracy: 81.2
# File: ./output/baseline_mcq_out.dev.ctx_eval_short.csv, accuracy: 84.4
# File: ./output/baseline_mcq_out.train.ctx_eval_long.csv, accuracy: 71.7
# Total accuracy: 78.7
python comp_acc.py baseline_gqa_out
# File: ./output/baseline_gqa_out.dev.ctx_eval_long.csv, accuracy: 71.5
# File: ./output/baseline_gqa_out.dev.ctx_eval_short.csv, accuracy: 79.1
# File: ./output/baseline_gqa_out.train.ctx_eval_long.csv, accuracy: 57.9
# Total accuracy: 68.8
python comp_acc.py pearl_out
# File: ./output/pearl_out.dev.ctx_eval_long.csv, accuracy: 77.4
# File: ./output/pearl_out.dev.ctx_eval_short.csv, accuracy: 76.7
# File: ./output/pearl_out.train.ctx_eval_long.csv, accuracy: 63.8
# Total accuracy: 72.2
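The numbers above can be approximated from the .csv rows with a short calculation. This sketch is not comp_acc.py itself: it assumes map-answer and gold hold the same kind of label (for example, a letter choice) and that the total pools all rows across files rather than averaging the per-file numbers. The pooling assumption is suggested by the reported totals differing from the unweighted mean of the per-file accuracies (for the multiple-choice baseline, that mean is 79.1, not 78.7).

```python
from typing import Dict, List

Row = Dict[str, str]


def accuracy(rows: List[Row]) -> float:
    """Percentage of rows whose map-answer equals the gold label."""
    if not rows:
        return 0.0
    correct = sum(
        1 for r in rows if r["map-answer"].strip() == r["gold"].strip()
    )
    return 100.0 * correct / len(rows)


def total_accuracy(files: Dict[str, List[Row]]) -> float:
    """Accuracy over all rows pooled across files (assumed behavior)."""
    pooled = [row for rows in files.values() for row in rows]
    return accuracy(pooled)
```

With pooling, larger files weigh more in the total, so a split with many questions pulls the overall number toward its own accuracy.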
@misc{sun2023pearl,
title={PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents},
author={Simeng Sun and Yang Liu and Shuohang Wang and Chenguang Zhu and Mohit Iyyer},
year={2023},
eprint={2305.14564},
archivePrefix={arXiv},
primaryClass={cs.CL}
}