Code for the paper "Evidence from counterfactual tasks supports emergent analogical reasoning in large language models."
Counterfactual letter string analogy problem sets are included in all_prob_synthetic_int1.npz (for interval-size-1) and all_prob_synthetic_int2.npz (for interval-size-2).
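If you want to inspect these problem sets directly, they can be loaded with NumPy. The sketch below only lists the contents of each archive; the internal array names are specific to this repository and are not assumed here:

```python
# Minimal sketch (not one of the repository's scripts): list the contents of a problem set.
import numpy as np

probs = np.load('./all_prob_synthetic_int1.npz', allow_pickle=True)
for key in probs.files:
    arr = probs[key]
    # Print each stored array's name and shape to see how the problems are organized.
    print(key, getattr(arr, 'shape', type(arr)))
```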
To evaluate GPT-4 on these problems (without code execution), run the following command for interval-size-1:
python3 ./eval_GPT4_letterstring.py --interval_size 1
and the following command for interval-size-2:
python3 ./eval_GPT4_letterstring.py --interval_size 2
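For orientation, the evaluation script queries the OpenAI chat completions API for each problem. The sketch below only illustrates that kind of call and is not the repository's code; the model name, prompt text, and temperature setting are placeholders:

```python
# Illustrative sketch only -- eval_GPT4_letterstring.py handles prompt construction,
# sampling settings, and result saving itself. Prompt and model name are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "Let's try to complete the pattern:\n\n[a b c] [a b d]\n[i j k] ["  # placeholder prompt
response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # assumed setting, for illustration only
)
print(response.choices[0].message.content)
```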
To run experiments using the older GPT-4 engine, include the argument --gpt4_engine gpt-4-1106-preview.
To analyze GPT-4's performance, run the following command for interval-size-1 (again specifying the older GPT-4 engine if desired):
python3 ./analyze_GPT4_letterstring.py --interval_size 1
and the following command for interval-size-2:
python3 ./analyze_GPT4_letterstring.py --interval_size 2
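The analysis scripts summarize accuracy over the saved responses. As a rough illustration (not the repository's code), overall accuracy and a binomial confidence interval can be computed as below, where correct is a hypothetical 0/1 array standing in for graded responses:

```python
# Hedged sketch: compute accuracy and a Clopper-Pearson 95% CI from placeholder data.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

correct = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # placeholder 0/1 correctness flags
acc = correct.mean()
ci_low, ci_high = proportion_confint(correct.sum(), correct.size, method='beta')
print(f'accuracy = {acc:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]')
```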
To evaluate GPT-4 + code execution on these problems, run the following command, specifying the interval size and problem type:
python3 ./eval_GPT4_code_execution_letterstring.py --interval_size 1 --prob_type succ
The full set of problem types is ['succ', 'pred', 'add_letter', 'remove_redundant', 'fix_alphabet', 'sort']. To run experiments using the alternative synthetic alphabet, include the argument --alt_alphabet. This evaluation script is interactive: each response from GPT-4 is presented along with the correct answer, and the user is prompted to indicate whether the answer is correct (by entering 1), incorrect (by entering 0), or whether no answer was provided (by entering -1).
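The grading interaction follows a simple prompt-and-enter pattern; the sketch below illustrates it in simplified form (the actual script's prompts, bookkeeping, and file handling differ):

```python
# Simplified sketch of the interactive grading pattern (not the repository's code).
def grade(response_text, correct_answer):
    print('Model response:\n', response_text)
    print('Correct answer:', correct_answer)
    while True:
        entry = input('Correct (1), incorrect (0), or no answer (-1)? ')
        if entry in ('1', '0', '-1'):
            return int(entry)
        print('Please enter 1, 0, or -1.')
```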
To analyze the performance of GPT-4 + code execution, run the following command (specifying the interval size and whether the alternative synthetic alphabet was used):
python3 ./analyze_GPT4_code_execution_letterstring.py
To analyze the errors made by GPT-4 + code execution, run the following command:
python3 ./analyze_error_types_GPT4_code_execution.py
For each incorrectly answered problem, the full response and the correct answer are presented, and the user is prompted to indicate whether the response reflects a valid alternative rule (by entering 1) or is simply wrong (by entering 0). To summarize the results of this analysis, run the following command:
python3 ./display_error_types_GPT4_code_execution.py
To run the analysis comparing the performance of GPT-4 + code execution on the original vs. alternative synthetic alphabets, run the following command:
python3 ./compare_synthetic_alphabets.py
To run the analysis comparing the performance of GPT-4 with the old (1106) vs. new (0125) engines, run the following command:
python3 ./compare_GPT4_engines.py
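Both comparison analyses contrast accuracy between two conditions. As a hedged illustration only (the counts below are placeholders, not results from the paper, and the exact test used by these scripts is not assumed), a two-proportion comparison with statsmodels looks like this:

```python
# Hedged sketch of a two-condition accuracy comparison with placeholder counts.
from statsmodels.stats.proportion import proportions_ztest

correct_counts = [40, 46]   # placeholder: condition A, condition B
n_problems = [50, 50]       # placeholder problem counts per condition
stat, p = proportions_ztest(correct_counts, n_problems)
print(f'z = {stat:.2f}, p = {p:.3f}')
```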
To create CSV files for statistical analyses, run the following command:
python3 ./create_csv.py
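The CSV files collect per-problem results for the statistical analyses. The sketch below shows the general idea using the standard csv module; the column names and fields are illustrative and do not reflect the exact format produced by create_csv.py:

```python
# Hedged sketch of writing per-problem results to CSV; columns are hypothetical.
import csv

rows = [
    {'prob_type': 'succ', 'interval_size': 1, 'correct': 1},  # placeholder rows
    {'prob_type': 'pred', 'interval_size': 2, 'correct': 0},
]
with open('results_example.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['prob_type', 'interval_size', 'correct'])
    writer.writeheader()
    writer.writerows(rows)
```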
To perform statistical analyses, run the following R script:
./analysis.R
All data for the results presented in the paper (including both human behavioral data and results for the evaluations of GPT-4 and GPT-4 + code execution) are included in this repository.
The following dependencies are required to run this code:
- Python 3
- OpenAI Python Library
- NumPy
- SciPy
- statsmodels
- Matplotlib
- R
All code was written by Taylor Webb.