Does A cause B? Or does B cause A?
Pairwise causal discovery is a fundamental open problem: given two variables, the task is to determine which one causes the other. As a key benchmark for this task, Mooij et al. (2016) released the Tuebingen cause-effect pairs dataset, containing 108 pairs of real-world variables.
As a fun exploration, we present these pairs of variables as prompts to ChatGPT to study how well large language models can infer causality. ChatGPT performs significantly better than current state-of-the-art algorithms on the Tuebingen benchmark: on the 74 pairs we have tried so far, it obtains an accuracy of 92.5%. In comparison, the best known accuracy using conventional discovery methods is 70-80% (Mooij et al., 2016; Tagasovska et al., 2020; Compton et al., 2020; Salem et al., 2022).
Crucially, ChatGPT does not need access to the data for each variable. It can infer causality simply from the variable names. We use the following prompt for each cause-effect pair:
> Does changing [varA] cause a change in [varB]? Please answer in a single word: Yes or No.
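As an illustration, here is a minimal Python sketch of how the two prompts for a pair can be built from this template (the helper name and example variables are ours, not part of the benchmark):

```python
# Hypothetical helper: instantiate the prompt template for a cause-effect
# pair in both directions (A -> B and B -> A).
def make_prompts(var_a: str, var_b: str) -> tuple[str, str]:
    template = ("Does changing {} cause a change in {}? "
                "Please answer in a single word: Yes or No.")
    return template.format(var_a, var_b), template.format(var_b, var_a)

a_to_b, b_to_a = make_prompts("altitude", "temperature")
print(a_to_b)
# Does changing altitude cause a change in temperature?
# Please answer in a single word: Yes or No.
```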
We adopt the following protocol:
- Fetch the README.txt file from the Tuebingen benchmark website.
- Use the variable names provided in the README file. In case the variable names are ambiguous, refer to the dataset description provided on the same webpage and choose a descriptive variable name.
- Input two prompts to ChatGPT, one for causality from A to B, and another for causality from B to A. Record whether the answers are correct (1) or not (0).
- The per-pair accuracy is the average of the scores for the two questions (see the sketch below).
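Written out, the scoring rule looks like the following sketch; the function and argument names here are illustrative, not identifiers from this repo:

```python
# Score one cause-effect pair. answer_ab / answer_ba are ChatGPT's one-word
# replies to the A -> B and B -> A prompts; truth_ab is True when the
# ground-truth direction is A -> B.
def score_pair(answer_ab: str, answer_ba: str, truth_ab: bool) -> float:
    correct_ab = int((answer_ab.strip().lower() == "yes") == truth_ab)
    correct_ba = int((answer_ba.strip().lower() == "yes") == (not truth_ab))
    return (correct_ab + correct_ba) / 2  # average of the two answers

# Example: ground truth is A -> B; ChatGPT answers "Yes" then "No".
print(score_pair("Yes", "No", truth_ab=True))  # 1.0
```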
This repository contains four files:
- results.txt: A CSV file containing the results for each cause-effect pair. The first two columns record the results of "Does A cause B?" and "Does B cause A?", respectively: 1 means ChatGPT gave the correct answer, 0 means it gave the incorrect answer. This file is based on the README.txt file provided with the Tuebingen benchmark.
- prompts.txt: For reproducibility, the example prompt used for each cause-effect pair.
- pairmeta.txt: The recommended weights to be used when computing the overall accuracy on the benchmark.
- compute_benchmark_accuracy.ipynb: A simple notebook that uses results.txt and pairmeta.txt to compute the overall accuracy on the benchmark (see the sketch below).
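For reference, the core of that computation might look like the following sketch. It assumes (1) results.txt lists pairs in the same order as pairmeta.txt, with the A-to-B and B-to-A scores in its first two columns, and (2) the last whitespace-separated field of each pairmeta.txt row is that pair's weight; adjust the indices if the actual files differ.

```python
import csv

def weighted_accuracy(results_path: str = "results.txt",
                      meta_path: str = "pairmeta.txt") -> float:
    # Assumed format: the last field of each pairmeta.txt row is the weight.
    with open(meta_path) as f:
        weights = [float(line.split()[-1]) for line in f if line.strip()]

    with open(results_path) as f:
        rows = [r for r in csv.reader(f) if r]  # skip blank lines

    num = den = 0.0
    for row, w in zip(rows, weights):
        pair_acc = (float(row[0]) + float(row[1])) / 2  # average of the two answers
        num += w * pair_acc
        den += w
    return num / den

print(f"Weighted benchmark accuracy: {weighted_accuracy():.1%}")
```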
We'll soon be updating this repo with all 108 pairs! To add a new cause-effect pair:
- Refer to results.txt to find a cause-effect pair that has not been scored.
- Follow the protocol above to construct a prompt and get answers from ChatGPT.
- Update the first two columns of results.txt and then rerun the compute_benchmark_accuracy.ipynb notebook.
WARNING: ChatGPT is a large language model and comes with no guarantee of identifying the correct causal direction. Answers from ChatGPT or this repository should not be treated as verified causal relationships; we provide these results only for exploratory research. In practice, we expect that domain experts will need to verify such results before the inferred causal relationships are used in any downstream application.