
Logic_Symbolic_Solvers_Experiment

Data and Code for A Closer Look at Logical Reasoning with LLMs: The Choice of Tool Matters (Outstanding Paper at ALTA 2024)

Authors: Matthew Lam, Ramya Thatikonda, Ehsan Shareghi

Monash University

Introduction

The emergence of Large Language Models (LLMs) has brought promising progress on logical reasoning tasks. Several recent approaches propose to change the role of the LLM from reasoner to translator, converting natural language statements into symbolic representations that are then passed to external symbolic solvers to resolve. This paradigm has established the current state-of-the-art results in logical (i.e., deductive) reasoning. However, it remains unclear whether the variance in performance of these approaches stems from the methodologies employed or from the specific symbolic solvers used; there is no consistent comparison of symbolic solvers and of how they influence the overall reported performance. This matters because each symbolic solver has its own input symbolic language, presenting varying degrees of difficulty in the translation process. To address this gap, we perform experiments on 3 deductive reasoning benchmarks with LLMs augmented with widely used symbolic solvers: Z3, Pyke, and Prover9. The tool-executable rates of the symbolic translations generated by different LLMs vary by up to 30%, highlighting a significant difference in performance rooted in a very basic choice of tool. The almost linear correlation between the executable rate of translations and the accuracy of Prover9's outcomes indicates a strong alignment between LLMs' ability to translate into Prover9's symbolic language and the correctness of those translations.

Installation

First, install all the required packages:

pip install -r requirements.txt

Datasets

We evaluate on the following datasets:

  • ProntoQA: Deductive reasoning dataset.
  • ProofWriter: Deductive reasoning dataset.
  • FOLIO: First-Order Logic reasoning dataset.

Logic Program Generation

To generate logic programs for the logical reasoning problems in each dataset, run the following command from the root directory:

python logic_program.py \
    --api_key "Your OpenAI API Key" \
    --solver "Z3/Pyke/Prover9" \
    --depth "d2/d3/d5" \
    --dataset_name "Dataset Name [ProntoQA | ProofWriter | FOLIO]" \
    --model_name "Model Name [gpt-3.5-turbo | gpt-4o | command-r-plus | gemini-1.0-pro-latest]" \
    --shot "1/2/4" \
    --max_new_tokens 2000

The generated logic programs will be saved in Answered_Datasets.
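
For orientation, below is a minimal sketch (assuming the z3-solver Python package) of the kind of symbolic program a solver consumes for a toy ProofWriter-style problem. It is an illustration only and does not reproduce the exact program format emitted by logic_program.py.

from z3 import Bool, Implies, Not, Solver, unsat

# Toy problem: "All cats are animals. Tom is a cat. Is Tom an animal?"
cat_tom = Bool("cat_tom")
animal_tom = Bool("animal_tom")

premises = [Implies(cat_tom, animal_tom), cat_tom]
query = animal_tom

# The query is entailed iff the premises together with Not(query) are unsatisfiable.
solver = Solver()
solver.add(*premises)
solver.add(Not(query))
print("True" if solver.check() == unsat else "Unknown/False")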

Logic Inference with Symbolic Solver

After generating the logic programs, we can perform inference with the symbolic solvers. From the root directory, run the following command:

python logic_inference.py \
    --solver "Z3/Pyke/Prover9" \
    --depth "d2/d3/d5" \
    --dataset_name "Dataset Name [ProntoQA | ProofWriter | FOLIO]" \
    --model_name "Model Name [gpt-3.5-turbo | gpt-4o | command-r-plus | gemini-1.0-pro-latest]" \
    --shot "1/2/4"

The logical reasoning results will be saved in Processed_Datasets.
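
To give a rough sense of what this step does conceptually, here is a minimal sketch of checking entailment for the same toy problem with Prover9 through NLTK's interface. This assumes the Prover9 binary is installed and discoverable by NLTK; logic_inference.py wraps the solvers in its own way.

from nltk.sem.logic import Expression
from nltk.inference.prover9 import Prover9

read_expr = Expression.fromstring
premises = [
    read_expr("all x.(cat(x) -> animal(x))"),  # All cats are animals.
    read_expr("cat(tom)"),                     # Tom is a cat.
]
goal = read_expr("animal(tom)")                # Is Tom an animal?

prover = Prover9(timeout=10)
print(prover.prove(goal, premises))  # True if the goal is entailed by the premises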

Evaluation

To evaluate the logical reasoning results, run the following command:

python evaluation.py \
    --solver "Z3/Pyke/Prover9" \
    --depth "d2/d3/d5" \
    --dataset_name "Dataset Name [ProntoQA | ProofWriter | FOLIO]" \
    --model_name "Model Name [gpt-3.5-turbo | gpt-4o | command-r-plus | gemini-1.0-pro-latest]" \
    --shot "1/2/4"
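
As a hedged sketch of what such an evaluation might compute, the snippet below derives accuracy and tool-executable rate from a saved results file. The file path and field names ("status", "predicted_answer", "answer") are hypothetical; evaluation.py defines the actual schema.

import json

def summarise(path: str) -> None:
    # Hypothetical schema: one record per question with a solver execution
    # status, the predicted label, and the gold label.
    with open(path) as f:
        records = json.load(f)

    executable = [r for r in records if r.get("status") == "success"]
    correct = [r for r in executable if r.get("predicted_answer") == r.get("answer")]

    print(f"Executable rate: {len(executable) / len(records):.2%}")
    print(f"Accuracy:        {len(correct) / len(records):.2%}")

summarise("Processed_Datasets/example_results.json")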

This paper's code was inspired by SatLM and LogicLM.
