initial commit
Shawn0918 committed May 7, 2024
1 parent ae7b05c commit 245120e
Showing 16 changed files with 987 additions and 1 deletion.
79 changes: 78 additions & 1 deletion README.md
@@ -1 +1,78 @@
# llm-pragmatics
# llm-implicatures
This repository contains code and data used in the project **[Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom](https://arxiv.org/abs/2404.19509)**.

## Data Description 数据集介绍
**SwordsmanImp** contains 200 Chinese dialogues extracted from the sitcom *My Own Swordsman*. Each dialogue contains one character line that conveys a non-literal meaning. We provide four interpretations for each of these lines: a pragmatic explanation, a literal explanation, and two semantically related distractors. The dialogues are also analyzed under Grice's cooperative principle and annotated with the Gricean maxims they violate. Below is an example entry from the dataset.

Note that to prevent data contamination, we put the data in a password-protected zip file (password: 0135).

**SwordsmanImp**包含了从情景喜剧《武林外传》中提取的200段人物对话。每个对话中都有一句人物台词包含言外之意。我们为这些含有言外之意的台词提供了四种解释,包括一个语用学范畴的解读,一个字面含义以及两个与上下文语境相关的干扰解读。我们用Grice的合作原则对对话进行分析,并标注了每段对话所违反的Gricean maxim。以下是数据集中的一个条目。
![](graph/data_eg.png)

为了避免数据污染,我们将数据放在一个zip文件中,密码为:0135。
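
A minimal loading sketch (assuming the archive uses standard zip encryption readable by Python's ```zipfile``` and that ```Questions.xlsx```, used by the scripts below, sits at the archive root):

```python
# Sketch only: data.zip is assumed to use standard (ZipCrypto) encryption and to
# contain Questions.xlsx at its root; adjust the path/password handling if not.
import zipfile
import pandas as pd

with zipfile.ZipFile("data.zip") as zf:
    zf.extractall(pwd=b"0135")               # password from the note above

questions = pd.read_excel("Questions.xlsx")  # requires openpyxl
print(len(questions), "dialogues loaded")
```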

### Meta information
| |Total|Quantity|Quality|Relevance|Manner|
|----|----|----|-----|----|----|
|Number of questions|200|76|33|71|62|
|Average number of turns per dialogue|6.80|7.84|5.91|6.23|6.35|
|Average dialogue length|158.22|184.53|143.67|147.20|152.79|
|Average utterance length|23.27|23.53|24.31|23.64|24.04|
|Average answer length| 15.08|14.47|14.85|15.82|14.86|

## Model performance
We tested eight open-source and closed-source LLMs on our dataset. Here are the results:
![](graph/by_choice.png)
![](graph/by_maxim.png)
See our [paper](https://arxiv.org/abs/2404.19509) for more details.

## Run the code
**Note:** You need to copy the ```Questions.xlsx``` extracted from ```data.zip``` into the ```eval_chat``` and ```eval_logit``` folders before running the scripts.

### Free-form evaluation paradigms
Folder ```eval_chat/``` contains code to evaluate models' performance on this pragmatic task through their free-form responses. First enter your API keys in ```eval_chat/collect_api.py```. To collect answers, use ```collect_api.py```, ```collect_transformer.py```, or ```collect_transformer_cpu.py```, depending on the model and your device. For example, to evaluate GPT-4, the command is

```python3 question_form.py sequence --all | python3 produce_prompts.py | python3 collect_api.py --model gpt-4 --expl_dir <FOLDER_NAME> --index```

The responses are collected in a CSV file in the target folder. You can then extract the answer choices from the free-form responses with regular expressions or by hand, for example along the lines of the sketch below.
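
A possible post-processing sketch (not part of the repository): the CSV columns (```Index```, ```Response```, ```maxim```) match those written by ```collect_api.py```, while the path and the regular expression are only illustrative guesses.

```python
# Hypothetical sketch: pull the chosen option (A-D) out of each free-form response.
# Assumes the default --expl_dir ("expl") and a gpt-4 run; adjust the path as needed.
import re
import pandas as pd

responses = pd.read_csv("expl/gpt-4/gpt-4_response.csv")

def extract_choice(text):
    # Match an option letter not followed by another Latin letter, e.g. "A", "(B)", "选C。".
    match = re.search(r"([ABCD])(?![A-Za-z])", str(text))
    return match.group(1) if match else None

responses["choice"] = responses["Response"].apply(extract_choice)
print(responses["choice"].value_counts(dropna=False))
```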

### Next word prediction paradigm
**Note:** The code here was written with reference to [jennhu/lm-pragmatics](https://github.com/jennhu/lm-pragmatics).
Scripts in ```eval_logit/``` evaluate models' pragmatic understanding by collecting the probability distribution they predict for the next token after the prompt. Only open-source models are supported in this paradigm. Run ```eval_logit/exp.sh``` to replicate the results in our paper. You can also evaluate other models by changing the values after ```-n``` (a custom name for the model) and ```-m``` (the model's id on Huggingface). See the scripts for the other supported parameters.
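
As a minimal illustration of the paradigm (independent of the actual ```eval_logit/``` scripts; the model id and prompt below are placeholders), one can compare the probabilities a causal LM assigns to the option letters as the next token:

```python
# Sketch only: score the option letters A-D as the next token after the prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"                      # placeholder; use the Huggingface id you want to test
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

prompt = "请从A、B、C、D中选出最合适的解释。答案："   # placeholder multiple-choice prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
probs = torch.softmax(next_token_logits, dim=-1)

# Take the first sub-token of each option letter as its representative token id.
option_ids = [tokenizer.encode(opt, add_special_tokens=False)[0] for opt in "ABCD"]
dist = {opt: probs[tid].item() for opt, tid in zip("ABCD", option_ids)}
print(dist, "->", max(dist, key=dist.get))
```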

**Columns in the result file:**

```condition```: The question type (represented by the Gricean maxim violated in the dialogue).

```choice_annotation```: The types of the four choices before the choice order is randomized (separated by ##).

```choices_aft_rand```: The explanation types after the choice order is randomized (separated by ##).

```correct_aft_rand```: The numeric index (starting from 1) of the correct answer.

```temperature```: Ranges between 0 and 1; controls the degree of randomness in the LLM's answer generation.

```distribution```: The predicted probability distribution over the four answer tokens (A, B, C, and D) following the input prompt.

```model_answer```: The model's answer, i.e., the answer token it assigns the highest probability.
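
A sketch of how these columns can be used to summarize a result file (the path is a placeholder, and it assumes ```model_answer``` is stored in the same format as ```correct_aft_rand```):

```python
# Hypothetical summary of one result file produced under eval_logit/.
import pandas as pd

results = pd.read_csv("results/some_model.csv")   # placeholder path
results["correct"] = results["model_answer"] == results["correct_aft_rand"]
print(results.groupby("condition")["correct"].mean())   # accuracy per violated maxim
```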

## Acknowledgements
We thank Xinjia Qi, Qiyu Sun and Yaqian Zhang for verifying the implicatures and improving the dataset. We also thank all anonymous participants for their support in this study. This project is funded by a Pujiang program grant (22PJC063) awarded to Hai Hu.

## Citation
```
@misc{yue2024large,
  title={Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom},
  author={Shisen Yue and Siyuan Song and Xinyuan Cheng and Hai Hu},
  year={2024},
  eprint={2404.19509},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Binary file added data.zip
Binary file not shown.
132 changes: 132 additions & 0 deletions eval_chat/collect_api.py
@@ -0,0 +1,132 @@
import openai
import json
import pandas as pd
import argparse
import os
from requests.exceptions import Timeout
import google.generativeai as genai
import sys
import csv


def get_collection_responses(prompt_, model, temp):
    """Query a legacy completion model, retrying on timeouts."""
    response = None
    for _ in range(MAX_RETRIES):
        try:
            response = openai.Completion.create(
                model=model,
                prompt=prompt_,
                temperature=temp,
            )
            break
        except Timeout:
            print("Request timed out. Retrying...")
        except Exception as e:
            print(f"An error occurred: {e}")
            break  # Handle other exceptions as needed
    if response is None:
        raise RuntimeError("Failed to get a response after retries.")
    return response['choices'][0]['text']


def get_gemini(message, model, temp):
    """Query a Gemini model, retrying on timeouts."""
    response = None
    for _ in range(MAX_RETRIES):
        try:
            response = model.generate_content(message)
            break
        except Timeout:
            print("Request timed out. Retrying...")
        except Exception as e:
            print(f"An error occurred: {e}")
            break  # Handle other exceptions as needed
    if response is None:
        raise RuntimeError("Failed to get a response after retries.")
    return response.parts


def get_Chat_responses(message, model, temp):
    """Query a chat-completion model, retrying on timeouts."""
    response = None
    for _ in range(MAX_RETRIES):
        try:
            response = openai.ChatCompletion.create(
                model=model,
                messages=message,
                temperature=temp,
            )
            break
        except Timeout:
            print("Request timed out. Retrying...")
        except Exception as e:
            print(f"An error occurred: {e}")
            break
    if response is None:
        raise RuntimeError("Failed to get a response after retries.")
    return response['choices'][0]['message']['content']


if __name__ == "__main__":
    legacy_models = ["text-davinci-003", "babbage-002", "davinci-002", "text-davinci-002", "davinci", "curie", "ada", "babbage"]
    newer_models = ["gpt-4", "gpt-3.5-turbo", 'default-model']

    argparser = argparse.ArgumentParser()
    argparser.add_argument('--style', choices=['verbose', 'concise'], default="concise", help="Choose verbose to also keep responses in txt files.")
    argparser.add_argument('-m', '--model', type=str, default="gpt-4")
    argparser.add_argument('--expl_dir', type=str, default="expl")
    argparser.add_argument('--index', action="store_true", default=False, help="Whether to name the response files with the question indices from the original form.")
    argparser.add_argument('-t', '--temperature', type=float, default=0)

    args = argparser.parse_args()
    model_name = args.model
    expl_dir = args.expl_dir
    temperature = args.temperature

    # Set up the API interface
    if model_name == 'chatglm3':
        # ChatGLM3 is served locally through an OpenAI-compatible endpoint
        openai.api_base = "http://127.0.0.1:8000/v1"
        model_name = "default-model"
        query_func = get_Chat_responses
        model = model_name
    elif model_name == 'gemini_pro':
        genai.configure(api_key="")  # Add your Gemini API key here
        model = genai.GenerativeModel(model_name='gemini-pro')
        query_func = get_gemini
    else:
        api_key = ''  # Add your OpenAI API key here
        openai.api_key = api_key
        model = model_name
        if model_name in newer_models:
            query_func = get_Chat_responses
        else:
            query_func = get_collection_responses

    MAX_RETRIES = 3

    # Create the output directories if they do not exist
    if not os.path.exists(expl_dir):
        os.makedirs(expl_dir)
    if not os.path.exists(f"{expl_dir}/{model_name}"):
        os.makedirs(f"{expl_dir}/{model_name}")

    # Questions are piped in on stdin as a CSV produced by produce_prompts.py
    df = pd.read_csv(sys.stdin)
    inputs = df['Question']
    csv_file = f"{model_name}_response.csv"
    if csv_file not in os.listdir(f"{expl_dir}/{model_name}"):
        with open(f"{expl_dir}/{model_name}/{csv_file}", mode='w', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(['Index', 'Response', 'maxim'])
    for i, input in enumerate(inputs):
        # Build the prompt: chat models get a system message, legacy models a plain prefix
        if model_name in newer_models:
            prompt = [{"role": "system", "content": "你现在是一个中文母语者。"}, {"role": "user", "content": input}]
        else:
            prompt = "你现在是一个中文母语者。" + input
        response = query_func(prompt, model, temperature)
        # Save the response
        if args.index:
            idx = df['Index'][i]
        else:
            idx = i + 1
        if args.style == 'verbose':
            with open(f"{expl_dir}/{model_name}/{model_name}_response_{idx}.txt", "w") as f:
                f.write(f"Question {idx}: {input}\n")
                f.write(f"Response {idx}: {response}\n")
            print(f"Response {idx} saved to response_{idx}.txt and {model_name}_response.csv")
        else:
            print(f"Adding Response {idx} to {model_name}_response.csv")
        with open(f"{expl_dir}/{model_name}/{model_name}_response.csv", mode='a', newline='') as file:
            writer = csv.writer(file)
            writer.writerow([idx, response, df['maxim'][i]])


74 changes: 74 additions & 0 deletions eval_chat/collect_transformer.py
@@ -0,0 +1,74 @@
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import sys
import argparse
import pandas as pd
import os
import csv

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model_id", type=str, default="gpt2")
parser.add_argument("--precision", choices=['half', 'full'], default='half')
parser.add_argument("--cuda", choices=['cuda:1', 'cuda:0', 'cuda'], default='cuda')
parser.add_argument("--expl_dir", type=str, default="expl")
parser.add_argument("--style", choices=['verbose', 'concise'], default="concise", help="Choose verbose to also keep responses in txt files.")
parser.add_argument("--legacy", action="store_true", default=False, help="Whether to use the legacy tokenizer (as reported when using the Chinese Alpaca model).")
parser.add_argument("--index", action="store_true", default=False, help="Whether to name the response files with the question indices from the original form.")
args = parser.parse_args()

if not os.path.exists(args.expl_dir):
    os.mkdir(args.expl_dir)

expl_dir = args.expl_dir
model_id = args.model_id
model_name = model_id.split("/")[-1]
if not os.path.exists(f"{expl_dir}/{model_name}"):
    os.makedirs(f"{expl_dir}/{model_name}")

# Load the model on the requested device(s)
if args.cuda == 'cuda':
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
else:
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").cuda(args.cuda)
if args.precision == 'half':
    model = model.half()
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, legacy=args.legacy)

# Questions are piped in on stdin as a CSV produced by produce_prompts.py
df = pd.read_csv(sys.stdin)
inputs = df['Question']
csv_file = f"{model_name}_response.csv"
if csv_file not in os.listdir(f"{expl_dir}/{model_name}"):
    with open(f"{expl_dir}/{model_name}/{csv_file}", mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['Index', 'Response', 'maxim'])
for i, input in enumerate(inputs):
    # Prepend the system instruction to the question
    prompt = "你现在是一个中文母语者。" + input
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    encoded_prompt = encoded_prompt.to(args.cuda)
    output_sequences = model.generate(
        input_ids=encoded_prompt,
        max_new_tokens=50,  # 300 for text generation, 50 for multiple choice
        temperature=0.9,
        top_k=3,  # 0 for text generation, 3 for multiple choice
        top_p=0.9,
        repetition_penalty=1.0,
        do_sample=True,
        num_return_sequences=1,
    )
    response = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
    # Save the response
    if args.index:
        idx = df['Index'][i]
    else:
        idx = i + 1
    if args.style == 'verbose':
        with open(f"{expl_dir}/{model_name}/{model_name}_response_{idx}.txt", "w") as f:
            f.write(f"Question {idx}: {input}\n")
            f.write(f"Response {idx}: {response}\n")
        print(f"Response {idx} saved to response_{idx}.txt and {model_name}_response.csv")
    else:
        print(f"Adding Response {idx} to {model_name}_response.csv")
    with open(f"{expl_dir}/{model_name}/{model_name}_response.csv", mode='a', newline='') as file:
        writer = csv.writer(file)
        writer.writerow([idx, response, df['maxim'][i]])
69 changes: 69 additions & 0 deletions eval_chat/collect_transformer_cpu.py
@@ -0,0 +1,69 @@
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModel
import torch
import sys
import argparse
import pandas as pd
import os
import csv

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model_id", type=str, default="gpt2")
parser.add_argument("--expl_dir", type=str, default="expl")
parser.add_argument("--style", choices=['verbose', 'concise'], default="concise", help="Choose verbose to also keep responses in txt files.")
parser.add_argument("--legacy", action="store_true", default=False, help="Whether to use the legacy tokenizer (as reported when using the Chinese Alpaca model).")
parser.add_argument("--index", action="store_true", default=False, help="Whether to name the response files with the question indices from the original form.")
parser.add_argument("--max_new_tokens", type=int, default=50, help="The maximum number of tokens to generate.")
args = parser.parse_args()

if not os.path.exists(args.expl_dir):
    os.mkdir(args.expl_dir)

expl_dir = args.expl_dir
model_id = args.model_id
model_name = model_id.split("/")[-1]
if not os.path.exists(f"{expl_dir}/{model_name}"):
    os.makedirs(f"{expl_dir}/{model_name}")

# Load the model on the CPU
model = AutoModelForCausalLM.from_pretrained(model_id).cpu()
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, legacy=args.legacy)

# Questions are piped in on stdin as a CSV produced by produce_prompts.py
df = pd.read_csv(sys.stdin)
inputs = df['Question']
csv_file = f"{model_name}_response.csv"
if csv_file not in os.listdir(f"{expl_dir}/{model_name}"):
    with open(f"{expl_dir}/{model_name}/{csv_file}", mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['Index', 'Response', 'maxim'])
for i, input in enumerate(inputs):
    # Prepend the system instruction to the question
    prompt = "你现在是一个中文母语者。" + input
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    encoded_prompt = encoded_prompt.cpu()
    output_sequences = model.generate(
        input_ids=encoded_prompt,
        max_new_tokens=args.max_new_tokens,  # 512 for text generation, 50 for multiple choice
        temperature=0.9,
        top_k=3,  # 0 for text generation, 3 for multiple choice
        top_p=0.9,  # 0.9 for text generation, 0.1 for multiple choice
        repetition_penalty=1.0,
        do_sample=True,
        num_return_sequences=1,
    )
    response = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
    # Save the response
    if args.index:
        idx = df['Index'][i]
    else:
        idx = i + 1
    if args.style == 'verbose':
        with open(f"{expl_dir}/{model_name}/{model_name}_response_{idx}.txt", "w") as f:
            f.write(f"Question {idx}: {input}\n")
            f.write(f"Response {idx}: {response}\n")
        print(f"Response {idx} saved to response_{idx}.txt and {model_name}_response.csv")
    else:
        print(f"Adding Response {idx} to {model_name}_response.csv")
    with open(f"{expl_dir}/{model_name}/{model_name}_response.csv", mode='a', newline='') as file:
        writer = csv.writer(file)
        writer.writerow([idx, response, df['maxim'][i]])