initial commit
Shawn0918 committed May 7, 2024
1 parent ae7b05c commit 245120e
Showing 16 changed files with 987 additions and 1 deletion.
79 changes: 78 additions & 1 deletion README.md
@@ -1 +1,78 @@
# llm-pragmatics
# llm-implicatures
This repository contains code and data used in the project **[Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom](https://arxiv.org/abs/2404.19509)**.

## Data Description 数据集介绍
**SwordsmanImp** contains 200 Chinese dialogues extracted from the sitcom *My Own Swordsman*. Each dialogue contains one character line that conveys a non-literal meaning. We provide four interpretations for each of these lines: a pragmatic explanation, a literal explanation, and two semantically related distractors. The dialogues are also analyzed under Grice's cooperative principle and annotated with the Gricean maxims they violate. Below is an example entry from the dataset.

Note that to prevent data contamination, we put the data in a password-protected zip file (password: 0135).

**SwordsmanImp**包含了从情景喜剧《武林外传》中提取的200段人物对话。每个对话中都有一句人物台词包含言外之意。我们为这些含有言外之意的台词提供了四种解释,包括一个语用学范畴的解读,一个字面含义以及两个与上下文语境相关的干扰解读。我们用Grice的合作原则对对话进行分析,并标注了每段对话所违反的Gricean maxim。以下是数据集中的一个条目。
![](graph/data_eg.png)

为了避免数据污染,我们将数据放在一个zip文件中,密码为:0135。
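
A minimal loading sketch (assuming the archive uses standard zip encryption readable by Python's ```zipfile``` and that ```Questions.xlsx```, used by the scripts below, sits at the archive root):

```python
# Sketch only: data.zip is assumed to use standard (ZipCrypto) encryption and to
# contain Questions.xlsx at its root; adjust the path/password handling if not.
import zipfile
import pandas as pd

with zipfile.ZipFile("data.zip") as zf:
    zf.extractall(pwd=b"0135")               # password from the note above

questions = pd.read_excel("Questions.xlsx")  # requires openpyxl
print(len(questions), "dialogues loaded")
```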

### Meta information
| |Total|Quantity|Quality|Relevance|Manner|
|----|----|----|-----|----|----|
|Number of questions|200|76|33|71|62|
|Average number of turns per dialogue|6.80|7.84|5.91|6.23|6.35|
|Average dialogue length|158.22|184.53|143.67|147.20|152.79|
|Average utterance length|23.27|23.53|24.31|23.64|24.04|
|Average answer length| 15.08|14.47|14.85|15.82|14.86|

## Model performance
We tested eight open-source and closed-source LLMs on our dataset. Here are the results:
![](graph/by_choice.png)
![](graph/by_maxim.png)
See our [paper](https://arxiv.org/abs/2404.19509) for more details.

## Run the code
**Note:** You need to copy the ```Questions.xlsx``` extracted from ```data.zip``` into the ```eval_chat``` and ```eval_logit``` folders before running the scripts.

### Free-form evaluation paradigms
Folder ```eval_chat/``` contains code to evaluate models' performance on this pragmatic task through their free-form responses. First enter your API keys in ```eval_chat/collect_api.py```. To collect answers, use ```collect_api.py```, ```collect_transformer.py```, or ```collect_transformer_cpu.py```, depending on the model and your device. For example, to evaluate GPT-4, the command is

```python3 question_form.py sequence --all | python3 produce_prompts.py | python3 collect_api.py --model gpt-4 --expl_dir <FOLDER_NAME> --index```

The responses are collected in a CSV file in the target folder. You can then extract the answer choices from the free-form responses with regular expressions or by hand, for example along the lines of the sketch below.
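
A possible post-processing sketch (not part of the repository): the CSV columns (```Index```, ```Response```, ```maxim```) match those written by ```collect_api.py```, while the path and the regular expression are only illustrative guesses.

```python
# Hypothetical sketch: pull the chosen option (A-D) out of each free-form response.
# Assumes the default --expl_dir ("expl") and a gpt-4 run; adjust the path as needed.
import re
import pandas as pd

responses = pd.read_csv("expl/gpt-4/gpt-4_response.csv")

def extract_choice(text):
    # Match an option letter not followed by another Latin letter, e.g. "A", "(B)", "选C。".
    match = re.search(r"([ABCD])(?![A-Za-z])", str(text))
    return match.group(1) if match else None

responses["choice"] = responses["Response"].apply(extract_choice)
print(responses["choice"].value_counts(dropna=False))
```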

### Next word prediction paradigm
**Note:** The code here was written with reference to [jennhu/lm-pragmatics](https://github.com/jennhu/lm-pragmatics).
Scripts in ```eval_logit/``` evaluate models' pragmatic understanding by collecting the probability distribution they predict for the next token after the prompt. Only open-source models are supported in this paradigm. Run ```eval_logit/exp.sh``` to replicate the results in our paper. You can also evaluate other models by changing the values after ```-n``` (a custom name for the model) and ```-m``` (the model's id on Huggingface). See the scripts for the other supported parameters.
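
As a minimal illustration of the paradigm (independent of the actual ```eval_logit/``` scripts; the model id and prompt below are placeholders), one can compare the probabilities a causal LM assigns to the option letters as the next token:

```python
# Sketch only: score the option letters A-D as the next token after the prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"                      # placeholder; use the Huggingface id you want to test
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

prompt = "请从A、B、C、D中选出最合适的解释。答案："   # placeholder multiple-choice prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
probs = torch.softmax(next_token_logits, dim=-1)

# Take the first sub-token of each option letter as its representative token id.
option_ids = [tokenizer.encode(opt, add_special_tokens=False)[0] for opt in "ABCD"]
dist = {opt: probs[tid].item() for opt, tid in zip("ABCD", option_ids)}
print(dist, "->", max(dist, key=dist.get))
```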

**Columns in the result file:**

```condition```: The question type (represented by the Gricean maxim violated in the dialogue).

```choice_annotation```: The types of the four choices before the choice order is randomized (separated by ##).

```choices_aft_rand```: The explanation types after the choice order is randomized (separated by ##).

```correct_aft_rand```: The numeric index (starting from 1) of the correct answer.

```temperature```: Ranges between 0 and 1; controls the degree of randomness in the LLM's answer generation.

```distribution```: The predicted probability distribution over the four answer tokens (A, B, C, and D) following the input prompt.

```model_answer```: The model's answer, i.e., the answer token it assigns the highest probability.
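
A sketch of how these columns can be used to summarize a result file (the path is a placeholder, and it assumes ```model_answer``` is stored in the same format as ```correct_aft_rand```):

```python
# Hypothetical summary of one result file produced under eval_logit/.
import pandas as pd

results = pd.read_csv("results/some_model.csv")   # placeholder path
results["correct"] = results["model_answer"] == results["correct_aft_rand"]
print(results.groupby("condition")["correct"].mean())   # accuracy per violated maxim
```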

## Acknowledgements
We thank Xinjia Qi, Qiyu Sun and Yaqian Zhang for verifying the implicatures and improving the dataset. We also thank all anonymous participants for their support in this study. This project is funded by a Pujiang program grant (22PJC063) awarded to Hai Hu.

## Citation
```
@misc{yue2024large,
  title={Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom},
  author={Shisen Yue and Siyuan Song and Xinyuan Cheng and Hai Hu},
  year={2024},
  eprint={2404.19509},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Binary file added data.zip
Binary file not shown.
132 changes: 132 additions & 0 deletions eval_chat/collect_api.py
@@ -0,0 +1,132 @@
import openai
import json
import pandas as pd
import argparse
import os
from requests.exceptions import Timeout
import google.generativeai as genai
import sys
import csv


def get_collection_responses(prompt_, model, temp):
    """Query a legacy completion model, retrying on timeouts."""
    response = None
    for _ in range(MAX_RETRIES):
        try:
            response = openai.Completion.create(
                model=model,
                prompt=prompt_,
                temperature=temp,
            )
            break
        except Timeout:
            print("Request timed out. Retrying...")
        except Exception as e:
            print(f"An error occurred: {e}")
            break  # Handle other exceptions as needed
    if response is None:
        raise RuntimeError("Failed to get a response after retries.")
    return response['choices'][0]['text']


def get_gemini(message, model, temp):
    """Query a Gemini model, retrying on timeouts."""
    response = None
    for _ in range(MAX_RETRIES):
        try:
            response = model.generate_content(message)
            break
        except Timeout:
            print("Request timed out. Retrying...")
        except Exception as e:
            print(f"An error occurred: {e}")
            break  # Handle other exceptions as needed
    if response is None:
        raise RuntimeError("Failed to get a response after retries.")
    return response.parts


def get_Chat_responses(message, model, temp):
    """Query a chat-completion model, retrying on timeouts."""
    response = None
    for _ in range(MAX_RETRIES):
        try:
            response = openai.ChatCompletion.create(
                model=model,
                messages=message,
                temperature=temp,
            )
            break
        except Timeout:
            print("Request timed out. Retrying...")
        except Exception as e:
            print(f"An error occurred: {e}")
            break
    if response is None:
        raise RuntimeError("Failed to get a response after retries.")
    return response['choices'][0]['message']['content']


if __name__ == "__main__":
    legacy_models = ["text-davinci-003", "babbage-002", "davinci-002", "text-davinci-002", "davinci", "curie", "ada", "babbage"]
    newer_models = ["gpt-4", "gpt-3.5-turbo", 'default-model']

    argparser = argparse.ArgumentParser()
    argparser.add_argument('--style', choices=['verbose', 'concise'], default="concise", help="Choose verbose to also keep responses in txt files.")
    argparser.add_argument('-m', '--model', type=str, default="gpt-4")
    argparser.add_argument('--expl_dir', type=str, default="expl")
    argparser.add_argument('--index', action="store_true", default=False, help="Whether to name the response files with the question indices from the original form.")
    argparser.add_argument('-t', '--temperature', type=float, default=0)

    args = argparser.parse_args()
    model_name = args.model
    expl_dir = args.expl_dir
    temperature = args.temperature

    # Set up the API interface
    if model_name == 'chatglm3':
        # ChatGLM3 is served locally through an OpenAI-compatible endpoint
        openai.api_base = "http://127.0.0.1:8000/v1"
        model_name = "default-model"
        query_func = get_Chat_responses
        model = model_name
    elif model_name == 'gemini_pro':
        genai.configure(api_key="")  # Add your Gemini API key here
        model = genai.GenerativeModel(model_name='gemini-pro')
        query_func = get_gemini
    else:
        api_key = ''  # Add your OpenAI API key here
        openai.api_key = api_key
        model = model_name
        if model_name in newer_models:
            query_func = get_Chat_responses
        else:
            query_func = get_collection_responses

    MAX_RETRIES = 3

    # Create the output directories if they do not exist
    if not os.path.exists(expl_dir):
        os.makedirs(expl_dir)
    if not os.path.exists(f"{expl_dir}/{model_name}"):
        os.makedirs(f"{expl_dir}/{model_name}")

    # Questions are piped in on stdin as a CSV produced by produce_prompts.py
    df = pd.read_csv(sys.stdin)
    inputs = df['Question']
    csv_file = f"{model_name}_response.csv"
    if csv_file not in os.listdir(f"{expl_dir}/{model_name}"):
        with open(f"{expl_dir}/{model_name}/{csv_file}", mode='w', newline='') as file:
            writer = csv.writer(file)
            writer.writerow(['Index', 'Response', 'maxim'])
    for i, input in enumerate(inputs):
        # Build the prompt: chat models get a system message, legacy models a plain prefix
        if model_name in newer_models:
            prompt = [{"role": "system", "content": "你现在是一个中文母语者。"}, {"role": "user", "content": input}]
        else:
            prompt = "你现在是一个中文母语者。" + input
        response = query_func(prompt, model, temperature)
        # Save the response
        if args.index:
            idx = df['Index'][i]
        else:
            idx = i + 1
        if args.style == 'verbose':
            with open(f"{expl_dir}/{model_name}/{model_name}_response_{idx}.txt", "w") as f:
                f.write(f"Question {idx}: {input}\n")
                f.write(f"Response {idx}: {response}\n")
            print(f"Response {idx} saved to response_{idx}.txt and {model_name}_response.csv")
        else:
            print(f"Adding Response {idx} to {model_name}_response.csv")
        with open(f"{expl_dir}/{model_name}/{model_name}_response.csv", mode='a', newline='') as file:
            writer = csv.writer(file)
            writer.writerow([idx, response, df['maxim'][i]])


74 changes: 74 additions & 0 deletions eval_chat/collect_transformer.py
@@ -0,0 +1,74 @@
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import sys
import argparse
import pandas as pd
import os
import csv

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model_id", type=str, default="gpt2")
parser.add_argument("--precision", choices=['half', 'full'], default='half')
parser.add_argument("--cuda", choices=['cuda:1', 'cuda:0', 'cuda'], default='cuda')
parser.add_argument("--expl_dir", type=str, default="expl")
parser.add_argument("--style", choices=['verbose', 'concise'], default="concise", help="Choose verbose to also keep responses in txt files.")
parser.add_argument("--legacy", action="store_true", default=False, help="Whether to use the legacy tokenizer (as reported when using the Chinese Alpaca model).")
parser.add_argument("--index", action="store_true", default=False, help="Whether to name the response files with the question indices from the original form.")
args = parser.parse_args()

if not os.path.exists(args.expl_dir):
    os.mkdir(args.expl_dir)

expl_dir = args.expl_dir
model_id = args.model_id
model_name = model_id.split("/")[-1]
if not os.path.exists(f"{expl_dir}/{model_name}"):
    os.makedirs(f"{expl_dir}/{model_name}")

# Load the model on the requested device(s)
if args.cuda == 'cuda':
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
else:
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").cuda(args.cuda)
if args.precision == 'half':
    model = model.half()
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, legacy=args.legacy)

# Questions are piped in on stdin as a CSV produced by produce_prompts.py
df = pd.read_csv(sys.stdin)
inputs = df['Question']
csv_file = f"{model_name}_response.csv"
if csv_file not in os.listdir(f"{expl_dir}/{model_name}"):
    with open(f"{expl_dir}/{model_name}/{csv_file}", mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['Index', 'Response', 'maxim'])
for i, input in enumerate(inputs):
    # Prepend the system instruction to the question
    prompt = "你现在是一个中文母语者。" + input
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    encoded_prompt = encoded_prompt.to(args.cuda)
    output_sequences = model.generate(
        input_ids=encoded_prompt,
        max_new_tokens=50,  # 300 for text generation, 50 for multiple choice
        temperature=0.9,
        top_k=3,  # 0 for text generation, 3 for multiple choice
        top_p=0.9,
        repetition_penalty=1.0,
        do_sample=True,
        num_return_sequences=1,
    )
    response = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
    # Save the response
    if args.index:
        idx = df['Index'][i]
    else:
        idx = i + 1
    if args.style == 'verbose':
        with open(f"{expl_dir}/{model_name}/{model_name}_response_{idx}.txt", "w") as f:
            f.write(f"Question {idx}: {input}\n")
            f.write(f"Response {idx}: {response}\n")
        print(f"Response {idx} saved to response_{idx}.txt and {model_name}_response.csv")
    else:
        print(f"Adding Response {idx} to {model_name}_response.csv")
    with open(f"{expl_dir}/{model_name}/{model_name}_response.csv", mode='a', newline='') as file:
        writer = csv.writer(file)
        writer.writerow([idx, response, df['maxim'][i]])
69 changes: 69 additions & 0 deletions eval_chat/collect_transformer_cpu.py
@@ -0,0 +1,69 @@
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModel
import torch
import sys
import argparse
import pandas as pd
import os
import csv

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model_id", type=str, default="gpt2")
parser.add_argument("--expl_dir", type=str, default="expl")
parser.add_argument("--style", choices=['verbose', 'concise'], default="concise", help="Choose verbose to also keep responses in txt files.")
parser.add_argument("--legacy", action="store_true", default=False, help="Whether to use the legacy tokenizer (as reported when using the Chinese Alpaca model).")
parser.add_argument("--index", action="store_true", default=False, help="Whether to name the response files with the question indices from the original form.")
parser.add_argument("--max_new_tokens", type=int, default=50, help="The maximum number of tokens to generate.")
args = parser.parse_args()

if not os.path.exists(args.expl_dir):
    os.mkdir(args.expl_dir)

expl_dir = args.expl_dir
model_id = args.model_id
model_name = model_id.split("/")[-1]
if not os.path.exists(f"{expl_dir}/{model_name}"):
    os.makedirs(f"{expl_dir}/{model_name}")

# Load the model on the CPU
model = AutoModelForCausalLM.from_pretrained(model_id).cpu()
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, legacy=args.legacy)

# Questions are piped in on stdin as a CSV produced by produce_prompts.py
df = pd.read_csv(sys.stdin)
inputs = df['Question']
csv_file = f"{model_name}_response.csv"
if csv_file not in os.listdir(f"{expl_dir}/{model_name}"):
    with open(f"{expl_dir}/{model_name}/{csv_file}", mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['Index', 'Response', 'maxim'])
for i, input in enumerate(inputs):
    # Prepend the system instruction to the question
    prompt = "你现在是一个中文母语者。" + input
    encoded_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    encoded_prompt = encoded_prompt.cpu()
    output_sequences = model.generate(
        input_ids=encoded_prompt,
        max_new_tokens=args.max_new_tokens,  # 512 for text generation, 50 for multiple choice
        temperature=0.9,
        top_k=3,  # 0 for text generation, 3 for multiple choice
        top_p=0.9,  # 0.9 for text generation, 0.1 for multiple choice
        repetition_penalty=1.0,
        do_sample=True,
        num_return_sequences=1,
    )
    response = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
    # Save the response
    if args.index:
        idx = df['Index'][i]
    else:
        idx = i + 1
    if args.style == 'verbose':
        with open(f"{expl_dir}/{model_name}/{model_name}_response_{idx}.txt", "w") as f:
            f.write(f"Question {idx}: {input}\n")
            f.write(f"Response {idx}: {response}\n")
        print(f"Response {idx} saved to response_{idx}.txt and {model_name}_response.csv")
    else:
        print(f"Adding Response {idx} to {model_name}_response.csv")
    with open(f"{expl_dir}/{model_name}/{model_name}_response.csv", mode='a', newline='') as file:
        writer = csv.writer(file)
        writer.writerow([idx, response, df['maxim'][i]])