Qiming Xie, Zengzhi Wang, Yi Feng, Rui Xia
Nanjing University of Science and Technology, China
📄 [Paper] 🖥️ [Homepage on PaperWithCode]
- Overview
- FOLLOW-UP QUESTIONING MECHANISM
- Evaluation
- Further Studies
- Mitigation Method Exploration
- Examples 🌰
- Any Question?
- Citation
❗️ With the emergence of generative conversational large language models (LLMs) like ChatGPT, serving as virtual assistants in various fields, the stability and reliability of their responses have become crucial. However, during usage, it has been observed that these models tend to waver in their judgments when confronted with follow-up questions from users expressing skepticism or disagreement. 🌰 Like these examples 🌰
🪛 In this work, we draw inspiration from questioning strategies in education and propose a FOLLOW-UP QUESTIONING MECHANISM along with two evaluation metrics to assess the judgment consistency of LLMs before and after exposure to disturbances. We evaluate the judgment consistency of ChatGPT, PaLM2-Bison, and Vicuna-13B under this mechanism across eight reasoning benchmarks. Empirical results show that even when the initial answers are correct, judgment consistency sharply decreases when LLMs face disturbances such as questioning, negation, or misleading.
📊 Additionally, we study these models’ judgment consistency under various settings (sampling temperature and prompts) to validate this issue further, observing the impact of prompt tone and conducting an in-depth error analysis for deeper behavioral insights. Furthermore, we also explore several prompting methods to mitigate this issue and demonstrate their effectiveness.
🗒 NOTE: We define judgment consistency as the consistency of the model’s final answers when handling objective questions with definitive answers.
To evaluate this consistency of large language models, we design a FOLLOW-UP QUESTIONING MECHANISM. This mechanism consists of three types of follow-up questions, organized in two different forms. After the model initially answers correctly, we continue dialogues to question, negate, or mislead it, then observe any judgment changes.
The prompts we used in the experiment. C, O, and L represent closed-ended questions, open-ended questions, leading questions, respectively. {M_A} denotes the misleading answers.
We employ two metrics to assess the judgment consistency of LLMs after the execution of the mechanism.
- Modification (M.) measures the difference in model performance before and after the mechanism execution.
- Modification Rate (M. Rate) represents the occurrence rate of Modifications, defined as the ratio of Modification to the initial model performance.
- Models
- ChatGPT (gpt-3.5-turbo-0301) with temperature at 0.5.
- PaLM2-Bison (chat-bison-001) with temperature at 0.4.
- Vicuna-13b (Vicuna-13B-v1.3) with temperature at 0.7.
- Benchmarks
- Arithmetic Reasoning
- GSM8K
- SVAMP
- MultiArith
- Commonsense Reasoning
- CSQA
- StrategyQA
- Symbolic Reasoning
- Last Letter Concatenation
- Coin Flip
- Knowledge Reasoning
- MMLU
- Arithmetic Reasoning
The results of ChatGPT in Direct Form.
The results of ChatGPT in Progressive Form.
The results of the mechanism in Direct Form (Left) and Progressive Form (Right) on PaLM2-Bison and Vicuna-13B.
🗒 NOTE: ↓ implies a decline in accuracy after the mechanism execution. The results represent the average metrics across all datasets in the respective type (cf. Benchmarks). Bold denotes the poorest judgment consistency.
Intuitively, the lower the sampling temperature, the more deterministic the generated outputs, whereas higher temperature lead to more diverse outputs. Given that, does this judgment consistency issue still exist when the temperature is 0?
To investigate this, we evaluate the model’s judgment consistency under the mechanism at the temperature of 0, utilizing representative datasets: StrategyQA, CoinFlip and MultiArith, and employ closed-ended, open-ended, and leading questions to disturb the model, respectively (due to their demonstrated lowest judgment consistency).
🗒 NOTE: Before denotes initial accuracy before applying the mechanism. Bold denotes the poorest judgment consistency.
Do the models waver in their judgments under other prompts as well? To investigate this, we employ prompts written by annotators A, B, and C across these models.
The impact of different prompts on Modification (Direct Form).
Considering the practical educational scenario, when students face questioning, denial, or misinformation, their judgments often experience a significant impact from the teacher’s tone intensity of speech. Therefore, we explore the influence of using different prompts on the model’s judgment consistency from the perspective of tone intensity. Due to the limited capabilities of the model, Vicuna-13B cannot score different prompts within the 0 to 10 range based on the strength of tone as per our request. In addition, compared to the other two models, Vicuna-13B shows relatively small fluctuations in judgment consistency when different prompts are used. Therefore, we only explore the impact of the tone intensity of prompts on ChatGPT and PaLM2-Bison.
Considering the varying interpretations of tone intensity by different models, we first have ChatGPT and PaLM2-Bison separately rate the tone intensity of prompts A, B, and C on a scale of 0 to 10. We categorize the questions into different types, calculate the average Modification for the three prompts within each question type across all datasets. The models’ tone intensity scores for the three prompts (cf. The Impact of Different Prompts) were taken as reference points.
Using ChatGPT’s judgment consistency as the reference, we analyze error examples in StrategyQA, CoinFlip, and MultiArith, employing closed-ended, open-ended and leading questions to mislead the model. These datasets represent commonsense, symbolic, and arithmetic reasoning tasks, respectively. Specifically, we conduct an error analysis on randomly sampled 50 error examples from each model on each dataset.
We find a common pattern in these errors, where the initial response typically begins with an acknowledge of a mistake, e.g., “I apologize for my mistake.”. Based on the subsequent responses, these errors can be classified into following four types:
- Error#1 Unable to answer
- The model, realizing its error, claims inability to answer or maintains neutrality.
- Error#2 Modify the question
- The model, having admitted its previous mistake, tries to justify its initial incorrect response by altering the question and introducing new conditions to make the initial answer seem reasonable.
- Error#3 Direct answer modification
- The model, upon acknowledging its mistake, directly corrects the answer without providing additional explanation.
- Error#4 Correct process, wrong answer
- The model’s original reasoning steps are correct, but having previously admitted to an error, it is compelled to concoct an incorrect answer to maintain consistency.
Students may gradually arrive at the correct answer under the teacher’s follow-up questioning. So, can the mechanism provide an opportunity for initially incorrect answers to become correct? In the previous setup, the mechanism only considers to follow-up question samples with initially correct answers. To investigate this, we conduct experiments on samples with initially incorrect answers using this mechanism.
Essentially, we believe that this issue originates from the misalignment between the model’s response generation process when facing disturbances and the thinking process of humans under similar disturbances. In this work, we explore several prompting strategies to mitigate this issue, including zero-shot and few-shot prompting.
- Zero-shot prompting
- Zero-shot-CoT: Let’s think step by step.
- EmotionPrompt: This is very important to my career.
- Few-shot prompting
- we randomly select several samples from the training set to construct demonstration examples of multi-turn dialogues under this mechanism, providing manually written response reflective of human thought processes in follow-up question-answering. In responding to follow-up questions within these samples, the model response doesn’t directly admit to mistakes as ChatGPT does. Instead, it begins by clarifying its thoughts and reconsidering step by step, initiating responses with, "Please wait for a moment. In order to answer your question, I need to take a moment to reconsider. I will now clear my mind of distractions and approach this step by step."
Here are examples of ChatGPT, Bard, Vicuna-13b, and some other Chinese large language models.
If you find this work helpful, please cite our paper as follows:
@inproceedings{xie-etal-2024-ask,
title = "Ask Again, Then Fail: Large Language Models{'} Vacillations in Judgment",
author = "Xie, Qiming and
Wang, Zengzhi and
Feng, Yi and
Xia, Rui",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.577",
pages = "10709--10745",
}
If you have any questions related to this work, you can open an issue with details or feel free to email Qiming(qmxie@njust.edu.cn
), Zengzhi(zzwang@njust.edu.cn
).