This project focuses on understanding sycophantic behavior in LLMs.
It builds upon and leverages Nina Rimsky's work on sycophancy (see Nina Rimsky's blog post, linked in the references below).
This repository contains the code for a series of experiments designed to better understand sycophancy in LLMs. The specific models under focus are listed below:
- Llama2-7B
- Mistral-7B
- MPT-7B
This repository also contains the Jupyter notebook Graph_llama2_model_geometry_sycophancy_v4.ipynb, which is part of a larger effort to analyze sycophantic behavior by applying geometric modeling to Llama2 model outputs. The notebook includes the definitions used to distinguish sycophantic from non-sycophantic responses, a class for managing the model input, and the evaluation and graphing code used to visualize the results.
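The notebook's input-management class is not reproduced here; as a rough illustration only, a prompt-pair container might look like the following (the class name and fields are hypothetical, not the notebook's actual code):

```python
# Hypothetical sketch of a sycophantic / non-sycophantic prompt-pair container.
from dataclasses import dataclass

@dataclass
class PromptPair:
    """A matched pair of responses to the same question."""
    question: str
    sycophantic_answer: str
    non_sycophantic_answer: str

    def as_prompts(self, template: str = "{q}\nAnswer: {a}") -> tuple:
        """Render both variants with the same template so only the answer differs."""
        return (
            template.format(q=self.question, a=self.sycophantic_answer),
            template.format(q=self.question, a=self.non_sycophantic_answer),
        )
```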
To use the code from this repository, clone it to your local machine and install the required dependencies, which are listed in the requirements.txt file.
git clone https://github.com/jpriivera44/LLM_Sycophancy.git
cd LLM_Sycophancy
pip install -r requirements.txt
After installation, the notebook can be run in a Jupyter environment. It walks through the data preprocessing steps, model evaluation, and result visualization.
To open the notebook with the most recent experiments and refactored code, use:
jupyter notebook Main_notebook.ipynb
The experiments conducted so far have mainly focused on establishing a highly repeatable baseline for activating sycophancy and on visualizing those activations across input datasets from Anthropic [1]. For each dataset, I perform a forward pass through the model and record the resulting activations for analysis.
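As a rough sketch of what that forward pass and activation capture look like, assuming the HuggingFace transformers checkpoint of Llama2-7B (the model name, hook placement, and last-token selection are assumptions, not the repository's exact code):

```python
# Minimal sketch: run a forward pass and record each decoder layer's hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

activations = {}  # layer index -> hidden state of the last token

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Decoder layers return a tuple; the first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        activations[layer_idx] = hidden[:, -1, :].detach().cpu()
    return hook

handles = [layer.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.model.layers)]

prompt = "Human: I think the earth is flat. Do you agree?\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    model(**inputs)

for h in handles:
    h.remove()
```

Running this once per sycophantic and non-sycophantic prompt yields the paired activations used in the analyses below.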
Activations of the sycophancy vector vs. random noise:
Image 1:
Caption: Results from running activations on Mistral-7B to see how behavior was affected when also incorporating random noise with the same mean and variance as the sycophancy vector. Above is a visualization of those activations.
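The noise-matched control can be set up along these lines (a minimal sketch; `sycophancy_vector`, the layer index, and the hook mechanics are illustrative assumptions, not the actual steering code):

```python
# Sketch: compare steering with the sycophancy vector against mean/variance-matched noise.
import torch

def matched_noise(steering_vector: torch.Tensor) -> torch.Tensor:
    """Gaussian noise with the same mean and standard deviation as the steering vector."""
    return torch.randn_like(steering_vector) * steering_vector.std() + steering_vector.mean()

def add_vector_hook(vector: torch.Tensor, scale: float = 1.0):
    """Forward hook that adds `scale * vector` to a decoder layer's hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector.to(device=hidden.device, dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Illustrative usage: steer one run with the sycophancy vector, another with matched noise.
# handle = model.model.layers[13].register_forward_hook(add_vector_hook(sycophancy_vector))
# ... generate, then handle.remove(), then repeat with matched_noise(sycophancy_vector) ...
```

Matching the mean and variance isolates the direction of the sycophancy vector as the variable of interest, rather than the sheer magnitude of the perturbation.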
Geometry of Sycophancy:
For the experiments below, the input datasets consist of sentence-long indications of sycophancy rather than single-word answers.
Image 2:
Caption: Above are the results from an experiment where I measured how well linear probes can distinguish sycophancy; however, the results did not make sense. The dataset used in this run only measured sycophancy as a single-word answer, and as you can see above there is no separation, because a single-word agreement or disagreement with a text input does not carry enough signal.
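For reference, a minimal linear-probe setup on per-layer activations might look like this (scikit-learn logistic regression; the arrays here are placeholders for activations collected as in the forward-pass sketch above):

```python
# Train a linear probe to classify sycophantic vs. non-sycophantic activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.random.randn(200, 4096)    # placeholder: (n_examples, hidden_dim) activations at one layer
y = np.random.randint(0, 2, 200)  # placeholder: 1 = sycophantic, 0 = non-sycophantic

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```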
Image 3:
Caption: After using PCA, I ran the same dataset through t-SNE to see whether a plane of separation exists; instead, the result is a mix of clusters.
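A sketch of the PCA-then-t-SNE projection (the component count and perplexity are assumptions, not necessarily the notebook's exact settings):

```python
# Project high-dimensional activations to 2D for visualization.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.randn(200, 4096)    # placeholder for collected activations
y = np.random.randint(0, 2, 200)  # placeholder sycophancy labels

X_reduced = PCA(n_components=50).fit_transform(X)                 # compress before t-SNE
X_embedded = TSNE(n_components=2, perplexity=30).fit_transform(X_reduced)

plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, cmap="coolwarm", s=10)
plt.title("t-SNE of activations: sycophantic vs. non-sycophantic")
plt.show()
```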
MLP Activation exploration:
Image 4:
Caption: The above image captures an experiment measuring the activations of the MLPs across all layers of the network for Llama2. The hypothesis is that if sycophancy exists in the network in a measurable way, it might not activate all of the MLP layers equally. The plot indicates that no discernible pattern emerges, so either the input dataset needs to change, or the network may be too small in terms of parameter count for a difference to show up.
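A sketch of how the per-layer MLP activations can be captured, assuming the HuggingFace Llama2 module layout (`model.model.layers[i].mlp`) and reusing the model and tokenizer loaded in the earlier sketch; the summary statistic (mean absolute activation) is one reasonable choice, not necessarily the notebook's:

```python
# Record a per-layer MLP activation statistic for a single prompt.
import torch

mlp_norms = {}  # layer index -> mean absolute MLP output

def mlp_hook(layer_idx):
    def hook(module, inputs, output):
        mlp_norms[layer_idx] = output.detach().abs().mean().item()
    return hook

handles = [layer.mlp.register_forward_hook(mlp_hook(i))
           for i, layer in enumerate(model.model.layers)]

with torch.no_grad():
    model(**tokenizer("I think the earth is flat. Do you agree?", return_tensors="pt"))

for h in handles:
    h.remove()

# Compare mlp_norms between sycophantic and non-sycophantic prompts, layer by layer.
```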
The llm_prompt_eval.py script is designed to evaluate Large Language Models (LLMs) by prompting them with specific evaluation data. This tool is essential for analyzing and understanding the responses of LLMs in various contexts.
- Python 3.x
- openai Python package
- requests Python package
- An API key for OpenAI (if using OpenAI's models)
- Ensure you have Python 3.x installed on your system.
- Install the required Python packages using the command:
pip install openai requests
- Set your OpenAI API key as an environment variable:
export OPENAI_API_KEY='your-api-key'
- Navigate to the directory containing llm_prompt_eval.py.
- Run the script using Python:
python llm_prompt_eval.py
- The script will prompt for the necessary evaluation data and context.
- Observe the output, which includes the LLM's responses to the evaluation prompts.
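The script's internals are not reproduced here; the kind of API call it wraps looks roughly like the following (the model name and message format are assumptions, and the script may use a different or older OpenAI client interface):

```python
# Roughly the kind of call the evaluation script wraps (assumed, not the script's actual code);
# requires OPENAI_API_KEY to be set in the environment.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically

def evaluate_prompt(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Send a single evaluation prompt and return the model's reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(evaluate_prompt("I think the earth is flat. Do you agree?"))
```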
After running the forward pass on the Anthropic dataset of 3,000 sycophantic vs. non-sycophantic pairs, I plotted the activations from all the layers with t-SNE. The results above show the separation between sycophantic and non-sycophantic responses. The caveat is that these clusters might not be truly representative of sycophancy, and in particular they are not linearly separable.
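One way to quantify the "not linearly separable" observation is to sweep a linear classifier across layers and compare its cross-validated accuracy to chance (a sketch with placeholder arrays; the notebook's actual analysis may differ):

```python
# Per-layer linear separability check for sycophantic vs. non-sycophantic activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder: layer index -> (n_examples, hidden_dim) activation matrix
layer_activations = {i: np.random.randn(300, 4096) for i in range(32)}
labels = np.random.randint(0, 2, 300)   # 1 = sycophantic, 0 = non-sycophantic

for layer, X_layer in layer_activations.items():
    acc = cross_val_score(LogisticRegression(max_iter=1000), X_layer, labels, cv=5).mean()
    print(f"layer {layer:2d}: 5-fold linear accuracy = {acc:.3f}")
```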
References:
[1] Anthropic Model Written Evals Dataset
[2] Nina Rimsky's blog post
Contributing: I welcome contributions from the community. If you wish to contribute to this project, please follow the guidelines in the CONTRIBUTING.md file.
License: This project is licensed under the MIT License; see the LICENSE.md file for details.