Tested using Python 3.11.0, miniconda 23.1.0, git 2.25.1
- Set up a conda environment called `fairness`:
  ```bash
  conda env create -f environment.yml
  ```
- Set the environment variables `HF_ACCESS_TOKEN` to your Hugging Face API token and `OPENAI_API_KEY` to your OpenAI API key
- Optional: set `TRANSFORMERS_CACHE` to your lab's shared transformers cache, especially on HPC environments! (A bash example covering all three follows this list.)
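A minimal bash sketch of these variables; the values below are placeholders you must replace with your own:

```bash
# Placeholders only -- substitute your real tokens and cache path.
export HF_ACCESS_TOKEN="hf_..."                   # Hugging Face API token
export OPENAI_API_KEY="sk-..."                    # OpenAI API key
export TRANSFORMERS_CACHE="/shared/lab/hf_cache"  # optional, shared cache on HPC
```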
We use three datasets:
- Bias in Bios
- HateXplain
- TwitterAAE
To set up Bias in Bios, run these commands:
```bash
wget https://storage.googleapis.com/ai2i/nullspace/biasbios/train.pickle -P path/to/data/folder/
wget https://storage.googleapis.com/ai2i/nullspace/biasbios/dev.pickle -P path/to/data/folder/
wget https://storage.googleapis.com/ai2i/nullspace/biasbios/test.pickle -P path/to/data/folder/
```
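To confirm the download worked, you can peek at a split. This sketch assumes only that each file is a standard Python pickle; the exact record fields are whatever the pickles contain:

```python
# Sanity check: load one Bias in Bios split and inspect it.
import pickle

with open("path/to/data/folder/train.pickle", "rb") as f:
    train = pickle.load(f)

print(type(train), len(train))  # e.g. a list of examples
print(train[0])                 # inspect the fields of one record
```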
To set up HateXplain, run this command:
```bash
git clone https://github.com/hate-alert/HateXplain.git
```
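For a quick look at what you cloned, the snippet below assumes the upstream repo keeps its main annotation file at `Data/dataset.json` (keyed by post id); adjust the path if the repo layout has changed:

```python
# Peek at the HateXplain annotations.
import json

with open("HateXplain/Data/dataset.json") as f:
    posts = json.load(f)

print(len(posts))                    # number of annotated posts
post_id, post = next(iter(posts.items()))
print(post_id, sorted(post.keys()))  # fields available per post
```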
There are a few steps to set up TwitterAAE. We follow the steps from the `demog-text-removal` repository and reproduce them here for your convenience:
- Download TwitterAAE:
  ```bash
  wget http://slanglab.cs.umass.edu/TwitterAAE/TwitterAAE-full-v1.zip
  ```
- Clone `demog-text-removal` to prepare the data:
  ```bash
  git clone https://github.com/yanaiela/demog-text-removal.git
  ```
- Set up the environment for `demog-text-removal` (requires Python 2.7):
  ```bash
  conda create -n adv-demog-text python==2.7 anaconda
  source activate adv-demog-text
  pip install -r requirements.txt
  ```
- Run `make_data.py` (found in `demog-text-removal/src/data`) with the `adv-demog-text` environment activated:
  ```bash
  python make_data.py /path/to/downloaded/twitteraae_all /path/to/project/data/processed/sentiment_race sentiment race
  ```
We use a TOML config (WIP) to run the main function. You can take a look at the provided example config to get a feel for how to use it; a rough sketch of the shape follows.
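As a loose illustration only: every key below is hypothetical, and the provided example config is the source of truth.

```toml
# Hypothetical sketch -- the real keys are defined by the example config
# shipped with the repo, not by this README.
[model]
name = "gpt2"                 # hypothetical: which model to run

[data]
dataset = "biasbios"          # hypothetical: one of the three datasets above
path = "/path/to/data/folder"
```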
To run this program, activate the `fairness` conda environment and run:
```bash
python -m src --config /path/to/config.toml
```
We include tests for sanity checking. To run them:
```bash
python -m pytest
```
LLaMA and Alpaca models have been erroring out recently, but we are still going to experiment with them. To fit them into the API, I have modified the `hfoffline.py` file to compensate for their quirks.
I have successfully:
- Integrated them into the API
- Loaded their models
- Loaded them onto GPUs using a device map (don't change it from `balanced_low_0`; it is good for generation)
- Resized the model embeddings to fit the tokenizer length
- Loaded their tokenizer
- Set their tokenizer padding tokens accordingly
- Set the generation parameters (see the loading sketch after this list)
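For orientation, here is a hedged sketch of those loading steps using the Hugging Face `transformers` API; `hfoffline.py` is the actual implementation, and the model path is a placeholder:

```python
# Sketch of the LLaMA/Alpaca loading steps listed above; hfoffline.py is the
# source of truth. device_map requires the `accelerate` package.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/path/to/llama"  # placeholder: local checkpoint or hub id

tokenizer = AutoTokenizer.from_pretrained(model_name)
# LLaMA-style tokenizers ship without a padding token, so set one explicitly.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,    # one of the OOM mitigations tried below
    device_map="balanced_low_0",  # leaves GPU 0 headroom for generation
)
# Keep the embedding matrix in sync with the tokenizer's vocabulary size.
model.resize_token_embeddings(len(tokenizer))
```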
I have not successfully generated anything yet because of CUDA OOM and "CUBLAS not initialized" errors, without resorting to more GPUs :( What I have tried so far:
- Converting the models to float16 to fit on the GPUs
- Lowering `batch_size`
- Setting `CUDA_LAUNCH_BLOCKING=0`
I have not tried:
- Being greedy with GPUs :)