Instruction Following

This directory contains code and instructions for using off-the-shelf small/weak models to guide the decoding of large/strong models to better follow human instructions.

Supported Models

The small/weak model pairs we currently support (listed in the order tuned, untuned):

To add customized model pairs, see the get_scorer function in scripts/instruction_following/utils/utils.py.
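
While the exact get_scorer interface is defined in the repository, the presumed idea behind a tuned/untuned pair is to score a candidate continuation by how much more likely the tuned small model finds it than the untuned one. The snippet below is only a hedged, standalone sketch of that scoring rule using the Zephyr pair, not the repository's implementation:

# Sketch only (not the repository's get_scorer): score a candidate continuation
# by the log-likelihood gap between a tuned and an untuned small model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TUNED, UNTUNED = "HuggingFaceH4/zephyr-7b-beta", "HuggingFaceH4/mistral-7b-sft-beta"
tokenizer = AutoTokenizer.from_pretrained(TUNED)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.bfloat16, device_map="auto")
untuned = AutoModelForCausalLM.from_pretrained(UNTUNED, torch_dtype=torch.bfloat16, device_map="auto")

@torch.no_grad()
def log_likelihood(model, prompt_ids, cont_ids):
    """Sum of log-probs the model assigns to cont_ids given prompt_ids."""
    input_ids = torch.cat([prompt_ids, cont_ids], dim=-1).to(model.device)
    logprobs = torch.log_softmax(model(input_ids).logits[:, :-1].float(), dim=-1)
    cont_logprobs = logprobs[:, prompt_ids.shape[-1] - 1 :]  # positions that predict the continuation
    return cont_logprobs.gather(-1, cont_ids.to(model.device).unsqueeze(-1)).sum().item()

def score(prompt: str, continuation: str) -> float:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, add_special_tokens=False, return_tensors="pt").input_ids
    return log_likelihood(tuned, prompt_ids, cont_ids) - log_likelihood(untuned, prompt_ids, cont_ids)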

The large/strong base models we currently support:

To add a customized base model, modify the get_chat_prompt_template function in scripts/instruction_following/utils/utils.py to provide its chat template.
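
The exact format get_chat_prompt_template should return is defined in the repository; as a rough standalone illustration, the chat formatting for most base models can be read off the model's tokenizer (the "{prompt}" placeholder below is purely illustrative):

# Illustration only: obtain a chat-formatted prompt string for a base model
# from its tokenizer's built-in chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
template = tokenizer.apply_chat_template(
    [{"role": "user", "content": "{prompt}"}],  # placeholder to fill in per example
    tokenize=False,
    add_generation_prompt=True,
)
print(template)  # the model's special tokens wrapped around "{prompt}"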

Guided Generation

We demonstrate how to test guided models on AlpacaEval.

Here are examples of steering meta-llama/Llama-2-13b-chat-hf under Zephyr guidance (HuggingFaceH4/zephyr-7b-beta as the tuned model, HuggingFaceH4/mistral-7b-sft-beta as the untuned model) on a subset of AlpacaEval prompts (1/32 of the 805 prompts):

  • CBS (Chunk-level Beam Search) with W, K, L = 2, 2, 30:

    PYTHONPATH=. python scripts/instruction_following/cbs.py \
        --rank=1 --world_size=32 \
        --gen.w=2 --gen.k=2 --gen.l=30 \
        --model_name="meta-llama/Llama-2-13b-chat-hf" \
        --scorer_name="HuggingFaceH4/zephyr-7b-beta" \
        --output_dir="output/instruction_following/cbs/w2k2l30/gen"
  • BoN (Best-of-N Sampling) with N = 4 or N = 8:

    PYTHONPATH=. python scripts/instruction_following/cbs.py \
        --rank=1 --world_size=32 \
        --gen.w=1 --gen.k=4 --gen.l=None \
        --model_name="meta-llama/Llama-2-13b-chat-hf" \
        --scorer_name="HuggingFaceH4/zephyr-7b-beta" \
        --output_dir="output/instruction_following/bon/n4/gen"
    
    PYTHONPATH=. python scripts/instruction_following/cbs.py \
        --rank=1 --world_size=32 \
        --gen.w=1 --gen.k=8 --gen.l=None \
        --model_name="meta-llama/Llama-2-13b-chat-hf" \
        --scorer_name="HuggingFaceH4/zephyr-7b-beta" \
        --output_dir="output/instruction_following/bon/n8/gen"
  • EFT (Emulated Fine-Tuning):

    PYTHONPATH=. python scripts/instruction_following/eft.py \
        --rank=1 --world_size=32 \
        --gen.beta=1.0 \
        --model_name="meta-llama/Llama-2-13b-chat-hf" \
        --scorer_name="HuggingFaceH4/zephyr-7b-beta" \
        --output_dir="output/instruction_following/eft/beta1.0/gen"

    Note that EFT is only applicable when all models share the same vocabulary; the sketch after this list illustrates why.

  • Base w/o guidance:

    PYTHONPATH=. python scripts/instruction_following/cbs.py \
        --rank=1 --world_size=32 \
        --gen.w=1 --gen.k=1 --gen.l=None \
        --model_name="meta-llama/Llama-2-13b-chat-hf" \
        --output_dir="output/instruction_following/base/gen"

(Optional) Repeat over all ranks for complete generation results; a subset is usually enough for a sanity check:

for i in $(seq 1 32); do
    PYTHONPATH=. python scripts/instruction_following/cbs.py \
        --rank=${i} --world_size=32 \
        --gen.w=2 --gen.k=2 --gen.l=30 \
        --model_name="meta-llama/Llama-2-13b-chat-hf" \
        --scorer_name="HuggingFaceH4/zephyr-7b-beta" \
        --output_dir="output/instruction_following/cbs/w2k2l30/gen"
done
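
The rank/world_size flags shard the 805 AlpacaEval prompts across 32 independent runs, so each rank only generates for its own slice. The slicing below is an assumption for illustration; the scripts' exact partitioning may differ:

# Hypothetical illustration of rank-based sharding of the AlpacaEval prompts.
prompts = list(range(805))                 # stand-in for the 805 AlpacaEval prompts
rank, world_size = 1, 32                   # matches --rank=1 --world_size=32 above
shard = prompts[rank - 1 :: world_size]    # ~805/32 ≈ 25 prompts handled by this rank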

Evaluation

There are three ways to automatically evaluate the generated responses: (1) GPT-4 (the AlpacaEval default), (2) openbmb/UltraRM-13b, and (3) Nexusflow/Starling-RM-34B:

  • GPT-4:

    PYTHONPATH=. python scripts/instruction_following/eval.py \
        --evaluator_name="GPT-4" \
        --generation_dir="output/instruction_following/cbs/w2k2l30/gen" \
        --evaluation_dir="output/instruction_following/cbs/w2k2l30/eval"
    
    OPENAI_API_KEY="..." alpaca_eval --model_outputs "output/instruction_following/cbs/w2k2l30/eval/GPT-4/model_outputs.json"
  • openbmb/UltraRM-13b and Nexusflow/Starling-RM-34B:

    for evaluator_name in "openbmb/UltraRM-13b" "Nexusflow/Starling-RM-34B"; do
        PYTHONPATH=. python scripts/instruction_following/eval.py \
            --evaluator_name=${evaluator_name} \
            --generation_dir="output/instruction_following/cbs/w2k2l30/gen" \
            --evaluation_dir="output/instruction_following/cbs/w2k2l30/eval"
    done

FAQ

To decode models saved locally.

If your models are not saved in the default cache directory (e.g., ~/.cache/huggingface), modify scripts/configs/local_model_path.yaml to map each model name to its local path. For example:

meta-llama/Meta-Llama-3-8B-Instruct: ~/models/Meta-Llama-3-8B-Instruct
meta-llama/Meta-Llama-3-70B-Instruct: ~/models/Meta-Llama-3-70B-Instruct
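
How such a mapping is consumed is repository-specific; conceptually it is just a name-to-path lookup with a fallback to the hub name, along the lines of this hedged sketch:

# Illustration only: resolve a model name through a local-path mapping like the YAML above.
import os
import yaml  # PyYAML

with open("scripts/configs/local_model_path.yaml") as f:
    local_paths = yaml.safe_load(f) or {}

name = "meta-llama/Meta-Llama-3-8B-Instruct"
model_path = os.path.expanduser(local_paths.get(name, name))  # fall back to the hub name
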
Out of GPU memory.

To run inference with a large (70B) model that does not fit on a single GPU, run the code as is across multiple GPUs or enable 4-bit quantization. For example:

# Infer on a single GPU
CUDA_VISIBLE_DEVICES=0 python ...

# Infer on a single GPU with 4-bit quantization
CUDA_VISIBLE_DEVICES=0 python ... --load_in_4bit=True

# Infer on four GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 python ...

# Infer on four GPUs with 4-bit quantization
CUDA_VISIBLE_DEVICES=0,1,2,3 python ... --load_in_4bit=True
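
For reference, the --load_in_4bit flag is assumed to correspond to standard bitsandbytes 4-bit loading in transformers, which outside these scripts looks roughly like:

# Roughly equivalent 4-bit loading with transformers + bitsandbytes
# (assumed behavior of the scripts' --load_in_4bit flag).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    quantization_config=quant_config,
    device_map="auto",  # spread layers across the visible GPUs
)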