This directory contains code and instructions for using off-the-shelf small/weak models to guide the decoding of large/strong models to better follow human instructions.
The small/weak model pairs we currently support (each listed as tuned, untuned):
- Zephyr guidance: `HuggingFaceH4/zephyr-7b-beta`, `HuggingFaceH4/mistral-7b-sft-beta`
- Starling guidance: `berkeley-nest/Starling-LM-7B-alpha`, `openchat/openchat_3.5`
- Tulu guidance: `allenai/tulu-2-dpo-7b`, `allenai/tulu-2-7b`
To add customized model pairs, see the `get_scorer` function in `scripts/instruction_following/utils/utils.py`.
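As a rough illustration of what such a scorer provides (a hedged sketch only; the function name, signature, and scoring formula below are assumptions, not the repository's actual `get_scorer` implementation), a tuned/untuned pair can score a candidate response by the gap between their log-likelihoods:

```python
# Hypothetical sketch: score a candidate response with a tuned/untuned weak-model pair.
# Assumption: the guidance signal is log p_tuned(response | prompt) - log p_untuned(response | prompt).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def make_pair_scorer(tuned_name: str, untuned_name: str, device: str = "cuda"):
    tok = AutoTokenizer.from_pretrained(tuned_name)
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name, torch_dtype=torch.bfloat16).to(device)
    untuned = AutoModelForCausalLM.from_pretrained(untuned_name, torch_dtype=torch.bfloat16).to(device)

    @torch.no_grad()
    def log_likelihood(model, prompt: str, response: str) -> float:
        prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tok(prompt + response, return_tensors="pt").input_ids.to(device)
        logits = model(full_ids).logits[:, :-1]              # logits predicting tokens 1..T-1
        targets = full_ids[:, 1:]
        logp = torch.log_softmax(logits.float(), dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        return logp[:, prompt_len - 1:].sum().item()         # sum only over response tokens

    def score(prompt: str, response: str) -> float:
        return log_likelihood(tuned, prompt, response) - log_likelihood(untuned, prompt, response)

    return score
```

Registering a new guidance option would then amount to mapping a new scorer name to its tuned/untuned checkpoints inside `get_scorer`.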
The large/strong base models we currently support:
- Llama-2 series: `meta-llama/Llama-2-7b-chat-hf`, `meta-llama/Llama-2-13b-chat-hf`, `meta-llama/Llama-2-70b-chat-hf`
- Llama-3 series: `meta-llama/Meta-Llama-3-8B-Instruct`, `meta-llama/Meta-Llama-3-70B-Instruct`
To add a customized base model, modify the `get_chat_prompt_template` function in `scripts/instruction_following/utils/utils.py` to provide its chat template.
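For example (a hedged sketch; the actual signature of `get_chat_prompt_template` in this repository may differ), a new base model can often reuse the chat template shipped with its tokenizer:

```python
# Hypothetical sketch: build a chat-formatted prompt for a new base model.
# Assumes the helper maps (model name, user instruction) -> prompt string; the real
# signature in scripts/instruction_following/utils/utils.py may differ.
from transformers import AutoTokenizer

def get_chat_prompt(model_name: str, instruction: str) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    messages = [{"role": "user", "content": instruction}]
    # Most recent chat models ship a Jinja chat template with their tokenizer.
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```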
We demonstrate how to test guided models on AlpacaEval.
Here are examples of steering `meta-llama/Llama-2-13b-chat-hf` under Zephyr guidance (`HuggingFaceH4/zephyr-7b-beta`, `HuggingFaceH4/mistral-7b-sft-beta`) on a subset of prompts from AlpacaEval (1/32 of the 805 prompts):
- CBS (Chunk-level Beam Search) with W, K, L = 2, 2, 30 (see the sketch after this list for what W, K, and L control):

  ```
  PYTHONPATH=. python scripts/instruction_following/cbs.py \
      --rank=1 --world_size=32 \
      --gen.w=2 --gen.k=2 --gen.l=30 \
      --model_name="meta-llama/Llama-2-13b-chat-hf" \
      --scorer_name="HuggingFaceH4/zephyr-7b-beta" \
      --output_dir="output/instruction_following/cbs/w2k2l30/gen"
  ```
- BoN (Best-of-N Sampling) with N = 4 or N = 8 (run as CBS with W = 1, K = N, and no chunk limit):

  ```
  PYTHONPATH=. python scripts/instruction_following/cbs.py \
      --rank=1 --world_size=32 \
      --gen.w=1 --gen.k=4 --gen.l=None \
      --model_name="meta-llama/Llama-2-13b-chat-hf" \
      --scorer_name="HuggingFaceH4/zephyr-7b-beta" \
      --output_dir="output/instruction_following/bon/n4/gen"

  PYTHONPATH=. python scripts/instruction_following/cbs.py \
      --rank=1 --world_size=32 \
      --gen.w=1 --gen.k=8 --gen.l=None \
      --model_name="meta-llama/Llama-2-13b-chat-hf" \
      --scorer_name="HuggingFaceH4/zephyr-7b-beta" \
      --output_dir="output/instruction_following/bon/n8/gen"
  ```
- EFT (Emulated Fine-Tuning):

  ```
  PYTHONPATH=. python scripts/instruction_following/eft.py \
      --rank=1 --world_size=32 \
      --gen.beta=1.0 \
      --model_name="meta-llama/Llama-2-13b-chat-hf" \
      --scorer_name="HuggingFaceH4/zephyr-7b-beta" \
      --output_dir="output/instruction_following/eft/beta1.0/gen"
  ```

  Note that EFT is only applicable when all models share the same vocabulary (a conceptual sketch follows this list).
- Base w/o guidance:

  ```
  PYTHONPATH=. python scripts/instruction_following/cbs.py \
      --rank=1 --world_size=32 \
      --gen.w=1 --gen.k=1 --gen.l=None \
      --model_name="meta-llama/Llama-2-13b-chat-hf" \
      --output_dir="output/instruction_following/base/gen"
  ```
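For intuition about the CBS hyperparameters, here is a hedged sketch of the search loop (not the repository's implementation; `sample_chunk` and `pair_score` are hypothetical stand-ins, and reading W as beam width, K as continuations per hypothesis, and L as chunk length is our interpretation of the flags above). CBS keeps W partial responses, extends each with K sampled chunks of up to L tokens from the base model, and retains the W candidates that the weak tuned/untuned pair scores highest:

```python
# Hypothetical sketch of chunk-level beam search guided by a weak-model scorer.
# sample_chunk() and pair_score() are illustrative stand-ins, not repository APIs.
from typing import Callable, List, Tuple

def chunk_level_beam_search(
    prompt: str,
    sample_chunk: Callable[[str, int], Tuple[str, bool]],  # base model: (continuation, is_finished)
    pair_score: Callable[[str, str], float],                # tuned-vs-untuned log-likelihood gap
    w: int = 2,    # W: hypotheses kept after each step (beam width)
    k: int = 2,    # K: continuations sampled per hypothesis
    l: int = 30,   # L: chunk length in tokens
    max_steps: int = 64,
) -> str:
    beams: List[Tuple[str, bool]] = [("", False)] * w
    for _ in range(max_steps):
        candidates = []
        for text, finished in beams:
            if finished:
                candidates.append((text, True))
                continue
            for _ in range(k):
                chunk, done = sample_chunk(prompt + text, l)
                candidates.append((text + chunk, done))
        # Keep the W candidates the weak pair prefers.
        candidates.sort(key=lambda c: pair_score(prompt, c[0]), reverse=True)
        beams = candidates[:w]
        if all(done for _, done in beams):
            break
    return beams[0][0]
```

With W = 1 and no chunk limit, the same loop degenerates to best-of-K sampling, which is how the BoN commands above reuse `cbs.py`.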
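As for EFT, conceptually (again a hedged sketch rather than the repository's `eft.py`; variable names are illustrative) each decoding step offsets the base model's next-token log-probabilities by beta times the log-probability difference between the tuned and untuned weak models, which is why all three models must share one vocabulary:

```python
# Hypothetical sketch of one EFT decoding step.
# base_logits, tuned_logits, untuned_logits: next-token logits over the SAME vocabulary.
import torch

def eft_next_token(base_logits: torch.Tensor,
                   tuned_logits: torch.Tensor,
                   untuned_logits: torch.Tensor,
                   beta: float = 1.0) -> int:
    # Steer the strong base model with the weak pair's implicit reward,
    # i.e. the log-probability ratio between the tuned and untuned weak models.
    guided = torch.log_softmax(base_logits, dim=-1) + beta * (
        torch.log_softmax(tuned_logits, dim=-1) - torch.log_softmax(untuned_logits, dim=-1)
    )
    probs = torch.softmax(guided, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```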
(Optional) Repeat over all ranks for complete generation results; a subset is usually enough for a sanity check:

```
for rank in $(seq 1 32); do
    PYTHONPATH=. python scripts/instruction_following/cbs.py \
        --rank=${rank} --world_size=32 \
        --gen.w=2 --gen.k=2 --gen.l=30 \
        --model_name="meta-llama/Llama-2-13b-chat-hf" \
        --scorer_name="HuggingFaceH4/zephyr-7b-beta" \
        --output_dir="output/instruction_following/cbs/w2k2l30/gen"
done
```
There are three ways to automatically evaluate the generated responses: 1. GPT-4 (AlpacaEval default), 2. `openbmb/UltraRM-13b`, and 3. `Nexusflow/Starling-RM-34B`:
- GPT-4:

  ```
  PYTHONPATH=. python scripts/instruction_following/eval.py \
      --evaluator_name="GPT-4" \
      --generation_dir="output/instruction_following/cbs/w2k2l30/gen" \
      --evaluation_dir="output/instruction_following/cbs/w2k2l30/eval"

  OPENAI_API_KEY="..." alpaca_eval --model_outputs "output/instruction_following/cbs/w2k2l30/eval/GPT-4/model_outputs.json"
  ```
- `openbmb/UltraRM-13b` and `Nexusflow/Starling-RM-34B`:

  ```
  for evaluator_name in "openbmb/UltraRM-13b" "Nexusflow/Starling-RM-34B"; do
      PYTHONPATH=. python scripts/instruction_following/eval.py \
          --evaluator_name=${evaluator_name} \
          --generation_dir="output/instruction_following/cbs/w2k2l30/gen" \
          --evaluation_dir="output/instruction_following/cbs/w2k2l30/eval"
  done
  ```
To decode models saved locally:

If you do not save models in the default cache directory (e.g., `~/.cache/huggingface`), modify `scripts/configs/local_model_path.yaml` to map each model name to its local path. For example:

```
meta-llama/Meta-Llama-3-8B-Instruct: ~/models/Meta-Llama-3-8B-Instruct
meta-llama/Meta-Llama-3-70B-Instruct: ~/models/Meta-Llama-3-70B-Instruct
```
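The mapping is presumably consulted before loading; a minimal sketch of that idea (assuming the YAML is a flat name-to-path dictionary; the repository's actual resolution logic may differ):

```python
# Hypothetical sketch: resolve a Hugging Face model name to a local path if one is configured.
import os
import yaml

def resolve_model_path(model_name: str,
                       config_path: str = "scripts/configs/local_model_path.yaml") -> str:
    if os.path.exists(config_path):
        with open(config_path) as f:
            local_paths = yaml.safe_load(f) or {}
        if model_name in local_paths:
            return os.path.expanduser(local_paths[model_name])
    return model_name  # fall back to the Hub name / default cache
```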
Out of GPU memory:

To infer a large (70B) model that doesn't fit on a single GPU, run the code as is with multiple GPUs or 4-bit quantization. For example:

```
# Infer on one single GPU
CUDA_VISIBLE_DEVICES=0 python ...

# Infer on one single GPU with 4-bit quant
CUDA_VISIBLE_DEVICES=0 python ... --load_in_4bit=True

# Infer on four GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 python ...

# Infer on four GPUs with 4-bit quant
CUDA_VISIBLE_DEVICES=0,1,2,3 python ... --load_in_4bit=True
```
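For reference, a `--load_in_4bit=True` style flag typically corresponds to bitsandbytes 4-bit loading in `transformers`, and multi-GPU inference to automatic device placement; a hedged sketch of how such loading usually looks (not necessarily the exact code in this repository):

```python
# Hypothetical sketch: load a large model across all visible GPUs, optionally in 4-bit.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_base_model(model_name: str, load_in_4bit: bool = False):
    quant_config = BitsAndBytesConfig(load_in_4bit=True) if load_in_4bit else None
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",          # shard across the GPUs listed in CUDA_VISIBLE_DEVICES
        quantization_config=quant_config,
    )
```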