LongLLaMA: Focused Transformer Training for Context Scaling

This directory contains code for instruction/chat tuning of the LongLLaMA models. Using this code, we managed to tune LongLLaMA-3Bv1.1 on a single A100 80GB GPU in 44 hours. For tuning, we used the OpenOrca (instructions) and zetavg/ShareGPT-Processed (chat) datasets. We call the resulting model LongLLaMA-Instruct-3Bv1.1. We provide a Colab demo of the model.

For more about LongLLaMA see the paper Focused Transformer: Contrastive Training for Context Scaling.

Usage

Required packages are listed in requirements.txt.
Example configs are provided in the example scripts referenced below.

To tune the model, simply run one of the scripts from the repo root directory. The tuning process is managed with the Hugging Face Trainer.
For example, to create your own LongLLaMA-Instruct-3Bv1.1 run ./instruction_fine_tuning/example_instchat_ft_3bv1.1_low_budget.sh.
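For orientation, here is a minimal, self-contained sketch of driving such a causal-LM fine-tune with the Hugging Face Trainer. It is not the repository's training script: the Hub model id, the dataset slice, the naive prompt formatting, and all hyperparameters below are illustrative assumptions.

```python
# Minimal sketch of a causal-LM fine-tune with the Hugging Face Trainer.
# Model id, dataset slice, and hyperparameters are illustrative, not the repo's defaults.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "syzymon/long_llama_3b_v1_1"  # assumed Hugging Face Hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-style tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# OpenOrca stores prompts and answers in "question"/"response" columns;
# here they are naively concatenated into one training string.
dataset = load_dataset("Open-Orca/OpenOrca", split="train[:1000]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["question"] + "\n" + ex["response"],
                         truncation=True, max_length=2048),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="longllama_instruct_sketch",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=dataset,
    # Standard causal-LM collation: labels are copies of the (padded) input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The provided scripts configure the equivalent of the above through command-line arguments, so in practice you only need to edit the script rather than write Python.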

Brief description of files

Licensing

The code is available under the Apache License, Version 2.0.
Note that for fine-tuning we used the OpenOrca and zetavg/ShareGPT-Processed datasets. These datasets contain outputs of GPT models, which may affect the licensing of models trained on them.

Misc

Note that the fine-tuning scripts are for models previously fine-tuned with FoT. In particular, we do not use the FoT method during instruction fine-tuning. In order to maintain the model's ability to utilize long context, we randomly decide (for short inputs) how much data will be loaded to memory and how much will stay in the last context window. We achieve this by randomly padding the input. One may think of this as a modified version of FoT without negatives and with only current and previous context.
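The sketch below illustrates this random-padding idea. The helper name, the assumed last-context size of 2048, and the choice to append padding after the content are assumptions made for the example; the repository's actual implementation may differ.

```python
import random

# Illustrative sketch of the random-padding idea described above.
# `last_context_size`, the function name, and the padding direction are assumptions.
def randomly_pad(input_ids, pad_token_id, last_context_size=2048):
    """Append a random amount of padding so that, for a short input, a random-sized
    prefix of the real tokens is pushed out of the last context window (and hence
    loaded into memory), while the rest stays in the last window.
    Returns (padded_input_ids, attention_mask)."""
    pad_len = random.randint(0, last_context_size)
    padded_ids = input_ids + [pad_token_id] * pad_len
    # Padding tokens are never attended to (see the note on attention masking below).
    attention_mask = [1] * len(input_ids) + [0] * pad_len
    return padded_ids, attention_mask
```

With padding appended at the end, a larger pad length pushes a longer prefix of the real tokens out of the last window and into memory, while the pad tokens themselves stay masked.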

The Hugging Face Trainer may pick a logging integration by default. If you run into problems, you can set the logger explicitly by adding --report_to "tensorboard" inside the script.
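For reference, that command-line flag maps onto the report_to field of TrainingArguments; a minimal sketch (the output_dir value is just a placeholder):

```python
from transformers import TrainingArguments

# Equivalent of passing --report_to "tensorboard" on the command line:
# restricts Trainer logging to TensorBoard instead of auto-detected integrations.
args = TrainingArguments(output_dir="output", report_to="tensorboard")
```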

If you plan to use this codebase for different models, please note how the padding is applied. Note also that attention is masked for padding tokens.