LongLLaMA: Focused Transformer Training for Context Scaling

This directory contains code for instruction/chat tuning of the LongLLaMA models. Using this code, we managed to tune LongLLaMA-3Bv1.1 on a single A100 80GB GPU in 44 hours. For tuning, we used the OpenOrca (instructions) and zetavg/ShareGPT-Processed (chat) datasets. We call the resulting model LongLLaMA-Instruct-3Bv1.1. We provide a Colab demo of the model.

For more about LongLLaMA see the paper Focused Transformer: Contrastive Training for Context Scaling.

Usage

Required packages are listed in requirements.txt.
Example configs are provided in the example scripts referenced below.

To tune the model, simply run one of the scripts from the repo root directory. The tuning process is managed with the Hugging Face Trainer.
For example, to create your own LongLLaMA-Instruct-3Bv1.1 run ./instruction_fine_tuning/example_instchat_ft_3bv1.1_low_budget.sh.
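For orientation, here is a minimal, self-contained sketch of driving such a causal-LM fine-tune with the Hugging Face Trainer. It is not the repository's training script: the Hub model id, the dataset slice, the naive prompt formatting, and all hyperparameters below are illustrative assumptions.

```python
# Minimal sketch of a causal-LM fine-tune with the Hugging Face Trainer.
# Model id, dataset slice, and hyperparameters are illustrative, not the repo's defaults.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "syzymon/long_llama_3b_v1_1"  # assumed Hugging Face Hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-style tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# OpenOrca stores prompts and answers in "question"/"response" columns;
# here they are naively concatenated into one training string.
dataset = load_dataset("Open-Orca/OpenOrca", split="train[:1000]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["question"] + "\n" + ex["response"],
                         truncation=True, max_length=2048),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="longllama_instruct_sketch",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=dataset,
    # Standard causal-LM collation: labels are copies of the (padded) input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The provided scripts configure the equivalent of the above through command-line arguments, so in practice you only need to edit the script rather than write Python.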

Brief description of files

Licensing

The code is available under the Apache License, Version 2.0.
Note that for fine-tuning we used the OpenOrca and zetavg/ShareGPT-Processed datasets. These datasets contain outputs of GPT models, which may affect the licensing of models trained on them.

Misc

Note that the fine-tuning scripts are for models previously fine-tuned with FoT. In particular, we do not use the FoT method during instruction fine-tuning. In order to maintain the model's ability to utilize long context, we randomly decide (for short inputs) how much data will be loaded to memory and how much will stay in the last context window. We achieve this by randomly padding the input. One may think of this as a modified version of FoT without negatives and with only current and previous context.
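The sketch below illustrates this random-padding idea. The helper name, the assumed last-context size of 2048, and the choice to append padding after the content are assumptions made for the example; the repository's actual implementation may differ.

```python
import random

# Illustrative sketch of the random-padding idea described above.
# `last_context_size`, the function name, and the padding direction are assumptions.
def randomly_pad(input_ids, pad_token_id, last_context_size=2048):
    """Append a random amount of padding so that, for a short input, a random-sized
    prefix of the real tokens is pushed out of the last context window (and hence
    loaded into memory), while the rest stays in the last window.
    Returns (padded_input_ids, attention_mask)."""
    pad_len = random.randint(0, last_context_size)
    padded_ids = input_ids + [pad_token_id] * pad_len
    # Padding tokens are never attended to (see the note on attention masking below).
    attention_mask = [1] * len(input_ids) + [0] * pad_len
    return padded_ids, attention_mask
```

With padding appended at the end, a larger pad length pushes a longer prefix of the real tokens out of the last window and into memory, while the pad tokens themselves stay masked.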

The Hugging Face Trainer may pick a logging integration by default. If you run into problems, you can set the logger explicitly by adding --report_to "tensorboard" inside the script.
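For reference, that command-line flag maps onto the report_to field of TrainingArguments; a minimal sketch (the output_dir value is just a placeholder):

```python
from transformers import TrainingArguments

# Equivalent of passing --report_to "tensorboard" on the command line:
# restricts Trainer logging to TensorBoard instead of auto-detected integrations.
args = TrainingArguments(output_dir="output", report_to="tensorboard")
```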

If you plan to use this codebase for different models, please note how the padding is applied. Note also that attention is masked for padding tokens.