Rafale provides opinionated scaffolding for training transformers. It is built solely as an efficient learning/research tool; it is not a fully fledged library for large-scale training. Think of it as a starting point for research projects to bootstrap experiments on small LMs: the best way to use rafale is to fork it and build on top of it for your specific purposes. It aims to balance ergonomics with simplicity, and is meant to be easily hackable for research.
torch, composer, datasets, tokenizers
Setup with uv
First, install uv.
$ git clone <repo url>
$ cd rafale
$ uv venv
$ . .venv/bin/activate
$ uv pip install -r cuda-requirements.txt (or cpu-requirements.txt)
$ uv pip install -e .
Launch a run with a configuration.
$ rafale-run test/pythia_tinystories.yaml
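Before launching, you can sanity-check a configuration by loading the YAML yourself. A minimal sketch, assuming only PyYAML; the top-level section names depend on rafale's config schema and are not guaranteed here:

```python
# Peek at a run configuration before launching it (illustrative only).
import yaml

with open("test/pythia_tinystories.yaml") as f:
    cfg = yaml.safe_load(f)

# Top-level sections; the exact keys are defined by rafale's schema.
print(list(cfg))
```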
What if I just want to prepare my dataset? Setting DATA=1 will run the data preparation and caching pipeline without launching the training run.
$ DATA=1 rafale-run test/pythia_tinystories.yaml
What if I want to test my model to make sure that it's learning? Setting DEBUG=1 will run 10 epochs on a single training batch (the same batch is used for train and eval); the model should fit it quickly if there are no bugs in the implementation.
$ DEBUG=1 rafale-run test/pythia_tinystories.yaml
By default, we hash the configuration and use that hash to name checkpoints. If a run with the same hash already exists, we resume it if it failed or was stopped; if it completed successfully, the new run is aborted. To duplicate a run anyway, simply set FORCE=1.
$ FORCE=1 rafale-run test/pythia_tinystories.yaml
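The mechanism is roughly as follows; this is a minimal sketch of the idea, not rafale's actual code, and the function and directory names are hypothetical:

```python
# Sketch: derive a stable run directory from a hash of the configuration,
# so re-launching an identical config finds its previous checkpoints.
import hashlib
import json
from pathlib import Path

def run_dir_for(cfg: dict, root: str = "~/.rafale_cache/checkpoints") -> Path:
    digest = hashlib.sha256(json.dumps(cfg, sort_keys=True).encode()).hexdigest()
    return Path(root).expanduser() / digest[:16]
```

Setting FORCE=1 then amounts to ignoring the existing directory and starting fresh.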
The goal of rafale is to provide a single entry point for data preparation and training: you configure the model and dataset, then launch the training job.
When launching a run, we first execute the datapipeline. If the dataset has already been processed (tokenized, chunked, etc.), it will be loaded from the cache (default location is ~/.rafale_cache).
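A cached dataset can be inspected directly with the datasets library. The directory name below is hypothetical; it follows the {source}_{tokname}_bs{int}_len{int} convention listed in the roadmap:

```python
# Load a cached, preprocessed dataset (the path is illustrative).
from pathlib import Path
from datasets import load_from_disk

cache = Path("~/.rafale_cache").expanduser()
ds = load_from_disk(str(cache / "tinystories_pythia_bs16_len512"))
print(ds)
```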
Note: To add a new model, you need to write a new configuration in rafale/models/configurations.py and add its key to model_config_dict in rafale/main.py.
Look at the ComposerLM wrapper class in rafale/models/decoder.py to check whether all your building blocks are there; otherwise, you may need to modify it or write a new wrapper.
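As a rough illustration, registering a new model could look like this; the class name, fields, and key are all hypothetical, so mirror the existing entries instead:

```python
# rafale/models/configurations.py -- hypothetical new configuration
from dataclasses import dataclass

@dataclass
class MyTinyDecoderConfig:
    vocab_size: int = 50304
    n_layers: int = 6
    n_heads: int = 8
    hidden_dim: int = 512

# rafale/main.py -- register the key so a YAML config can select the model
model_config_dict = {
    # ...existing entries...
    "my-tiny-decoder": MyTinyDecoderConfig,
}
```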
If the dataset is hosted on the Hugging Face Hub, simply use git lfs to clone the repo locally, or use the repo name as the dataset path. The same goes for tokenizers, since we use the Hugging Face tokenizers implementation.
You will need to add a new datapipeline class in rafale/datapipes.py, where the _prepare method performs all data preprocessing (tokenization, chunking, truncation, etc.) EXCEPT padding; padding is performed by the data collator.
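A sketch of what such a class might look like; the base class, tokenizer attribute, and method signature are assumptions, so follow the existing classes in rafale/datapipes.py:

```python
# Hypothetical datapipeline: everything EXCEPT padding happens in _prepare.
class MyClmDatapipe(DataPipeline):  # base class name is an assumption
    def _prepare(self, dataset):
        def tokenize(batch):
            # tokenize and truncate; chunking would also happen here
            return self.tokenizer(batch["text"], truncation=True, max_length=512)

        # padding is intentionally left to the data collator at batch time
        return dataset.map(tokenize, batched=True, remove_columns=["text"])
```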
Append the llm-docprompt.txt file to your favorite LLM and ask away.
| Name    | Implemented | Inference test | Training test |
|---------|-------------|----------------|---------------|
| BERT    | ✅          | ✅             |               |
| RoBERTa | ✅          |                |               |
| Pythia  | ✅          | ✅             | ✅            |
| AIM     | ⏳          |                |               |
v0.1
- single entrypoint CLI
- simple deploy/build
- CPU macOS build - OK, uv run works with this
- local Linux machine - for now, uv for the venv + requirements.txt
- SLURM compute-canada - TBD
- NOTE: because uv does not yet fully play well with PyTorch, we recommend the semi-manual setup above; soon this will no longer be needed
- load weights from safetensors and include it in the config (BERT/RoBERTa and Pythia)
- pythia
- BERT/RoBERTa (need to move from HF to safetensors)
- MLM
- Classification
- Pythia KV-cache implementation
- greedy generation (see the sketch after this list)
- datapipes for CLM and MLM
- local dataloader for now
- CLM tinystories
- MLM tinystories
- IMDB classification
- main.py handles both training and evaluation (together or separately) - Mosaic Composer/Trainer
- fp16
- gradient clipping
- gradient accumulation (automatically handled by composer)
- building blocks are nn.Modules, specific models are ComposerModel classes with methods to load safetensor weights automatically (keep in a single separate file for each model)
- set DEBUG=1 for 1 batch sanity check before launching a run
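For reference, greedy decoding with a KV cache has this general shape. This is a sketch only: rafale's actual forward signature and cache object may differ, and the kv_cache keyword is an assumption:

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids, max_new_tokens=32, eos_id=None):
    # Prefill with the full prompt, then feed one token at a time,
    # reusing the cached keys/values from previous steps.
    cache = None
    tokens = input_ids
    for _ in range(max_new_tokens):
        step_input = tokens if cache is None else tokens[:, -1:]
        logits, cache = model(step_input, kv_cache=cache)  # assumed signature
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if eos_id is not None and bool((next_tok == eos_id).all()):
            break
    return tokens
```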
Datapipelines
- tokenize
- concat and split w/ block size (pad w/ collator)
- save to disk {source}_{tokname}_bs{int}_len{int}
- data_collator: pad (if desired), shift labels for next-token targets, and return torch tensors (see the sketch after this list) # HF does this shift inside the model
- test with model training
- TinyStories, but for MLM also
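A minimal sketch of the pad-then-shift collator described above; this is illustrative, not rafale's collator, and masking by pad id assumes the pad token never appears as a real target:

```python
import torch

def clm_collate(batch, pad_id=0, ignore_index=-100):
    # Pad variable-length sequences to the longest one in the batch.
    max_len = max(len(x["input_ids"]) for x in batch)
    input_ids = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    for i, x in enumerate(batch):
        input_ids[i, : len(x["input_ids"])] = torch.tensor(x["input_ids"])
    # Next-token targets: shift left by one; ignore padding and the last slot.
    labels = torch.full_like(input_ids, ignore_index)
    labels[:, :-1] = input_ids[:, 1:]
    labels[labels == pad_id] = ignore_index
    return {"input_ids": input_ids, "labels": labels}
```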
v1.0
cleanup and additional features
- clean up tests for Pythia and BERT models on TinyStories - move the testing in the notebook to a debug file in the modeling folder
- optimizations: FlashAttention-2, xformers layer_norm (Triton) or RMSNorm, xformers fused_linear_layer
- try out schedulefree, SOAP, and other optimizers
- layerwise decay for fine-tuning (https://kozodoi.me/blog/20220329/discriminative-lr) (see the sketch after this list)
- multimodality AIM (simpler autoregressive training)
- integration with lm-eval-harness ([guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage))
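The layerwise-decay item could look roughly like this; it is a sketch of the standard technique from the linked post, not an existing rafale feature, and all names are illustrative:

```python
# Discriminative learning rates: the last layer gets the base LR and each
# earlier layer's LR is multiplied by an additional decay factor.
def layerwise_param_groups(layers, base_lr=3e-5, decay=0.9):
    groups = []
    for depth, layer in enumerate(reversed(list(layers))):
        groups.append({"params": layer.parameters(),
                       "lr": base_lr * decay**depth})
    return groups

# Usage sketch (assumes model.layers is an iterable of transformer blocks):
# optimizer = torch.optim.AdamW(layerwise_param_groups(model.layers))
```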
For large-scale experiments, other frameworks/libraries exist:
- lingua (FacebookResearch)
- torchtitan (PyTorch)
- torchtune (PyTorch)
- litGPT (Lightning AI)
- GPT-NeoX (EleutherAI)
- nanotron (Hugging Face)
- llm-foundry (MosaicML)